ory / hydra

The most scalable and customizable OpenID Certified™ OpenID Connect and OAuth Provider on the market. Become an OpenID Connect and OAuth2 Provider over night. Broad support for related RFCs. Written in Go, cloud native, headless, API-first. Available as a service on Ory Network and for self-hosters.
https://www.ory.sh/?utm_source=github&utm_medium=banner&utm_campaign=hydra
Apache License 2.0
15.66k stars 1.5k forks source link

Watching tls cert/key file changes to reload the server automatically #2850

Closed StarAurryon closed 1 year ago

StarAurryon commented 3 years ago

Preflight checklist

Describe your problem

When using hydra within kubernetes and with certmanager the TLS certificates are only valid for 90 days. After 60 days cert manager renew the cert and update the files in the Pod.

I have not found anything in the doc.

Describe your ideal solution

Hydra (but also Kratos) could watch the TLS key and cert files if defined in the config file for change and restart the http server if needed.

Workarounds or alternatives

Restarting the service with an external supervisor.

Version

All

Additional Context

If there is no solution, as always I can provide a PR ;)

aeneasr commented 3 years ago

PRs welcomed :) Technically, I think this probably needs a HTTP server restart which might be difficult (but probably achievable) to implement given the way how we bootstrap.

It will need a lot of tests to ensure the restart happens cleanly (e.g. no left-over config instances, no left-over FS watches, etc).

zepatrik commented 3 years ago

One could also not re-initialize everything, but instead keep the registry and just start a new http.Server that uses the same handlers? That results still in a short down-time I assume, but if you instead keep the same net.Socket and only create a new server it could even happen with zero downtime? Definitely needs some smart ideas.

StarAurryon commented 3 years ago

That results still in a short down-time

I don't know how go handle the closing of the http server. I hope that it finalizes current requests properly. After that clients might have their tls session dropped as the key/cert will be changed. I don't know how this is handled with HTTP2 as multiple requests can use the same TCP session, if i am not wrong.

If you instead keep the same net.Socket

I believe that's the way to go as you don't have to check again if the port is already bound and if Kubernetes check it at a wrong time the service could be considered as down. Its better to keep the port open and delay the request acceptance.

aeneasr commented 3 years ago

Maybe worth a look: https://www.reddit.com/r/golang/comments/9go99a/renewing_ssl_certificates_without_restarting_go/

StarAurryon commented 2 years ago

I have made some research. In my opinion the best way to implement this is to use tls.Config GetCertificate func(ClientHelloInfo) (Certificate, error) https://pkg.go.dev/crypto/tls#Config.

Basically each time a new TLS session is initiated this function is called.

What I propose is to:

  1. Load the cert + key files at startup and crash if it's not ok.
  2. Watch the certificate file with iNotify mechanism.
  3. If changed test it to verify that it is a valid file cert + key files or throw an error message.
  4. Replace the certificate provided by the GetCertificate session if 3 was ok.
github-actions[bot] commented 1 year ago

Hello contributors!

I am marking this issue as stale as it has not received any engagement from the community or maintainers for a year. That does not imply that the issue has no merit! If you feel strongly about this issue

Throughout its lifetime, Ory has received over 10.000 issues and PRs. To sustain that growth, we need to prioritize and focus on issues that are important to the community. A good indication of importance, and thus priority, is activity on a topic.

Unfortunately, burnout has become a topic of concern amongst open-source projects.

It can lead to severe personal and health issues as well as opening catastrophic attack vectors.

The motivation for this automation is to help prioritize issues in the backlog and not ignore, reject, or belittle anyone.

If this issue was marked as stale erroneously you can exempt it by adding the backlog label, assigning someone, or setting a milestone for it.

Thank you for your understanding and to anyone who participated in the conversation! And as written above, please do participate in the conversation if this topic is important to you!

Thank you 🙏✌️