Default MAX_ELAPSED_TIME to 24h.

jetstack-bot commented 6 years ago

Hi @jsha. Thanks for your PR.

I'm waiting for a jetstack member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/devel/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

munnerz commented 6 years ago

So after digging into this a bit more, I don't think this will achieve quite what we want.

kube-lego doesn't actually use a rate limited workqueue when reading work items off its internal queue. This means as soon as a hard failure has occurred, it will be retried within a couple of seconds.

The rate limit that's being changed here, is used here: https://github.com/jetstack/kube-lego/blob/10b552655035a56cab47cb3a74a52fc113c53a6b/pkg/acme/cert_request.go#L142

As you can see, adjusting this backoff will actually just cause us to run 'verifyAuthz' at greater intervals. This will help slow down requests, but at the cost of blocking up processing of all other resources.

Instead, I think we should not accept this change, and instead convert the workqueue used internally to be a rate limited queue, and set the parameters you've put forward in this PR here, on that queue instead.

I'll create a PR with that change and we can go from there.

jsha commented 6 years ago

Thanks for the reply! I read up a little on the backoff package, and I realized its behavior is not what I expected. For instance, with a MAX_ELAPSED_TIME of 24h, it will back off until it reaches 24 hours, and then stop, which causes Acme.ObtainCertificate to return an error, which in turn causes KubeLego.processProvider to proceed to the next iteration of the loop; then the next loop attempt will start at the min backoff value again.

This isn't the behavior we'd like. What we'd like is for clients to back off until they hit 24 hours, and then remember that backoff interval until they get a success. In other words, we have many many clients that fail every time they attempt issuance. We want those clients to never try more than once every 24 hours.

I agree this isn't the right fix. It seems like what's really needed is for KubeLego.processProvider to have a separate backoff for each Provider, and remember that backoff over time.

As a stopgap solution (because this is a time-sensitive issue for us), I'm going to submit a PR that just adds a plain sleep to KubeLego.processProvider for now.

jetstack / kube-lego

Default MAX_ELAPSED_TIME to 24h. #326