tazjin / kubernetes-letsencrypt

A Kubernetes controller to retrieve Let's Encrypt certificates based on service annotations (unmaintained)
MIT License

Error creating new authz :: too many currently pending authorizations #72

Open drigz opened 7 years ago

drigz commented 7 years ago

Using kubernetes-letsencrypt v1.7 with Cloud DNS and GKE, we've observed a "too many currently pending authorizations" error. This is surprising, since the limit is 300 pending authorizations, but we only have ~10 certificates on the domain. kubernetes-letsencrypt was previously working fine, but when a new team member tried to bring up their own cluster, they ran into this issue.

On the Let's Encrypt forums, schoen said:

So I think the likeliest interpretation is [...] it sometimes requests an authorization and then does not use it (either requesting an authorization when not requesting a certificate, or requesting an authorization and then crashing or exiting before the corresponding certificate can be requested). This could, for example, be a renewal-related bug if one part of the code says "this certificate should be renewed now" but another part of the code says "this certificate is not yet due for renewal".

and

Maybe this does lead to some useful guidance for client developers: if you get an authz for one requested domain but fail to get it for another, make sure you proactively destroy the first authz before giving up. (If your error was based on repeated failed attempts to get a certificate for a mixture of names you do and don't control, that might be the underlying problem here.)

Is that possible? If we see it again, what can we do to get more debug information?
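schoen's guidance above (proactively destroy authorizations you obtained before giving up) can be sketched generically. The `AcmeSession` interface below is hypothetical, not the acme4j API; in acme4j the analogous call would be `Authorization.deactivate()`:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AuthzCleanupSketch {
    /** Hypothetical stand-in for an ACME client session; not the acme4j API. */
    interface AcmeSession {
        String authorize(String domain) throws Exception;   // returns an authz handle
        void deactivate(String authzHandle);                // cancels a pending authz
    }

    static void authorizeAll(AcmeSession session, Iterable<String> domains) throws Exception {
        Deque<String> obtained = new ArrayDeque<>();
        try {
            for (String domain : domains) {
                obtained.push(session.authorize(domain));
            }
        } catch (Exception e) {
            // One domain failed: deactivate every authz already obtained so it
            // does not linger and count against the pending-authorization limit.
            while (!obtained.isEmpty()) {
                session.deactivate(obtained.pop());
            }
            throw e;
        }
    }
}
```

The key point is that the cleanup runs on the failure path, so a crash-and-retry loop does not leave hundreds of authorizations pending.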

org.shredzone.acme4j.exception.AcmeRateLimitExceededException: Error creating new authz :: too many currently pending authorizations
        at org.shredzone.acme4j.connector.DefaultConnection.createAcmeException(DefaultConnection.java:394)
        at org.shredzone.acme4j.connector.DefaultConnection.accept(DefaultConnection.java:199)
        at org.shredzone.acme4j.Registration.authorizeDomain(Registration.java:189)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.getAuthorization(CertificateRequestHandler.kt:90)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.authorizeDomain(CertificateRequestHandler.kt:68)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.access$authorizeDomain(CertificateRequestHandler.kt:27)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:41)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:27)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
        at java.util.Collections$2.tryAdvance(Collections.java:4717)
        at java.util.Collections$2.forEachRemaining(Collections.java:4725)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
        at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401)
        at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
        at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
        at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.requestCertificate(CertificateRequestHandler.kt:41)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.handleCertificateRequest(ServiceManager.kt:64)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.access$handleCertificateRequest(ServiceManager.kt:20)
        at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager$reconcileService$1.run(ServiceManager.kt:45)
        at java.lang.Thread.run(Thread.java:745)
drigz commented 7 years ago

I've looked through the kubernetes-letsencrypt logs and noticed two things.

One: the CloudDnsResponder threw an exception early on:

Exception in thread "Thread-2" java.lang.UnsupportedOperationException: Empty collection can't be reduced.
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.findMatchingZone(CloudDnsResponder.kt:123)
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.updateCloudDnsRecord(CloudDnsResponder.kt:55)
    at in.tazj.k8s.letsencrypt.acme.CloudDnsResponder.addChallengeRecord(CloudDnsResponder.kt:26)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.prepareDnsChallenge(CertificateRequestHandler.kt:176)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.authorizeDomain(CertificateRequestHandler.kt:77)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.access$authorizeDomain(CertificateRequestHandler.kt:27)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:41)
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler$requestCertificate$1.accept(CertificateRequestHandler.kt:27)
    [SNIP: java.util.stream.*]
    at in.tazj.k8s.letsencrypt.acme.CertificateRequestHandler.requestCertificate(CertificateRequestHandler.kt:41)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.handleCertificateRequest(ServiceManager.kt:64)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager.access$handleCertificateRequest(ServiceManager.kt:20)
    at in.tazj.k8s.letsencrypt.kubernetes.ServiceManager$reconcileService$1.run(ServiceManager.kt:45)
    at java.lang.Thread.run(Thread.java:745)

This appears to be because our Cloud DNS configuration had the wrong zone, so the responder found no matching managed zone to update.
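The `UnsupportedOperationException: Empty collection can't be reduced` is characteristic of calling Kotlin's `Iterable.reduce` on an empty collection, which would happen if `findMatchingZone` reduces the list of candidate zones and no zone matches. A minimal illustration of a safer shape (zone names are made up; the real matching logic lives in CloudDnsResponder.kt):

```java
import java.util.List;
import java.util.Optional;

public class ZoneMatchSketch {
    /** Pick the most specific managed zone whose DNS name is a suffix of the record. */
    static Optional<String> findMatchingZone(List<String> zones, String record) {
        return zones.stream()
                .filter(record::endsWith)
                // Stream.reduce returns Optional, so an empty match list yields
                // Optional.empty() instead of throwing, unlike Kotlin's
                // Iterable.reduce which throws on an empty collection.
                .reduce((a, b) -> a.length() >= b.length() ? a : b);
    }
}
```

Returning an empty Optional lets the caller report "no managed zone matches this domain" instead of crashing the responder thread.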

Two: this error occurred 300 times before the rate limit error took its place. That took about an hour, because the operation is retried very frequently. The retries then continue, producing a rate limit error every 45 seconds or so.

Two things that could help this:
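For instance, backing off exponentially instead of retrying at a fixed short interval would keep a persistently failing challenge from hammering the ACME endpoint. The numbers below are illustrative, not the controller's actual configuration:

```java
public class BackoffSketch {
    /** Delay in seconds before retry attempt n (0-based): base * 2^n, capped. */
    static long delaySeconds(int attempt, long baseSeconds, long capSeconds) {
        // Cap the exponent first so the shift cannot overflow for large attempts.
        long exp = Math.min(attempt, 20);
        return Math.min(baseSeconds << exp, capSeconds);
    }
}
```

With a 45-second base and a one-hour cap, the controller would reach the cap after about seven failures rather than issuing hundreds of requests per hour.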

tazjin commented 7 years ago

Thanks for reporting this, I'll look into handling this more gracefully!

drigz commented 7 years ago

Thanks! FYI, as a workaround, we deleted the letsencrypt-keypair secret. This makes kubernetes-letsencrypt register a new ACME account, which starts with an empty pending-authorization quota.

    kubectl --namespace kube-system delete secret letsencrypt-keypair

drigz commented 7 years ago

Note: Let's Encrypt just enabled pending authorization recycling, which might help avoid this issue:

https://community.letsencrypt.org/t/automatic-recycling-of-pending-authorizations/41321

tazjin commented 7 years ago

Interesting! I started working on the issues you reported yesterday - but time is currently a scarce resource :-)