go-acme / lego

Let's Encrypt/ACME client and library written in Go
https://go-acme.github.io/lego/
MIT License

gcloud provider needs workaround for inconsistent nameserver results #770

Open dhduvall opened 5 years ago

dhduvall commented 5 years ago

I've filed an issue with Google (https://issuetracker.google.com/issues/123397631), but lego probably needs a workaround for the problem. In summary: even after all of a domain's nameservers have responded with the correct data (thus satisfying lego's pre-check routines), one or more of them may revert to responding with old data or NXDOMAIN. They eventually settle down after a (potentially unbounded?) amount of time.

I'm not sure of the best way to add extra time to the pre-check method. Because checkDNSPropagation() isn't exported, I can't simply create a pre-check function that calls it first and then waits (or continues to check for a while). Simply trying again isn't a great option either, since there's no way to deactivate the authorization from this side of the API: as best I can tell, the error you get back from Obtain() gives no access to the authorization URI, and there's no way to construct it from the ObtainRequest. I accidentally ended up maxing out the authorizations rate limit figuring this out (thankfully I kept the logs containing the URIs).

I'm happy to put a fix together, but would appreciate some direction.

tjhiggins commented 4 years ago

@dhduvall Are you still having this issue? Curious if you think it's related to: https://github.com/go-acme/lego/issues/1087

dhduvall commented 4 years ago

It might be: lego could be seeing success after those 50 seconds based on the individual servers that happened to respond to its request, but when LE checks, it reaches servers that don't have the new data yet and fails the request.

That said, IIRC, I rarely saw lego think the propagation was complete before the timeout. Maybe GCP's architecture is a bit different now, and that's why this is happening?

I haven't seen it myself since I added the workaround with WrapPreCheck().

And I'm still mystified as to why people don't see this when running certbot against GCP. Maybe most people aren't doing the DNS verification.