letsencrypt / pebble

A miniature version of Boulder, Pebble is a small RFC 8555 ACME test server not suited for a production certificate authority.
Mozilla Public License 2.0

System DNS resolver caching fails DNS-01 challenges #118

Closed munnerz closed 6 years ago

munnerz commented 6 years ago

After fixing up some e2e tests in cert-manager, I've realised Pebble is not validating DNS01 domains correctly.

My suspicion is that this is caused by the use of net.LookupTXT in va: https://github.com/letsencrypt/pebble/blob/master/va/va.go#L270
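For context, the comparison a dns-01 validator makes once it has the TXT records is small. The following is a minimal sketch of that check as RFC 8555 section 8.4 describes it, not Pebble's actual code; the function names and the sample key authorization are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// expectedTXT returns the TXT value RFC 8555 section 8.4 asks for in a
// dns-01 challenge: the unpadded base64url encoding of the SHA-256 digest
// of the key authorization.
func expectedTXT(keyAuthorization string) string {
	sum := sha256.Sum256([]byte(keyAuthorization))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// hasExpectedTXT reports whether any of the returned TXT records matches
// the expected digest. (Hypothetical helper, named for this sketch only.)
func hasExpectedTXT(records []string, keyAuthorization string) bool {
	want := expectedTXT(keyAuthorization)
	for _, r := range records {
		if r == want {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical key authorization; a real one has the form
	// "<token>.<account key thumbprint>".
	keyAuth := "example-token.example-thumbprint"

	// In a validator these would come from a lookup such as
	// net.LookupTXT("_acme-challenge." + domain).
	records := []string{"unrelated-value", expectedTXT(keyAuth)}

	fmt.Println("challenge satisfied:", hasExpectedTXT(records, keyAuth))
}
```

If the resolver hands back a stale or empty answer, `records` never contains the digest and the check fails no matter what the authoritative servers hold.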

From what I understand, Boulder picks a random authoritative nameserver to query when performing the DNS01 validation. I have therefore designed cert-manager to check each authoritative NS for the expected TXT record before actually accepting the challenge.
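A rough sketch of that kind of pre-check, assuming the list of authoritative nameservers is already known. This is not cert-manager's code: every name below is hypothetical, and `directLookup` relies on the standard library's `net.Resolver` with a custom `Dial` to send the query to one chosen server.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// txtLookup fetches the TXT records for fqdn from one specific nameserver.
type txtLookup func(nameserver, fqdn string) ([]string, error)

// allNameserversHaveTXT reports whether every listed nameserver already
// serves the wanted TXT value. Any lookup error counts as "not ready".
func allNameserversHaveTXT(lookup txtLookup, nameservers []string, fqdn, want string) bool {
	if len(nameservers) == 0 {
		return false
	}
	for _, ns := range nameservers {
		records, err := lookup(ns, fqdn)
		if err != nil || !contains(records, want) {
			return false
		}
	}
	return true
}

func contains(records []string, want string) bool {
	for _, r := range records {
		if r == want {
			return true
		}
	}
	return false
}

// directLookup sends the query straight to nameserver ("host:port")
// instead of the servers listed in the system resolver configuration.
func directLookup(nameserver, fqdn string) ([]string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	r := &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so Dial is honoured
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, network, nameserver)
		},
	}
	return r.LookupTXT(ctx, fqdn)
}

// staticLookup is an in-memory stand-in for directLookup, so the example
// runs without network access.
func staticLookup(data map[string][]string) txtLookup {
	return func(nameserver, _ string) ([]string, error) {
		return data[nameserver], nil
	}
}

func main() {
	// Hypothetical nameservers; ns2 has not picked up the record yet.
	served := map[string][]string{
		"ns1.example.net:53": {"expected-digest"},
		"ns2.example.net:53": {},
	}
	nameservers := []string{"ns1.example.net:53", "ns2.example.net:53"}

	// Pass directLookup instead of staticLookup(served) to query real servers.
	ready := allNameserversHaveTXT(staticLookup(served), nameservers,
		"_acme-challenge.example.com.", "expected-digest")
	fmt.Println("ready:", ready)
}
```

Passing this check only proves the authoritative servers are up to date. It says nothing about what a recursive resolver sitting in front of the validator has cached.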

I appear to be hitting L272 (https://github.com/letsencrypt/pebble/blob/master/va/va.go#L272), even after waiting an additional 60s after the DNS record has propagated to all authoritative nameservers.

Am I correct in thinking we shouldn't expect recursive servers to be up to date? This sounds very difficult to achieve cleanly 😃

FWIW, I have changed the e2e test to use the letsencrypt staging endpoint, and it passes fine (hence my suspicions!)

shred commented 6 years ago

If it is any help: I run integration tests against Pebble in my project, and my dns-01 tests succeed. So generally, it seems to work. However I am using a tiny and very simple DNS server, nothing sophisticated.

munnerz commented 6 years ago

Is that DNS server also authoritative for your hostname under test?

I'm using system default recursive nameservers during my tests, and updating a real cloudflare dns zone in order to perform validations. I suspect this is the issue, as I'll be using some random recursive DNS resolvers that are honouring the NXDOMAIN ttl (as the domain I am using is a randomly generated subdomain).
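To illustrate the suspected failure mode, here is a toy model of negative caching, assuming a resolver that remembers "no such name" answers for a fixed number of seconds. The type, the names, and the 300 second value are all made up for the sketch; real resolvers derive the negative TTL from the zone's SOA record (RFC 2308).

```go
package main

import "fmt"

// cachingResolver is a toy recursive resolver that caches only negative
// answers. Times are plain seconds so the example stays deterministic.
type cachingResolver struct {
	zone        map[string][]string // stands in for the authoritative data
	negativeTTL int64               // seconds a "not found" answer is kept
	negUntil    map[string]int64    // name -> time the cached negative expires
}

func newCachingResolver(zone map[string][]string, negativeTTL int64) *cachingResolver {
	return &cachingResolver{
		zone:        zone,
		negativeTTL: negativeTTL,
		negUntil:    make(map[string]int64),
	}
}

// lookupTXT answers from the negative cache while an entry is still valid;
// otherwise it consults the zone and caches a "not found" result.
func (r *cachingResolver) lookupTXT(name string, now int64) ([]string, bool) {
	if until, cached := r.negUntil[name]; cached {
		if now < until {
			return nil, false // still serving the cached negative answer
		}
		delete(r.negUntil, name)
	}
	records, ok := r.zone[name]
	if !ok {
		r.negUntil[name] = now + r.negativeTTL
		return nil, false
	}
	return records, true
}

func main() {
	zone := map[string][]string{}
	r := newCachingResolver(zone, 300) // hypothetical 300s negative TTL
	name := "_acme-challenge.rand123.example.com."

	// t=0: something queries the name before the record exists.
	_, found := r.lookupTXT(name, 0)
	fmt.Println("t=0   found:", found)

	// The record is then published on the authoritative side.
	zone[name] = []string{"expected-digest"}

	// t=60: the resolver still answers from its negative cache.
	_, found = r.lookupTXT(name, 60)
	fmt.Println("t=60  found:", found)

	// t=300: the cached negative has expired, so the record is visible.
	_, found = r.lookupTXT(name, 300)
	fmt.Println("t=300 found:", found)
}
```

In this model an extra 60s wait does not help if the name was queried before the record existed and the negative TTL is longer than that.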

shred commented 6 years ago

I run Pebble in a docker container, and my test DNS server in a second docker container. The /etc/resolv.conf of Pebble's container points to that DNS server's IP, so it is the only way for Pebble to resolve domains. This way I can also run http-01 validations for fake domains like example.com.

The server only does a minimal job. It reacts to A and TXT queries, and sends the responses that I have previously set. No TTL, no recursions.

I only mention this so you know that Pebble's dns-01 validation is not generally broken, but I don't want to rule out that there might be an issue.

cpu commented 6 years ago

> I'm using system default recursive nameservers during my tests, and updating a real cloudflare dns zone in order to perform validations. I suspect this is the issue, as I'll be using some random recursive DNS resolvers that are honouring the NXDOMAIN ttl (as the domain I am using is a randomly generated subdomain).

Yup, that's the issue :-) I think we should do a better job of documenting this Pebble gotcha.

Boulder and the Let's Encrypt prod/staging stack use an Unbound instance to do the heavy lifting for DNS. We run a configuration (basically identical to this) that sets a very low max TTL to avoid caching problems for those environments. Boulder's test environment uses a fake recursive resolver that returns fibs. In both cases Boulder uses miekg/dns to talk to the specifically configured resolver (the fake one or the Unbound instance).

Ideally Pebble could be changed to do something similar: config would point Pebble's DNS requests to a fake or otherwise customized recursive DNS server. @shred and I chatted about that way back in Jul 2017 in #33. Unfortunately my conclusion at the time was that it would mean pulling miekg/dns into Pebble and writing a lot more custom DNS code. Presently (as you noted) Pebble uses net.LookupTXT from the stdlib, and Go uses the system DNS resolver unconditionally.
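As a rough illustration of what "config would point Pebble's DNS requests" at a chosen server could look like with only the standard library: the sketch below assumes a Go version where `net.Resolver` has a `Dial` field (added in Go 1.9). `resolverFor` and the address are hypothetical and are not part of Pebble.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolverFor returns the default resolver when no address is configured,
// and otherwise a resolver whose DNS traffic goes to server ("host:port").
// Hypothetical helper for this sketch only.
func resolverFor(server string) *net.Resolver {
	if server == "" {
		return net.DefaultResolver
	}
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver so Dial is honoured
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			// Ignore the address taken from the system configuration and
			// always talk to the configured server instead.
			var d net.Dialer
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	// Assumed address of a test DNS server; adjust to your environment.
	r := resolverFor("127.0.0.1:5353")

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// The trailing dot marks the name as fully qualified.
	txts, err := r.LookupTXT(ctx, "_acme-challenge.example.com.")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("TXT records:", txts)
}
```

This only redirects queries. It does not by itself give the finer control over queries and responses that a library such as miekg/dns offers.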

I think a solution like the one @shred arrived at, where you find a way to unobtrusively configure the system DNS for your integration tests with ✨ Container Magic ✨, is the best path forward (edit: at least for the short term, until there's time for more involved Pebble DNS rework).

cpu commented 6 years ago

I put out a PR to clarify some of Pebble's limitations, including this system DNS resolver "gotcha": https://github.com/letsencrypt/pebble/pull/123

I'm going to close this issue for now since the problem is a known limitation with Pebble. I'll leave https://github.com/letsencrypt/pebble/issues/33 open for tracking more intensive work to integrate more complex DNS handling.

Thanks!