remilapeyre / vault-acme

Mozilla Public License 2.0

Code: 500. Errors: * rpc error: code = Canceled desc = context canceled #30

Closed gardar closed 2 years ago

gardar commented 2 years ago

I'm having some issues getting the plugin to run properly. This is the error message I get when writing to acme/certs:

$ vault write acme/certs/example.com common_name=subdomain.example.com
Error writing data to acme/certs/example.com: Error making API request.

URL: PUT https://vault.example.com/v1/acme/certs/example.com
Code: 500. Errors:

* 1 error occurred:
        * rpc error: code = Canceled desc = context canceled

I suspect this error has something to do with DNS propagation. I'm using the cloudflare provider and I've tried various settings, but I keep getting stuck with this error. The only success I've had is when I did the following:

  1. Set the account provider_configuration to CLOUDFLARE_TTL:600
  2. Ran vault write acme/certs/... (which failed with the above error)
  3. Set ignore_dns_propagation=true for the account
  4. Ran vault write acme/certs/... again which now successfully returns a certificate.

I've tried lowering CLOUDFLARE_POLLING_INTERVAL and raising CLOUDFLARE_PROPAGATION_TIMEOUT, but the result is the same. And if I run the initial request with ignore_dns_propagation=true, I receive this error from ACME:

Error writing data to acme/certs/example.com: Error making API request.

URL: PUT https://vault.example.com/v1/acme/certs/example.com
Code: 400. Errors:

* Failed to validate certificate signing request: error: one or more domains had a problem:
[subdomain.example.com] acme: error: 400 :: urn:ietf:params:acme:error:dns :: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.subdomain.example.com - check that a DNS
record exists for this domain, url:

I'm not sure if the plugin is doing its own DNS lookups to upstream servers or if it's using the ones from the OS on the Vault server, but I made sure caching is disabled in systemd-resolved on the Vault server and even tried switching the OS DNS over to 1.1.1.1.

I set the vault log level to debug but the vault logs don't seem to provide anything useful.

Mar 23 19:31:30 vault.example.com sh[44077]: 2022-03-23T19:31:30.462Z [INFO]  secrets.acme.acme_04e38bd1.acme.acme-plugin: Updating account: timestamp=2022-03-23T19:31:30.462Z
Mar 23 19:31:30 vault.example.com sh[44077]: 2022-03-23T19:31:30.794Z [INFO]  secrets.acme.acme_04e38bd1.acme.acme-plugin: Saving account: timestamp=2022-03-23T19:31:30.794Z
Mar 23 19:31:42 vault.example.com sh[44077]: 2022-03-23T19:31:42.467Z [DEBUG] secrets.acme.acme_04e38bd1.acme.acme-plugin: Validate names: names=[subdomain.example.com] role="map[Account:example AllowBareDomains:true AllowSubdomains:true AllowedDomains:[example.com] CacheForRatio:70 DisableCache:false]" timestamp=2022-03-23T19:31:42.467Z
Mar 23 19:31:42 vault.example.com sh[44077]: 2022-03-23T19:31:42.468Z [DEBUG] secrets.acme.acme_04e38bd1.acme.acme-plugin: Look in the cache for a saved cert: timestamp=2022-03-23T19:31:42.468Z
Mar 23 19:31:42 vault.example.com sh[44077]: 2022-03-23T19:31:42.468Z [DEBUG] secrets.acme.acme_04e38bd1.acme.acme-plugin: Certificate not found in the cache: timestamp=2022-03-23T19:31:42.468Z
Mar 23 19:31:42 vault.example.com sh[44077]: 2022-03-23T19:31:42.468Z [DEBUG] secrets.acme.acme_04e38bd1.acme.acme-plugin: Contacting the ACME provider to get a new certificate: timestamp=2022-03-23T19:31:42.468Z

I've noticed that the failure always seems to occur after 90 seconds, no matter how I configure CLOUDFLARE_POLLING_INTERVAL, CLOUDFLARE_PROPAGATION_TIMEOUT, CLOUDFLARE_TTL, CLOUDFLARE_HTTP_TIMEOUT and even when setting VAULT_CLIENT_TIMEOUT=300s. Can I somehow increase this timeout or configure a retry value or something? Although I'm not sure if that would fix the issue.

gardar commented 2 years ago

I think I've figured this out.

For the DNS propagation check, vault-acme (well, lego actually) does multiple DNS lookups. First it tries the DNS server configured in the OS, then it does a lookup against the authoritative DNS server(s) configured for the domain (Cloudflare in my case). In my environment, outbound DNS requests are only allowed to the DNS servers that are configured in the OS. This causes requests to the authoritative DNS server to fail, even though the OS DNS server resolves the _acme-challenge record just fine.
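
To make that failure mode concrete, here is a rough model of the two-stage check as I understand it. This is not lego's actual code: the function names, the fake lookup helper, and the server addresses (10.0.0.53:53, ns.example.net:53) are all made up for illustration, and the real implementation may differ in its details.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical type: asks one DNS server for the TXT values of a name and
# raises TimeoutError when that server cannot be reached.
TxtLookup = Callable[[str, str], List[str]]


def propagation_check(
    fqdn: str,
    expected: str,
    recursive_server: str,
    authoritative_servers: Iterable[str],
    lookup_txt: TxtLookup,
) -> bool:
    """Return True only if every consulted server returns the expected TXT value."""
    # Stage 1 is the resolver configured in the OS, stage 2 is each
    # authoritative nameserver of the zone, queried directly.
    for server in [recursive_server, *authoritative_servers]:
        try:
            if expected not in lookup_txt(server, fqdn):
                return False
        except TimeoutError:
            # A blocked outbound query looks the same as "not propagated yet".
            return False
    return True


def make_fake_lookup(reachable: Set[str], records: Dict[str, List[str]]) -> TxtLookup:
    """Stand-in for real DNS: only servers listed in `reachable` answer."""
    def lookup(server: str, name: str) -> List[str]:
        if server not in reachable:
            raise TimeoutError(f"no route to {server}")
        return records.get(name, [])
    return lookup


name = "_acme-challenge.subdomain.example.com."
records = {name: ["token"]}

# Firewall only lets queries through to the OS resolver.
firewalled = make_fake_lookup({"10.0.0.53:53"}, records)
# No firewall: the authoritative server is reachable as well.
open_network = make_fake_lookup({"10.0.0.53:53", "ns.example.net:53"}, records)

blocked = propagation_check(name, "token", "10.0.0.53:53", ["ns.example.net:53"], firewalled)
allowed = propagation_check(name, "token", "10.0.0.53:53", ["ns.example.net:53"], open_network)
print(blocked, allowed)  # False True
```

In the firewalled case the record exists and the OS resolver sees it, but the check still fails because the direct query to the authoritative server never gets an answer. That matches what I was seeing.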

The lego CLI has an option for overriding the nameserver, --dns.resolvers value ("Set the resolvers to use for performing recursive DNS queries."), and digging through the vault-acme code I found that the same function is actually implemented in the form of an env var: https://github.com/remilapeyre/vault-acme/blob/e109bbe090072c41d5ca5c86f683ae4399d2213c/acme/client.go#L42-L46

I tried setting that environment variable to the same nameserver as I have configured in the OS on the Vault server and now everything works as expected!

/lib/systemd/system/vault.service

[Service]
Environment="LEGO_TEST_NAMESERVER=8.8.8.8:53"

It might be good to document this somewhere or perhaps allow setting the nameserver option in the acme account in addition to the env var.

remilapeyre commented 2 years ago

Thanks for finding the root cause of this issue, @gardar! I will add a new parameter so that this can be set using the API rather than having to use an environment variable.

Note that you will have to update your configuration when updating the plugin to the next version.

gardar commented 2 years ago

No problem, glad I could help!