gardar closed this issue 2 years ago
I think I've figured this out.
For the DNS propagation check, vault-acme (well, lego actually) does multiple DNS lookups. First it queries the DNS server configured in the OS, then it queries the authoritative DNS server(s) for the domain (Cloudflare in my case). In my environment, outbound DNS requests are only allowed to the DNS servers configured in the OS. That makes the requests to the authoritative DNS servers fail, even though the OS DNS server resolves the acme-challenge record just fine.
The lego CLI has an option for overriding the nameserver:
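To make the two-stage check concrete, here is a rough sketch in Go using only the standard library. It is not lego's actual implementation: `challengeFQDN`, `resolverFor` and `checkPropagation` are hypothetical names, and passing the zone in explicitly is a simplification (a real client has to discover the zone itself). It only illustrates why the second stage fails when the firewall blocks DNS traffic to anything but the OS resolvers.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"strings"
	"time"
)

// challengeFQDN returns the DNS-01 record name for a domain,
// e.g. "example.com" -> "_acme-challenge.example.com.".
func challengeFQDN(domain string) string {
	return "_acme-challenge." + strings.TrimSuffix(domain, ".") + "."
}

// resolverFor returns a resolver that sends every query to the given
// nameserver (host:port) instead of the servers configured in the OS.
func resolverFor(nameserver string) *net.Resolver {
	return &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 5 * time.Second}
			return d.DialContext(ctx, network, nameserver)
		},
	}
}

// checkPropagation (hypothetical) looks the TXT record up through the OS
// resolver first, then asks each authoritative nameserver of zone directly.
// The second stage is the one a restrictive outbound firewall breaks.
func checkPropagation(ctx context.Context, zone, domain string) error {
	fqdn := challengeFQDN(domain)
	if _, err := net.DefaultResolver.LookupTXT(ctx, fqdn); err != nil {
		return fmt.Errorf("lookup via OS resolver: %w", err)
	}
	nameservers, err := net.DefaultResolver.LookupNS(ctx, zone)
	if err != nil {
		return fmt.Errorf("NS lookup for %s: %w", zone, err)
	}
	for _, ns := range nameservers {
		addr := net.JoinHostPort(strings.TrimSuffix(ns.Host, "."), "53")
		if _, err := resolverFor(addr).LookupTXT(ctx, fqdn); err != nil {
			return fmt.Errorf("lookup via authoritative server %s: %w", addr, err)
		}
	}
	return nil
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: check <zone> <domain>")
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := checkPropagation(ctx, os.Args[1], os.Args[2]); err != nil {
		fmt.Println("propagation check failed:", err)
		return
	}
	fmt.Println("record visible on all checked servers")
}
```

Running this on the Vault host with the zone and domain as arguments should show which stage fails: an error from the OS resolver points at the record itself, while an error from an authoritative server points at blocked outbound DNS.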
--dns.resolvers value Set the resolvers to use for performing recursive DNS queries.
and, digging through the vault-acme code, I found that the same functionality is actually implemented in the form of an env var:
https://github.com/remilapeyre/vault-acme/blob/e109bbe090072c41d5ca5c86f683ae4399d2213c/acme/client.go#L42-L46
I tried setting that environment variable to the same nameserver that I have configured in the OS on the Vault server, and now everything works as expected!
/lib/systemd/system/vault.service
[Service]
Environment="LEGO_TEST_NAMESERVER=8.8.8.8:53"
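For anyone wondering how a variable like this is typically consumed, the following is a small Go sketch of reading such an override. It is not the vault-acme source linked above: `nameserverOverride` is a hypothetical helper, and defaulting to port 53 when none is given is an assumption made for the example, not confirmed plugin behaviour.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// envNameserver is the environment variable named in this thread.
const envNameserver = "LEGO_TEST_NAMESERVER"

// nameserverOverride (hypothetical) reads the override through getenv and
// returns it as host:port. ok is false when the variable is unset or empty.
// Appending port 53 when no port is present is an assumption of this sketch.
func nameserverOverride(getenv func(string) string) (addr string, ok bool) {
	v := getenv(envNameserver)
	if v == "" {
		return "", false
	}
	if _, _, err := net.SplitHostPort(v); err != nil {
		v = net.JoinHostPort(v, "53")
	}
	return v, true
}

func main() {
	if addr, ok := nameserverOverride(os.Getenv); ok {
		fmt.Println("propagation checks would use", addr)
	} else {
		fmt.Println("no override set, default lookup behaviour applies")
	}
}
```

Taking `getenv` as a parameter keeps the helper easy to exercise without touching the real process environment.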
It might be good to document this somewhere or perhaps allow setting the nameserver option in the acme account in addition to the env var.
Thanks for finding the root cause of this issue, @gardar! I will add a new parameter so that this can be set through the API rather than having to use an environment variable.
Note that you will have to update your configuration when updating the plugin to the next version.
No problem, glad I could help!
I'm having some issues getting the plugin to run properly. This is the error message I get when writing to acme/certs:
I suspect this error has something to do with DNS propagation. I'm using the Cloudflare provider and I've tried various settings, but I keep getting stuck with this error. The only success I've had is when I did the following:
1. Set CLOUDFLARE_TTL:600
2. Ran vault write acme/certs/... (which failed with the above error)
3. Set ignore_dns_propagation=true for the account
4. Ran vault write acme/certs/... again, which now successfully returns a certificate.
I've tried lowering CLOUDFLARE_POLLING_INTERVAL and raising CLOUDFLARE_PROPAGATION_TIMEOUT, but the result is the same. And if I run the initial request with ignore_dns_propagation=true, I receive this error from acme:
I'm not sure if the plugin is doing its own DNS lookups to upstream servers or if it's using the ones from the OS on the Vault server, but I made sure caching is disabled in systemd-resolved on the Vault server and even tried switching the OS DNS over to 1.1.1.1.
I set the vault log level to debug but the vault logs don't seem to provide anything useful.
I've noticed that the failure always seems to occur after 90 seconds, no matter how I configure CLOUDFLARE_POLLING_INTERVAL, CLOUDFLARE_PROPAGATION_TIMEOUT, CLOUDFLARE_TTL, CLOUDFLARE_HTTP_TIMEOUT, and even when setting VAULT_CLIENT_TIMEOUT=300s. Can I somehow increase this timeout or configure a retry value or something? Although I'm not sure if that would fix the issue.