DNS certificates with many names: Faster verification

linsomniac commented 1 month ago

Welcome

[X] Yes, I've searched similar issues on GitHub and didn't find any.

How do you use lego?

Binary

Detailed Description

I'm using the CLI with the AWS route53 provider to request a certificate with 46 names against the Let's Encrypt staging endpoint. It's taking 75 minutes to request the cert.

It looks like it loops over the domains like this:

2024/05/06 11:48:23 [INFO] [foo.example.com] acme: Preparing to solve DNS-01
2024/05/06 11:48:24 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/06 11:48:58 [INFO] [bar.example.com] acme: Preparing to solve DNS-01
[45 more domains]
2024/05/06 12:17:26 [INFO] [foo.example.com] acme: Trying to solve DNS-01
2024/05/06 12:17:26 [INFO] [foo.example.com] acme: Checking DNS record propagation using [8.8.8.8:53]
2024/05/06 12:17:27 [INFO] Wait for propagation [timeout: 2m0s, interval: 1s]
2024/05/06 12:17:44 [INFO] [foo.example.com] The server validated our request
2024/05/06 12:17:44 [INFO] [bar.example.com] acme: Trying to solve DNS-01
[45 more domains]
2024/05/06 12:32:06 [INFO] [foo.example.com] acme: Cleaning DNS-01 challenge
2024/05/06 12:32:07 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/06 12:32:53 [INFO] [bar.example.com] acme: Cleaning DNS-01 challenge
[45 more domains]
2024/05/06 13:04:02 [INFO] [foo.example.com, bar.example.com, ...] acme: Validations succeeded; requesting certificates
2024/05/06 13:04:02 [INFO] Wait for certificate [timeout: 30s, interval: 500ms]
2024/05/06 13:04:03 [INFO] [www.example.com] Server responded with a certificate.

In my case, those domains are in 4-8 different zones.

Before lego we were using certbot via HTTP, and it could create the certs in, from memory, a minute or less. I realize that HTTP is different from Route53.

It seems like this loop is going over each domain and adding the validation name, then waiting for propagation, and doing similar for removing the record. Is there a reason it does this, rather than looping over the domains, adding the TXT records for all of them, THEN looping over them checking for propagation (so they can all propagate in parallel), and similarly for the removal?
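To put rough numbers on the question, here is a toy timing model in Python (not lego code). It assumes a per-record wait of about 38 seconds, which is the average gap in the preparation phase of the log above (11:48:23 to 12:17:26 for 46 names), and compares waiting after every record with creating all records first and letting the waits overlap. The function names are made up for illustration, and the model ignores API rate limits and any ordering constraints between records.

```python
def sequential_wait(waits):
    """Total time when each record is created and waited on before the next one."""
    return sum(waits)


def batched_wait(waits):
    """Total time when all records are created first and the waits overlap."""
    return max(waits, default=0)


# Assumed numbers: 46 names, each needing ~38 s before the change is visible.
waits = [38] * 46

sequential_total = sequential_wait(waits)
batched_total = batched_wait(waits)
print(f"sequential: {sequential_total} s (~{sequential_total / 60:.0f} min)")
print(f"batched:    {batched_total} s")
```

Under these assumptions the sequential total lands close to the 29 minutes the preparation phase took in the log, while the batched total is a single wait.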

ldez commented 1 month ago

Hello,

As far as I remember, certbot doesn't check the propagation.

The current algo:

create all the TXT records
clean all the TXT records

The propagation check (Wait for propagation) is here because some DNS providers are slow to propagate.

The Wait for route53 is required because route53 doesn't apply changes immediately; if we don't wait for that, we risk adding and removing a record at the same time.

You can try to disable the propagation check (Wait for propagation) with the --dns.disable-cp flag.

linsomniac commented 1 month ago

I did a run with "--dns.disable-cp" last night and it took basically the same length of time (15:57 to 17:15, 78 minutes).

Maybe my thinking is unrealistic, but it seems like it should be possible to do this faster; 30 seconds per domain to update DNS seems pretty long. The primary thing is that it's reliable, which it does seem to be. Usually it just runs from cron, so the runtime isn't even visible, but a full respin, especially in an emergency, is a case where faster would be nice.

ldez commented 1 month ago

So maybe the slow part is the Wait for route53 :thinking:

Can you provide the full log?

linsomniac commented 1 month ago

Sent a full log to you on twitter/x

ldez commented 1 month ago

Based on the log, the slow part seems to be Wait for route53:

2024/05/07 10:06:03 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/07 10:06:45 [INFO] [a.example.com] acme: Preparing to solve DNS-01
2024/05/07 10:06:46 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/07 10:07:24 [INFO] [b.example.com] acme: Preparing to solve DNS-01
2024/05/07 10:07:25 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/07 10:08:05 [INFO] [c.example.com] acme: Preparing to solve DNS-01

2024/05/07 10:27:52 [INFO] [a.example.com] acme: Cleaning DNS-01 challenge
2024/05/07 10:27:53 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/07 10:28:28 [INFO] [b.example.com] acme: Cleaning DNS-01 challenge
2024/05/07 10:28:29 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/07 10:29:08 [INFO] [c.example.com] acme: Cleaning DNS-01 challenge
2024/05/07 10:29:09 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/07 10:30:03 [INFO] [d.example.com] acme: Cleaning DNS-01 challenge

I created a branch with a log at the end of the wait, just to be sure. Can you try it?

https://github.com/ldez/lego/tree/wip/route53-debug

linsomniac commented 1 month ago

I sent a link to the full log again on X; it looks like it is failing due to a missing region. Let me see if I can fix that and run it again. I set the region to us-east-1 in the AWS_REGION environment variable. It's running now, I'll give you an update in an hour. :-)

linsomniac commented 1 month ago

I sent another link on X to a gist of the log output of running that branch.

ldez commented 1 month ago

Are you sure you're using my branch? The "End of wait for" logs are missing.

linsomniac commented 1 month ago

Sent you another one, I think this one is correctly built off that branch. Sorry about that.

ldez commented 1 month ago

The logs confirmed my idea: the slow part is the Wait for route53.

2024/05/08 15:54:22 [INFO] [a.example.com] acme: Preparing to solve DNS-01
2024/05/08 15:54:23 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:01 [INFO] End of wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:01 [INFO] [c.example.com] acme: Preparing to solve DNS-01
2024/05/08 15:55:02 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:40 [INFO] End of wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:40 [INFO] [b.example.com] acme: Preparing to solve DNS-01
2024/05/08 15:55:41 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:56:19 [INFO] End of wait for route53 [timeout: 2m0s, interval: 1s]
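The wait durations in the excerpt above can be checked with a short Python sketch. It pairs each "Wait for route53" line with the following "End of wait for route53" line and subtracts the timestamps. The helper name `wait_durations` is made up for this illustration, and the parser assumes the exact line layout shown in the excerpt.

```python
from datetime import datetime

# The wait lines copied from the excerpt above.
LOG = """\
2024/05/08 15:54:23 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:01 [INFO] End of wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:02 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:40 [INFO] End of wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:55:41 [INFO] Wait for route53 [timeout: 2m0s, interval: 1s]
2024/05/08 15:56:19 [INFO] End of wait for route53 [timeout: 2m0s, interval: 1s]
"""


def wait_durations(log):
    """Return the seconds spent in each 'Wait for route53' step."""
    durations = []
    started = None
    for line in log.splitlines():
        # The first 19 characters hold the timestamp, e.g. "2024/05/08 15:54:23".
        stamp = datetime.strptime(line[:19], "%Y/%m/%d %H:%M:%S")
        message = line[20:]
        if message.startswith("[INFO] Wait for route53"):
            started = stamp
        elif message.startswith("[INFO] End of wait for route53") and started is not None:
            durations.append((stamp - started).total_seconds())
            started = None
    return durations


durations = wait_durations(LOG)
print(durations)  # [38.0, 38.0, 38.0]
```

Each wait takes 38 seconds, while the gap between the end of one wait and the start of the next is about one second.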

The wait was introduced in PR #97 because of a bug: https://github.com/go-acme/lego/issues/94#issuecomment-179504193

:thinking: maybe I can add an option to skip this part, but I don't know what the side effects will be.

ldez commented 1 month ago

I updated my branch.

Can you try (with my branch) to set the env var AWS_WAIT_FOR_RECORD_SETS_CHANGED to false?

ldez commented 1 month ago

For the challenge to be validated, the TXT record must be available: Let's Encrypt must be able to retrieve this TXT record with a DNS query.

Waiting for the record set changes is useful because the record is unavailable until the changes are applied.

Not waiting can be a problem, especially with domains in the same zone, because the route53 API requires posting all the records every time. If the first change is not applied, the second change will not use the right information.

There are 3 wait strategies:
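The same-zone problem described above can be illustrated with a toy model in Python. This is not the route53 API or lego's provider code: `ToyZone`, `add_txt`, and the token values are invented for the example, and the case of two challenge values sharing one record name is an assumption chosen to make the effect visible. The model only captures the idea that each change posts the complete value set and that a change is not readable until it has been applied.

```python
class ToyZone:
    """Toy eventually-consistent record store (not the route53 API)."""

    def __init__(self):
        self.applied = {}  # name -> set of TXT values visible to readers
        self.pending = []  # complete value sets submitted but not yet applied

    def read(self, name):
        return set(self.applied.get(name, set()))

    def submit(self, name, values):
        # The caller has to send the complete value set for the name.
        self.pending.append((name, set(values)))

    def apply_pending(self):
        for name, values in self.pending:
            self.applied[name] = values
        self.pending.clear()


def add_txt(zone, name, value, wait):
    values = zone.read(name)  # read what is currently visible
    values.add(value)
    zone.submit(name, values)
    if wait:
        zone.apply_pending()  # stands in for "wait until the change is applied"


def run(wait):
    zone = ToyZone()
    name = "_acme-challenge.example.com"
    add_txt(zone, name, "token-a", wait)
    add_txt(zone, name, "token-b", wait)
    zone.apply_pending()
    return sorted(zone.read(name))


print(run(wait=True))   # ['token-a', 'token-b']
print(run(wait=False))  # ['token-b']
```

Without the wait, the second change is built from a stale read and overwrites the first value once both changes are applied.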

wait a fixed time (it's weak and flaky or slow)
wait by using the route53 API (the current implementation)
wait for the DNS propagation (also available in the current implementation)
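Log lines such as "Wait for route53 [timeout: 2m0s, interval: 1s]" suggest a poll-until-done loop behind the second and third strategies. The following Python sketch shows that general pattern only; it is not lego's implementation. `wait_for` and `check` are hypothetical names, and in the route53 case `check` would stand for a call asking the API whether the change has been applied.

```python
import time


def wait_for(check, timeout=120.0, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call check() every `interval` seconds until it returns True or `timeout` elapses.

    Returns True if check() succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with a stand-in check that succeeds on the third poll.
attempts = iter([False, False, True])
done = wait_for(lambda: next(attempts), timeout=5.0, interval=0.01)
print("change applied" if done else "timed out")
```

A fixed-time wait, the first strategy, corresponds to always sleeping for the full timeout, which is why it is either slow (long sleep) or flaky (short sleep).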

The DNS provider implementation inside lego works by domain without knowledge of the other domains, so it's not possible to group domains to call the route53 API.

The option AWS_WAIT_FOR_RECORD_SETS_CHANGED can be used (to disable the wait for the changes), but I'm afraid that will create major side effects.

linsomniac commented 1 month ago

Ran the new version and sent the output; runtime was down to under 14 minutes. Thanks for your attention on this. I had thought that it was a simple restructuring of the update/wait/verify logic, but it sounds like the DNS provider implementation doesn't lend itself to working that way, which I understand. Thanks for explaining that. I'm going to close this ticket, as it seems like there isn't a reliable solution, though there is an unreliable one for use cases where that's OK.
