srvrco / getssl

Obtain free SSL certificates from the Let's Encrypt ACME server. Suitable for automating the process on remote servers.
GNU General Public License v3.0

Errors during challenges may leave pending authorizations #695

Open lodott72 opened 3 years ago

lodott72 commented 3 years ago

Describe the bug

When a request with a large number of domains is processed, an early verification error will error_exit and apparently leave pending authorizations. If this is repeated a few times, the rate limit of 300 pending authorizations per account may be reached quickly.

Such pending authorizations may count towards the account-based limit even if the staging CA is used. I was working exclusively with the staging CA, yet hit the limit as soon as I switched to the production CA - I couldn't possibly have reached the limit with the first production request alone (though it did contain about 70 domains). With a new account, there were no rate problems.

https://letsencrypt.org/docs/rate-limits/ explicitly says to use the staging CA to avoid hitting that limit, but I am 100% sure that I did exactly that - I left the CA setting on its default (staging) until just before the limit was reached.

To Reproduce

Not sure. I tried a fake request against staging with close to 100 domains and a fresh account, aborting immediately after the challenge tokens had been fetched for each domain. After putting 300+ challenges into pending status that way, the rate limit for pending authorizations did not trigger for either CA. Might need more runs, though - can't really say.

Expected behavior

I guess the pending domain challenges that are skipped on an error_exit during check_challenge_completion() should be processed in some way rather than left pending. Tools like clear_authz invalidate such pending authorizations based on logs, as far as I understand, to address the pending limit.
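For reference, RFC 8555 (section 7.5.2) lets a client deactivate an authorization it no longer intends to fulfil by POSTing a status change to the authorization URL. A minimal sketch, assuming a getssl-style send_signed_request helper for the JWS-signed POST (the helper name and URL are illustrative):

```sh
# Sketch only: deactivate a pending authorization per RFC 8555 section 7.5.2.
# send_signed_request is assumed to wrap the JWS signing and the curl POST.
authz_url="https://acme-staging-v02.api.letsencrypt.org/acme/authz-v3/123456"  # example URL
send_signed_request "$authz_url" '{"status": "deactivated"}'
```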

Operating system (please complete the following information):

Additional context

For my test I modified the code to continue processing domains when certain verification errors based on propagation time were encountered. The goal was to finish the simple cases without having to wait for the problem cases to be resolved. But it basically achieves by accident what I think is needed here.

My changes are rather quick and dirty, so I wouldn't dare to put them in a pull request and prefer to summarize the changes - they are fairly straightforward:

BTW, I also added an "i of n" counter at the start of each domain's output, to better see the progress - not related to the problem, but possibly a useful addition in general?
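Something along these lines, assuming the per-domain loop iterates over an array of domains (names here are illustrative, not necessarily getssl's actual variables):

```sh
# Sketch: "i of n" progress output around a per-domain loop.
total=${#alldomains[@]}
i=0
for d in "${alldomains[@]}"; do
  i=$((i + 1))
  echo "[$i of $total] processing domain $d"
  # ... fetch token, deploy challenge, request validation ...
done
```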

tlhackque commented 3 years ago

Large numbers of domains in a single certificate result in large certificates. Although many CAs allow 35-250 names per certificate, this isn't good for performance of the service; Let's Encrypt limits you to 100 names per certificate. In any case, large certificates have to be sent over the wire, and at the client, decoded and a linear search performed to match the names. It adds up. Unless all the domains are hosted on a single webserver, you also end up with multiple copies of the private key; if any one is compromised, all the servers are. As Let's Encrypt says: "For performance and reliability reasons, it's better to use fewer names per certificate whenever you can."

Since you mention propagation times, I assume you're using DNS validation, with a provider that is slow and/or has long TTLs. You might consider using http-01 validation instead. Then the propagation time issue goes away - the tokens are installed with cp/ssh/scp/ftp...

A complete solution has to deal with more than one run of getssl - it, and/or the machine it's running on, can crash or lose network connectivity. So an in-memory, array-based solution is inadequate.

The "right" answer is more like having getssl put authorization requests into persistent storage (a file), leave tokens in place on failure, and try triggering a validation for stored requests before creating a new one. This will either succeed or fail -- either way clearing the pending authorization. If there is a pending authorization that succeeds, it also avoids adding and removing the tokens each time - which for DNS isn't cheap and triggers the propagation delay again. It has the advantage that you can run getssl an hour or two later and getssl would pick up where it left off - with the old authorization request(s). That should handle even the slowest DNS updates..

However, this is a significant change to the logic and also requires a (writable) persistent directory (e.g. /var/state/getssl), which would be a new requirement.
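A minimal sketch of that idea - the state directory, file layout, and reuse of getssl's send_signed_request helper are assumptions for illustration, not existing getssl behaviour:

```sh
# Sketch only - not getssl's current behaviour.
STATE_DIR="/var/state/getssl/${DOMAIN}"   # assumed persistent, writable directory
mkdir -p "$STATE_DIR"

# Remember each authorization URL as soon as new-order returns it.
echo "$authz_url" >> "$STATE_DIR/pending_authz"

# On a later run, retry stored authorizations before creating new orders.
while read -r authz_url; do
  # POST-as-GET fetches the current authorization object (RFC 8555 section 7.5);
  # send_signed_request is assumed to handle the JWS signing.
  send_signed_request "$authz_url" ""
  # If still pending, re-trigger validation by POSTing {} to its challenge URL
  # (RFC 8555 section 7.5.1); either outcome clears the pending authorization:
  # send_signed_request "$challenge_url" "{}"
done < "$STATE_DIR/pending_authz"
```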

Try http-01 validation if you can.

Also, if you use Apache httpd, look at mod_md - in addition to dns-01, it supports tls-alpn-01 (faster, more secure), as well as http-01 without writing the tokens to disk. You may find that it is a better fit.

lodott72 commented 3 years ago

You make good points about using large SANs and I agree fully, though for now I don't really have a choice here.

But I think I can actually use http-01 for those particular certs, though it still needs some fiddling - I am working on a load-balancer appliance here, so I have to properly import the challenge token into the config, plus coax the correct services into using it to respond to the LE request.

I just happened to be testing with dns-01 when I ran into this, and it helped fix some propagation issues besides. As dns-01 will be needed anyway for domains that are not externally accessible, it was the primary focus.

BTW, dns-01 (and http-01 as well, I would guess) actually already picks up where it left off, since domains, once validated, stay validated at LE for about a month, judging from the expires field in "valid" state responses. I was able to get the cert via dns-01 in multiple runs, which was very convenient.

I might take a stab at implementing your "right" answer for my use case and could be back with pull requests at some point in the future - but no guarantees, I invite you to do so in parallel as well ... ;)

Maybe a question about that rate limit: I am not quite sure anymore that the "pending" requests that were skipped due to a verification error are the actual problem, as RFC 8555 says in section 7.5.1:

> The server SHOULD NOT remove challenges with status "invalid".

On the other hand, I couldn't have accumulated 300 invalid challenges with my tests either, so it must have been those "pending" open-status challenges - as implied by the name of the limit... do you have some insight here?

Also: do you know why working with the staging CA apparently counts towards this limit? This is supposed to be the proper way according to the docs:

> You can have a maximum of 300 Pending Authorizations on your account. Hitting this rate limit is rare, and happens most often when developing ACME clients. It usually means that your client is creating authorizations and not fulfilling them. Please utilize our staging environment if you're developing an ACME client.

tlhackque commented 3 years ago

> But I think I can actually use http-01 for those particular certs, though it still needs some fiddling - I am working on a load-balancer appliance here, so I have to properly import the challenge token into the config, plus coax the correct services into using it to respond to the LE request.

Ordinarily, the members of a load-balanced cluster have (at least internally visible) hostnames and addresses. All you need to do is push the token to all the members' .well-known/acme-challenge directory(ies). Note that getssl locations (the ACL= parameter) can be a list of targets - even mixed methods, e.g.

```sh
ACL=('/var/www/${DOMAIN}/web/.well-known/acme-challenge'
     'ssh:server5:/var/www/${DOMAIN}/web/.well-known/acme-challenge'
     'ftpes:abuser:secret:server6:/var/www/${DOMAIN}/web/.well-known/acme-challenge'
     'ssh:server7:/var/www/${DOMAIN}/web/.well-known/acme-challenge')
```

You can merge all domains into a single target by setting USE_SINGLE_ACL to "true", in which case ${ACL[0]} (the first element) will be used for all.

Alternatively, you can keep that directory on a single member (host) - ideally the one running getssl - and tell the load balancer to direct all requests for .well-known/acme-challenge to that member. Or put .well-known/acme-challenge on an NFS (or Windows) directory and point the members there. Either way, there's a slight risk of that URI being used for a DoS attack on that member - but since it is normally not accessed frequently, if you're concerned you can put a rate limit on it.

> I just happened to be testing with dns-01 when I ran into this, and it helped fix some propagation issues besides. As dns-01 will be needed anyway for domains that are not externally accessible, it was the primary focus.

That's a viable strategy for internal domains (I use it), but not the only one. If you have internal domains and use split (DNS) views, you can use the external DNS view to alias all the internal hostnames to an externally-accessible host; put the tokens there, and have the webserver return 'Forbidden' to everything except .well-known/acme-challenge. Or use a firewall rule to drop other connections with no response (making it look like a closed port). Note that LE's validation will follow CNAMEs and while the initial request will be http (port 80), LE will also accept redirects to port 443 (TLS).

> BTW, dns-01 (and http-01 as well, I would guess) actually already picks up where it left off, since domains, once validated, stay validated at LE for about a month, judging from the expires field in "valid" state responses. I was able to get the cert via dns-01 in multiple runs, which was very convenient.

That's true IF LE gets the 'go ahead, OK to validate' response from the client (getssl). The pending-authorization queue pileup happens when the client doesn't respond: at that point, the client has said 'give me a token and wait', but hasn't said 'ready for validation'.

> I might take a stab at implementing your "right" answer for my use case and could be back with pull requests at some point in the future - but no guarantees, I invite you to do so in parallel as well ... ;)

I'm not up for that - although I've contributed to getssl, I'm not a primary developer. And my quota for time to fix things in it is close to exhaustion. (I recently did non-trivial surgery on the test system - it didn't get scars, but I did...) @timkimber is generally receptive to pull requests - but this one would require a lot of testing (possibly with updates to the tests) to avoid regressions. A lot of people depend on this working, so beyond the technical details, you'd need a test strategy.

> Maybe a question about that rate limit: I am not quite sure anymore that the "pending" requests that were skipped due to a verification error are the actual problem, as RFC 8555 says in section 7.5.1:

> The server SHOULD NOT remove challenges with status "invalid".

See above. The issue is that LE is not removing the pending authorization requests, AND getssl never reuses them - instead (because it forgets them), it issues a new authorization request. LE didn't try to validate the old one, so it accepts the new authorization request - since it can't reply "already validated". This bumps the pending queue. LE eventually times out the authorization request (IIRC, about a week later).

If the validation succeeds, LE holds that result for 30 days and returns "already validated" to new authorization requests.

> On the other hand, I couldn't have accumulated 300 invalid challenges with my tests either, so it must have been those "pending" open-status challenges - as implied by the name of the limit... do you have some insight here?

300 authorization requests / 70 hosts = 4.3 attempts. Did you try at least 5 times?

> Also: do you know why working with the staging CA apparently counts towards this limit? This is supposed to be the proper way according to the docs:

Because the pending authorizations limit isn't different for the staging environment. https://letsencrypt.org/docs/staging-environment/

> The staging environment uses the same rate limits as described for the production environment with the following exceptions:
>
> - The Certificates per Registered Domain limit is 30,000 per week.
> - The Duplicate Certificate limit is 30,000 per week.
> - The Failed Validations limit is 60 per hour.
> - The Accounts per IP Address limit is 50 accounts per 3 hour period per IP.
> - For ACME v2, the New Orders limit is 1,500 new orders per 3 hour period per account.

> You can have a maximum of 300 Pending Authorizations on your account. Hitting this rate limit is rare, and happens most often when developing ACME clients. It usually means that your client is creating authorizations and not fulfilling them. Please utilize our staging environment if you're developing an ACME client.

Edit: corrected ACL example.

lodott72 commented 3 years ago

> But I think I can actually use http-01 for those particular certs, though it still needs some fiddling - I am working on a load-balancer appliance here, so I have to properly import the challenge token into the config, plus coax the correct services into using it to respond to the LE request.

> Ordinarily, the members of a load-balanced cluster have (at least internally visible) hostnames and addresses. All you need to do is push the token to all the members' .well-known/acme-challenge directory(ies). Note that getssl locations (the ACL= parameter) can be a list of targets - even mixed methods, e.g.

Copying files is not involved in my case; it really is an import into the config subsystem, plus making the correct virtual HTTP service use this config entry. But I am confident that I will get this to work, so all should be good here.

> I just happened to be testing with dns-01 when I ran into this, and it helped fix some propagation issues besides. As dns-01 will be needed anyway for domains that are not externally accessible, it was the primary focus.

> That's a viable strategy for internal domains (I use it), but not the only one. If you have internal domains and use split (DNS) views, you can use the external DNS view to alias all the internal hostnames to an externally-accessible host; put the tokens there, and have the webserver return 'Forbidden' to everything except .well-known/acme-challenge. Or use a firewall rule to drop other connections with no response (making it look like a closed port). Note that LE's validation will follow CNAMEs and while the initial request will be http (port 80), LE will also accept redirects to port 443 (TLS).

Interesting, I did not know about this split-view feature - but then, networking is not my area of expertise and I only have very limited influence on the overall networking settings. I don't really expect this to change in the near future without considerable effort, and I think the current options will be sufficient for me.

> I might take a stab at implementing your "right" answer for my use case and could be back with pull requests at some point in the future - but no guarantees, I invite you to do so in parallel as well ... ;)

> I'm not up for that - although I've contributed to getssl, I'm not a primary developer. And my quota for time to fix things in it is close to exhaustion. (I recently did non-trivial surgery on the test system - it didn't get scars, but I did...) @timkimber is generally receptive to pull requests - but this one would require a lot of testing (possibly with updates to the tests) to avoid regressions. A lot of people depend on this working, so beyond the technical details, you'd need a test strategy.

Never mind... it was worth a try, I didn't really expect you to "fall" for that ;)

I don't think the change is that big a deal if it is an experimental feature that has to be consciously turned on... let's see whether I even go that way. Reading through the RFC I have come across pre-authorization, which sounds like a good way to largely decouple the authorizations from the actual cert requests - I could see that as a completely separate option in getssl as well.

Also, I think it might be possible to actually get the pending requests from LE itself, but I haven't actually checked yet.

> On the other hand, I couldn't have accumulated 300 invalid challenges with my tests either, so it must have been those "pending" open-status challenges - as implied by the name of the limit... do you have some insight here?
>
> 300 authorization requests / 70 hosts = 4.3 attempts. Did you try at least 5 times?

That would fit, because the initial attempts failed early on a domain that couldn't possibly work because of a DNS misconfiguration, which should have left about 60 pending challenges per attempt. At some point I patched getssl to skip failed domains, which should have helped - but only because I wanted to see if there were more problems; I wasn't even aware of being in danger of hitting a rate limit (because staging). After throwing the misconfigured domain out and authorizing the rest in staging, I switched CA and the rate limit was hit immediately - the timing might be coincidence.

> Also: do you know why working with the staging CA apparently counts towards this limit? This is supposed to be the proper way according to the docs:
>
> Because the pending authorizations limit isn't different for the staging environment. https://letsencrypt.org/docs/staging-environment/

Fair point. But then why mention the staging CA as if it were a solution in the LE doc? It is kind of misleading...

> You can have a maximum of 300 Pending Authorizations on your account. Hitting this rate limit is rare, and happens most often when developing ACME clients. It usually means that your client is creating authorizations and not fulfilling them. Please utilize our staging environment if you're developing an ACME client.

Rereading this, you could somehow argue that the last sentence is just a stand-alone recommendation for how to develop ACME clients. But for some reason it is attached to hitting this particular limit, as if it had any impact on it?

tlhackque commented 3 years ago

Don't get too excited about preauth.

https://github.com/letsencrypt/boulder/blob/master/docs/acme-divergences.md

> Section 7.4.1 Pre-authorization is an optional feature and we have no plans to implement it. V2 clients should use order based issuance without pre-authorization.

I don't work for LE, so I'm not defending or excusing it. Just pointing out where they document what they do.

They probably want you to use the staging environment because you do get some higher limits if you're well-behaved. And cluttering the production environment is more costly for them (and the production users). 300 pending auths x a developer who uses multiple accounts adds up...

For split views (a.k.a. split horizons), see https://kb.isc.org/docs/aa-00851 and https://bind9.readthedocs.io/en/latest/advanced.html#split-dns

So far as I know, there's no way to get a list of pending authorizations. The client is expected to keep track of them.

lodott72 commented 3 years ago

> Don't get too excited about preauth.

> https://github.com/letsencrypt/boulder/blob/master/docs/acme-divergences.md

Okay, I saw it was optional but naively assumed that LE would implement it - I just checked via /directory, and they in fact do not... well.

> So far as I know, there's no way to get a list of pending authorizations. The client is expected to keep track of them.

Checked that as well - I was basing my hope on the orders field that is specified in section 7.1.2.1 of the RFC.

But an account object returned by LE via new-acct does not contain that field, so you need to keep track of the orders at least.

But returned order objects contain the authorizations field, so the following approach could work:

- store the order URL from each new-order response locally
- on a later run (or after a failure), fetch each stored order and read its authorizations field
- deactivate any authorization that is still pending before creating a new order

This should give sufficient control over the pending-auth limit. I initially thought that all individual auth objects would need to be stored; the above approach seems a little less cluttered - I guess I will try...
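Roughly, a sketch of that cleanup - assuming the order URLs are stored one per line, jq is available for JSON parsing, and a getssl-style send_signed_request helper leaves the reply in $response (all of this illustrative, not existing getssl code):

```sh
# Sketch: deactivate leftover pending authorizations from stored orders.
while read -r order_url; do
  send_signed_request "$order_url" ""        # POST-as-GET the order object
  for authz_url in $(jq -r '.authorizations[]' <<< "$response"); do
    send_signed_request "$authz_url" ""      # POST-as-GET the authz object
    status=$(jq -r '.status' <<< "$response")
    if [ "$status" = "pending" ]; then
      # RFC 8555 section 7.5.2: deactivation takes it out of the pending count.
      send_signed_request "$authz_url" '{"status": "deactivated"}'
    fi
  done
done < "$STATE_DIR/orders"
```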

tlhackque commented 3 years ago

I don't think it's quite that simple.

Besides finding the authz, you also have to validate that the client can see the token before completing the authz - e.g. if the DNS update or HTTP file placement (or in your case, configuration change) initially failed, or the token has since disappeared. You may have to retry placing the token in these cases. You may also have to get a new nonce, since considerable time may elapse between runs.

As a general rule, asking a server for something that you can cache is not a good idea. In this case, it's easy to know all the state transitions since the client is involved, so managing a cache is easy.

Remember that "get order" will activate cURL, causing a network connection and an interaction with the server. Plus, you then have to decode the order object to get the challenges. All this isn't cheap; keeping one or more local files around is.

One easy (lazy) way to handle this is to create a directory named for the domain, and a file for each challenge in that directory. This makes the bookkeeping easy - ls -1 $DIR/* gives you the pending challenges; cat <<<"$CHALLENGE" >"$DIR/$DOMAIN/$n" and rm add and remove them. But be careful about wildcards in the domain name... (rm $DIR/*.example.net is probably not what you want). You can use $DIR/domain.1 for regular domains, and $DIR/domain.wild for wildcards. rm -rf $DIR abandons the order - of course after telling the server. $n is a hash of the challenge (at least of its URL).

In any case, you need to keep track of the challenge status, URL, type, and token in $CHALLENGE. In files, you can separate them with newlines and use read to extract them. You don't want to "complete" a challenge unless it's pending or processing.
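A sketch of that bookkeeping, with the hash naming and read-back made concrete (paths and variable names are illustrative):

```sh
# Sketch: one file per challenge, named by a hash of its URL.
n=$(printf '%s' "$challenge_url" | sha256sum | cut -c1-16)
mkdir -p "$DIR/$DOMAIN"
printf '%s\n' "$status" "$challenge_url" "$type" "$token" > "$DIR/$DOMAIN/$n"

# Read the fields back, one per line, in the same order.
{ read -r status; read -r challenge_url; read -r type; read -r token; } \
  < "$DIR/$DOMAIN/$n"

# Only try to complete challenges that are still actionable.
case "$status" in
  pending|processing) : ;;           # OK to respond / poll
  *) rm -f "$DIR/$DOMAIN/$n" ;;      # end state - clean up
esac
```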

Whether you cache or request a list:

- You need to keep in mind that a challenge may (with LE, usually will) be repeated, which is why you need the URL's hash in the file name. E.g. to defeat routing attacks, LE uses multiple servers to ensure that the challenge works from several places on the internet.

- You need to clean up the pending challenges when an order completes, or when the last one transitions to an end state.

It may be necessary to add a switch (e.g. --abandon-request) to provide a mechanism for a fresh start; not sure. Everything should work, and a switch is likely to be abused. But then there's Murphy's law...

Note that this isn't only a dns-01 issue. http-01 can also end up pending if the file placement fails; e.g. due to a network/server down, or simply a permissions issue (file or account) at the remote.

You will need to develop test cases and add them to the test suite.

It's all doable, but it's not trivial.