inconsistent challenge failures for multi-name certs

seefood commented 8 years ago

I have set up my client (letsencrypt.sh) to run from cron once a month, so I don't know exactly when this started, but here are the main points:

http://pastebin.com/jexxFNMe

3 months ago this same configuration worked smoothly.
It still works smoothly with the staging CA URL, but not with production.
Other than the occasional error 500 on request, the failures usually happen when validating the first or second alias of a cert. If I change the order, the alias (now the main name) passes and the main name (now an alias) fails, which basically says it is not an issue on my end.
As further evidence, I keep getting "error": { "type": "urn:acme:error:connection", and indeed my server doesn't even see the request.

My guess is that something changed with the timeouts of the production server. My server is in Israel, on an 8-year-old Xeon machine. Latency may be a slight issue, but not one that warrants the server giving up so quickly. (Also, note my second point above: it still DOES work with the staging CA.)

jsha commented 8 years ago

I can confirm that Boulder in production has logs indicating its request to your host timed out while waiting for headers. I don't think it's because your server is in Israel - for instance I can curl uma.scso.com in 650 milliseconds from my laptop in the US.

That leaves the possibility that your server is overloaded (as you say, it's an old machine). Do you have monitoring of its response times and load? Do you find that the load is particularly high during the failed requests? Or conversely, is it possible the web server is getting swapped out due to low request volume?

The fact that the first validation request times out and subsequent ones succeed suggests there may be some sort of caching or warmup effect on your box. You may be able to work around this by submitting a test request for the validation URL before requesting that Boulder validate it.
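
A minimal sketch of such a warm-up request, assuming Python 3 is available on the host. The names `challenge_url` and `warm_up` and the `opener` parameter are illustrative only and are not part of letsencrypt.sh or Boulder; how to run this before the client responds to a challenge depends on the client.

```python
import time
import urllib.request


def challenge_url(domain, token):
    """Build the http-01 validation URL for a domain and a challenge token."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def warm_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch the URL once and return (HTTP status, seconds elapsed).

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to urllib's urlopen.
    """
    start = time.monotonic()
    with opener(url, timeout=timeout) as resp:
        resp.read()  # drain the body so the full response is timed
        status = resp.status
    return status, time.monotonic() - start


# Example use (TOKEN stands for the real challenge token):
#   status, elapsed = warm_up(challenge_url("site.co.il", "TOKEN"))
#   print(status, f"{elapsed:.3f}s")
```

If the first call is much slower than an immediate second call, that would be consistent with a warm-up effect; if both are fast, the cause is probably elsewhere.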

seefood commented 8 years ago

I don't think it's my server: load average is 0.35, 0.45, 0.46 on 8 cores with 10G RAM, swap is practically unused, and nginx is very responsive. Most importantly, it works smoothly with the staging URL. Also, I tried at different times of the day and the result is the same.

jsha commented 8 years ago

Are you able to consistently reproduce the connection failure right now, under those low-load conditions?

jsha commented 8 years ago

Also, does your client set up its own web server or provide files for a running web server?

seefood commented 8 years ago

Yes, it's very consistent, and no, it plants the challenge responses in a directory served by nginx at the expected URL. both names are aliases of the same vhost, so the webserver config for both is definitely the same.

Here is a test for another cert, where I switched around the two names; again the first one passes and the alias fails:

Processing mail.site.co.il with alternative names: site.co.il
 + Signing domains...
 + Generating private key...
 + Generating signing request...
 + Requesting challenge for mail.site.co.il...
 + Requesting challenge for site.co.il...
 + Responding to challenge for mail.site.co.il...
 + Challenge is valid!
 + Responding to challenge for site.co.il...
ERROR: Challenge is invalid! (returned: invalid) (result: {
  "type": "http-01",
  "status": "invalid",
  "error": {
    "type": "urn:acme:error:connection",
    "detail": "Could not connect to http://site.co.il/.well-known/acme-challenge/md58luZ1023oaL-lACkV6kbxsWiM25Xj5FnLh6ziZ9Y",
    "status": 400
[...]

Both point, of course, at the same directory. To test response timing, you can use these: http://site.co.il/.well-known/acme-challenge/321.txt http://mail.site.co.il/.well-known/acme-challenge/321.txt

Testing instead with CA="https://acme-staging.api.letsencrypt.org/directory":

Processing mail.site.co.il with alternative names: site.co.il
 + Signing domains...
 + Generating private key...
 + Generating signing request...
 + Requesting challenge for mail.site.co.il...
 + Requesting challenge for site.co.il...
 + Responding to challenge for mail.site.co.il...
 + Challenge is valid!
 + Responding to challenge for site.co.il...
 + Challenge is valid!
 + Requesting certificate...
 + Checking certificate...
 + Done!
 + Creating fullchain.pem...
[...]

Just as expected, but not with the production CA...

seefood commented 8 years ago

Switched back to the production CA, and got a 500 error. Does this help?

 + Signing domains...
 + Generating private key...
 + Generating signing request...
 + Requesting challenge for mail.site.co.il...
 + Requesting challenge for site.co.il...
  + ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/new-authz (Status 500)

Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference&#32;&#35;179&#46;2400a81f&#46;1466573718&#46;202142c7
</BODY></HTML>

I immediately ran it again, and it again produced the usual result: the challenge passes for the first name and fails on the second.

seefood commented 8 years ago

Made another test, this time with automatic http->https redirection turned off, and annoyingly now it worked. So here's a new important piece of information for you :-)

[... other error example deleted; this was an error of some other sort, DNS related ...]

I turned the redirection back on after I got new valid certs, so the configuration is back at the baseline. I'm still pretty sure it's a timeout issue on the production CA, or that it's overloaded.

cpu commented 8 years ago

> Made another test, this time with automatic http->https redirection turned off, and annoyingly now it worked. So here's a new important piece of information for you :-)

Can you share some details of how this redirect is implemented in your web server configuration?

seefood commented 8 years ago

Sure, I just add this to the host definition (nginx):

if ($ssl_protocol = "") {
  rewrite ^ https://$host$request_uri permanent;
}

Here, permanent means a 301 Moved Permanently reply to the client that requests on port 80. The validator would have to make a second HTTP request, of course, but the documentation says that is perfectly acceptable.
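
To see what a client following that redirect goes through, here is a rough sketch that issues each request by hand and records every hop. It only approximates what a validator might do; `fetch_once` and `trace_redirects` are made-up names, and the hop limit of 5 is an arbitrary assumption. Note that Python verifies the HTTPS certificate by default, so an expired certificate would raise an error on the second hop here.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_once(url, timeout=10.0):
    """Issue one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, max_hops=5, fetch=fetch_once):
    """Follow Location headers by hand; return a list of (url, status) hops."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops


# Example use:
#   for hop_url, status in trace_redirects(
#           "http://site.co.il/.well-known/acme-challenge/321.txt"):
#       print(status, hop_url)
```

With the rewrite rule above in place, the expected trace is a 301 on the http:// URL followed by the final status on the https:// URL, which means two connections have to complete within whatever time the validator allows.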