letsencrypt / boulder

An ACME-based certificate authority, written in Go.
Mozilla Public License 2.0

inconsistent challenge failures for multi-name certs #1943

Closed seefood closed 7 years ago

seefood commented 8 years ago

I have set up my client (letsencrypt.sh) to run from cron once a month, so I don't know when this started, but here are the main points:

http://pastebin.com/jexxFNMe

My guess is that something changed with the timeouts on the production server. My server is in Israel, on an 8-year-old Xeon machine. Latency may be a slight issue, but not one that warrants the server giving up so quickly. (Also, note my second point above: it still DOES work with the staging CA.)

jsha commented 8 years ago

I can confirm that Boulder in production has logs indicating that its request to your host timed out while waiting for headers. I don't think it's because your server is in Israel - for instance, I can curl uma.scso.com in 650 milliseconds from my laptop in the US.

That leaves the possibility that your server is overloaded (as you say, it's an old machine). Do you have monitoring of its response times and load? Do you find that the load is particularly high during the failed requests? Or conversely, is it possible the web server is getting swapped out due to low request volume?

The fact that the first validation request times out and subsequent ones succeed suggests there may be some sort of caching or warmup effect on your box. You may be able to work around this by submitting a test request for the validation URL before requesting that Boulder validate it.
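
A minimal sketch of such a warm-up request, assuming Python 3 and only the standard library. The helper names `challenge_url` and `warm_up` are made up for illustration and are not part of letsencrypt.sh or Boulder; the token is whatever your client wrote into the challenge directory.

```python
import time
import urllib.request


def challenge_url(domain, token):
    """Build the http-01 validation URL for a domain and challenge token."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def warm_up(url, timeout=10.0):
    """Fetch the URL once and return (HTTP status, elapsed seconds).

    urlopen follows redirects by default and raises urllib.error.HTTPError
    for 4xx/5xx responses, so a missing challenge file shows up as an
    exception rather than as a returned status.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # drain the body so the full response is timed
        status = resp.status
    return status, time.monotonic() - start
```

Calling `warm_up(challenge_url(domain, token))` right before the client asks the CA to validate would show whether the first hit on the URL is noticeably slower than later ones. Where exactly to hook this into a given client is left open here.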

seefood commented 8 years ago

I don't think it's my server: load average is 0.35, 0.45, 0.46 on 8 cores with 10G RAM, swap is practically unused, and nginx is very responsive. Most importantly, it works smoothly with the staging URL. Also, I tried at different times of day, and the result is the same.

jsha commented 8 years ago

Are you able to consistently reproduce the connection failure right now, under those low-load conditions?

jsha commented 8 years ago

Also, does your client set up its own web server or provide files for a running web server?

seefood commented 8 years ago

Yes, it's very consistent. And no, it plants the challenge responses in a directory served by nginx at the expected URL. Both names are aliases of the same vhost, so the web server config for both is definitely the same.

Here is a test for another cert, where I switched the two names around. Again, the first one passes and the alias fails:

Processing mail.site.co.il with alternative names: site.co.il
 + Signing domains...
 + Generating private key...
 + Generating signing request...
 + Requesting challenge for mail.site.co.il...
 + Requesting challenge for site.co.il...
 + Responding to challenge for mail.site.co.il...
 + Challenge is valid!
 + Responding to challenge for site.co.il...
ERROR: Challenge is invalid! (returned: invalid) (result: {
  "type": "http-01",
  "status": "invalid",
  "error": {
    "type": "urn:acme:error:connection",
    "detail": "Could not connect to http://site.co.il/.well-known/acme-challenge/md58luZ1023oaL-lACkV6kbxsWiM25Xj5FnLh6ziZ9Y",
    "status": 400
[...]

Both point, of course, at the same directory. To test response timing, you can use these: http://site.co.il/.well-known/acme-challenge/321.txt http://mail.site.co.il/.well-known/acme-challenge/321.txt

Testing instead with CA="https://acme-staging.api.letsencrypt.org/directory":

Processing mail.site.co.il with alternative names: site.co.il
 + Signing domains...
 + Generating private key...
 + Generating signing request...
 + Requesting challenge for mail.site.co.il...
 + Requesting challenge for site.co.il...
 + Responding to challenge for mail.site.co.il...
 + Challenge is valid!
 + Responding to challenge for site.co.il...
 + Challenge is valid!
 + Requesting certificate...
 + Checking certificate...
 + Done!
 + Creating fullchain.pem...
[...]

Just as expected, but not with the production CA...

seefood commented 8 years ago

Switched back to the production CA, and got a 500 error. Does this help?

 + Signing domains...
 + Generating private key...
 + Generating signing request...
 + Requesting challenge for mail.site.co.il...
 + Requesting challenge for site.co.il...
  + ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/new-authz (Status 500)

Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference&#32;&#35;179&#46;2400a81f&#46;1466573718&#46;202142c7
</BODY></HTML>

Immediately ran it again, and it again produced the usual result: the challenge passes for the first name and fails on the second.

seefood commented 8 years ago

Made another test, this time with automatic http->https redirection turned off, and annoyingly now it worked. So here's a new important piece of information for you :-)

[... other error example deleted; it was an error of some other sort, DNS related ...]

Turning on the redirection after I got new valid certs, so the configuration is back at the baseline. Still pretty sure it's a timeout issue on the production CA, or it's overloaded.

cpu commented 8 years ago

> Made another test, this time with automatic http->https redirection turned off, and annoyingly now it worked. So here's a new important piece of information for you :-)

Can you share some details of how this redirect is implemented in your web server configuration?

seefood commented 8 years ago

Sure, I just add this to the host definition (nginx):

if ($ssl_protocol = "") {
  rewrite ^ https://$host$request_uri permanent;
}

Here, permanent means a 301 Moved Permanently reply to the client that requests on port 80. The validator would have to make a second HTTP request, of course, but the documentation says that is perfectly acceptable.
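
To see what each hop of that redirect costs, here is a rough sketch, assuming Python 3 and the standard library, that follows the redirect chain by hand and times every request separately. `fetch_once` and `trace_redirects` are illustrative names, and this is not how the validator itself is implemented.

```python
import time
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib from following the redirect, so the
        # 3xx response surfaces as an HTTPError we can inspect.
        return None


def fetch_once(url, timeout=10.0):
    """Issue one GET without following redirects; return (status, Location)."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        status, location = err.code, err.headers.get("Location")
        err.close()
        return status, location


def trace_redirects(url, max_hops=5, fetch=fetch_once):
    """Follow redirects manually; return a list of (url, status, seconds)."""
    hops = []
    for _ in range(max_hops):
        start = time.monotonic()
        status, location = fetch(url)
        hops.append((url, status, time.monotonic() - start))
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops
```

Running `trace_redirects` against the port 80 challenge URL for each name would show whether the delay sits in the plain HTTP request or in the HTTPS request that the 301 points to.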

pastukhov commented 8 years ago

Same for me today

  + ERROR: An error occurred while sending post-request to https://acme-staging.api.letsencrypt.org/acme/challenge/NJw6oNi2tQYUza8ZIdUVKLgxPi7SKXJ0Mi_GOIYQeB8/14214224 (Status 500)

Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference&#32;&#35;179&#46;609b7b5c&#46;1473943695&#46;11cf3b5c
</BODY></HTML>

pastukhov commented 8 years ago

And same for production

 + ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/challenge/GCoS0y4v4B3Ng9n7O4wgzlyMOgh57Fwwx6chjkDwngY/265138764 (Status 500)

Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference&#32;&#35;179&#46;609b7b5c&#46;1473944118&#46;11e1ca6b
</BODY></HTML>

jsha commented 8 years ago

@pastukhov are you also using letsencrypt.sh?

pastukhov commented 8 years ago

@jsha letsencrypt.sh and acme-tiny. The second one returns httplib.BadStatusLine.

jsha commented 8 years ago

I assume this reproduces reliably? If so, can you tweak one of those tools to output the full headers from the response, as well as the body that was POSTed, and provide those logs? Thanks!

pastukhov commented 8 years ago

No, this bug is intermittent. I will run letsencrypt.sh with bash -x the next few times and post the output here if I catch it again.

jsha commented 8 years ago

Note that bash -x is insufficient to get full logs. You'll need to add -vv to curl.

pastukhov commented 8 years ago

Ok

pastukhov commented 8 years ago

It doesn't reproduce today.

jsha commented 7 years ago

Conclusion: This error occurs intermittently when Let's Encrypt is having an outage. If you encounter this bug in the future, please check https://letsencrypt.status.io/ to see if there is currently an outage, and try again later.
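
For scripted renewals, "try again later" can be sketched as a retry with exponential backoff. This is only an illustration in Python 3: `retry_on_server_error` and its `send` callable are hypothetical and not part of any ACME client, and `send` is assumed to build a fresh request on every call (an ACME POST needs a new nonce each time).

```python
import time


def retry_on_server_error(send, attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    send is a hypothetical callable returning (status, body). The delay
    doubles after every failed attempt: base_delay, 2 * base_delay, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        status, body = send()
        if status < 500:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Every attempt failed; hand the last 5xx response back to the caller.
    return status, body
```

The attempt count and delays are arbitrary defaults. During a real outage the status page is still the better signal for when to retry.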

fragpit commented 7 years ago

Facing this issue today, even though the Let's Encrypt status page reports OK. It happens on about 124 requests.

+ ERROR: An error occurred while sending post-request to https://acme-staging.api.letsencrypt.org/acme/new-authz (Status 500)

Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference&#32;&#35;179&#46;72fcd4d9&#46;1481875287&#46;9e6f771
</BODY></HTML>

 + CloudFlare hook executing: clean_challenge
 + http_request() error in letsencrypt.sh?

cpu commented 7 years ago

@fragpit What domains are you trying to issue for? What is your registration ID in staging? What ACME client are you using, and do you have logs you can share beyond the above? Do you believe it's related to the original issue in this thread? It seems to me to be a generic 500 result.

fragpit commented 7 years ago

I use the dehydrated client. I also found that if I add sleep 5 to the for loop that performs the DNS challenges, everything goes well.

cpu commented 7 years ago

@fragpit What domains are you trying to issue for? What is your registration ID in staging?

cpu commented 7 years ago

@fragpit This isn't related to the issue from this thread, so I opened a new one for us to discuss your problem: https://github.com/letsencrypt/boulder/issues/2436 Please comment there with the information requested.