Closed seefood closed 7 years ago
I can confirm that Boulder in production has logs indicating its request to your host timed out while waiting for headers. I don't think it's because your server is in Israel; for instance, I can curl uma.scso.com in 650 milliseconds from my laptop in the US.
That leaves the possibility that your server is overloaded (as you say, it's an old machine). Do you have monitoring of its response times and load? Do you find that the load is particularly high during the failed requests? Or conversely, is it possible the web server is getting swapped out due to low request volume?
The fact that the first validation request times out and subsequent ones succeed suggests there may be some sort of caching or warmup effect on your box. You may be able to work around by submitting a test request for the validation URL before requesting that Boulder validate it.
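A minimal sketch of that workaround, assuming a small Python helper run just before the ACME client responds to the challenge. The name `warm_up`, the 10-second timeout, and the injectable `opener` parameter are illustrative assumptions, not part of letsencrypt.sh or Boulder.

```python
import time
import urllib.request


def warm_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Fetch url once and return (status, seconds).

    status is None if the request failed or timed out.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            resp.read()  # drain the body so the full response is timed
            status = resp.status
    except OSError:  # covers URLError, HTTPError and socket timeouts
        status = None
    return status, time.monotonic() - start
```

Calling `warm_up()` on the challenge URL twice in a row and comparing the two timings would also show whether the first request is noticeably slower than the second.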
I don't think it's my server, since it shows load average: 0.35, 0.45, 0.46, 8 cores and 10G RAM, swap practically unused, and nginx is very responsive. Most importantly, it works smoothly with the staging URL. Also, I tried at different times of the day, and the result is the same.
Are you able to consistently reproduce the connection failure right now, under those low-load conditions?
Also, does your client set up its own web server or provide files for a running web server?
Yes, it's very consistent, and no, it plants the challenge responses in a directory served by nginx at the expected URL. both names are aliases of the same vhost, so the webserver config for both is definitely the same.
Here is a test for another cert, where I switched around the two names; again the first one passes and the alias fails:
Processing mail.site.co.il with alternative names: site.co.il
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting challenge for mail.site.co.il...
+ Requesting challenge for site.co.il...
+ Responding to challenge for mail.site.co.il...
+ Challenge is valid!
+ Responding to challenge for site.co.il...
ERROR: Challenge is invalid! (returned: invalid) (result: {
"type": "http-01",
"status": "invalid",
"error": {
"type": "urn:acme:error:connection",
"detail": "Could not connect to http://site.co.il/.well-known/acme-challenge/md58luZ1023oaL-lACkV6kbxsWiM25Xj5FnLh6ziZ9Y",
"status": 400
[...]
Both pointing of course at the same directory. To test timing responses, you can use these: http://site.co.il/.well-known/acme-challenge/321.txt http://mail.site.co.il/.well-known/acme-challenge/321.txt
Testing instead with CA="https://acme-staging.api.letsencrypt.org/directory":
Processing mail.site.co.il with alternative names: site.co.il
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting challenge for mail.site.co.il...
+ Requesting challenge for site.co.il...
+ Responding to challenge for mail.site.co.il...
+ Challenge is valid!
+ Responding to challenge for site.co.il...
+ Challenge is valid!
+ Requesting certificate...
+ Checking certificate...
+ Done!
+ Creating fullchain.pem...
[...]
Just as expected, but not with the production CA...
Switched back to the production CA, and got a 500 error. Does this help?
+ Signing domains...
+ Generating private key...
+ Generating signing request...
+ Requesting challenge for mail.site.co.il...
+ Requesting challenge for site.co.il...
+ ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/new-authz (Status 500)
Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference #179.2400a81f.1466573718.202142c7
</BODY></HTML>
Immediately ran again, and again produced the usual - challenge passes the first name, fails on the second.
Made another test, this time with automatic http->https redirection turned off, and annoyingly now it worked. So here's a new important piece of information for you :-)
[... other error example deleted; it was an error of some other sort, DNS related ...]
Turning on the redirection after I got new valid certs, so the configuration is back at the baseline. Still pretty sure it's a timeout issue on the production CA, or it's overloaded.
Made another test, this time with automatic http->https redirection turned off, and annoyingly now it worked. So here's a new important piece of information for you :-)
Can you share some details of how this redirect is implemented in your web server configuration?
sure, I just add this in the host definition (nginx)
if ($ssl_protocol = "") {
rewrite ^ https://$host$request_uri permanent;
}
Here, permanent means a 301 Moved Permanently reply to the client that requests on port 80. The validator would have to make a second HTTP request, of course, but the documentation says that is perfectly acceptable.
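To see how much time each hop of such a redirect takes, the chain can be followed by hand. This is a rough sketch using only the Python standard library; `trace_redirects`, its hop limit, and its timeout are assumptions made for illustration, and it does not claim to reproduce how Boulder's validator behaves.

```python
import http.client
import time
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_redirects(url, max_hops=5, timeout=10.0):
    """Follow Location headers manually; return [(url, status, seconds), ...]."""
    hops = []
    for _ in range(max_hops):
        parts = urlsplit(url)
        if parts.scheme == "https":
            conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
        else:
            conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        start = time.monotonic()
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            resp.read()
            status = resp.status
            location = resp.getheader("Location")
        finally:
            conn.close()
        hops.append((url, status, time.monotonic() - start))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # Location may be relative
    return hops
```

Running it against the two test URLs given earlier in the thread, with the redirect enabled and disabled, would show whether the second (HTTPS) hop is where the time goes.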
Same for me today
+ ERROR: An error occurred while sending post-request to https://acme-staging.api.letsencrypt.org/acme/challenge/NJw6oNi2tQYUza8ZIdUVKLgxPi7SKXJ0Mi_GOIYQeB8/14214224 (Status 500)
Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference #179.609b7b5c.1473943695.11cf3b5c
</BODY></HTML>
And same for production
+ ERROR: An error occurred while sending post-request to https://acme-v01.api.letsencrypt.org/acme/challenge/GCoS0y4v4B3Ng9n7O4wgzlyMOgh57Fwwx6chjkDwngY/265138764 (Status 500)
Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference #179.609b7b5c.1473944118.11e1ca6b
</BODY></HTML>
@pastukhov are you also using letsencrypt.sh?
@jsha letsencrypt.sh and acme-tiny
The second is returning httplib.BadStatusLine
I assume this reproduces reliably? If so, can you tweak one of those tools to output the full headers from the response, as well as the body that was POSTed, and provide those logs? Thanks!
No, this bug is intermittent. I will run letsencrypt.sh with bash -x the next few times and post the output here if I catch it again.
Note that bash -x is insufficient to get full logs. You'll need to add -vv to curl.
Ok
It doesn't reproduce today.
Conclusion: This error occurs intermittently when Let's Encrypt is having an outage. If you encounter this bug in the future, please check https://letsencrypt.status.io/ to see if there is currently an outage, and try again later.
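For clients that run unattended, "try again later" can be approximated with a bounded retry. The following is only a sketch: `retry_on_server_error`, the four attempts, and the 5-second base delay are assumed values chosen for illustration, and `send` stands for whatever function in your client performs the POST and returns the HTTP status code.

```python
import time


def retry_on_server_error(send, attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call send() until it returns a non-5xx status or attempts run out.

    send is any zero-argument callable returning an HTTP status code.
    Returns the last status seen.
    """
    status = None
    for attempt in range(attempts):
        status = send()
        if status < 500:
            return status
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    return status
```

Keeping the number of attempts small matters here, since hammering the API during an outage helps nobody.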
Facing this issue today, even though the Let's Encrypt status is OK. It happens on about 124 requests.
+ ERROR: An error occurred while sending post-request to https://acme-staging.api.letsencrypt.org/acme/new-authz (Status 500)
Details:
<HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY>
An error occurred while processing your request.<p>
Reference #179.72fcd4d9.1481875287.9e6f771
</BODY></HTML>
+ CloudFlare hook executing: clean_challenge
+ http_request() error in letsencrypt.sh?
@fragpit What domains are you trying to issue for? What is your registration ID in staging? What ACME client are you using, and do you have logs you can share beyond the above? Do you believe it's related to the original issue in this thread? It seems to me to be a generic 500 result.
I use the dehydrated client. Also, I found that if I add sleep 5 to the for loop that makes the DNS challenge, everything goes well.
@fragpit What domains are you trying to issue for? What is your registration ID in staging?
@fragpit This isn't related to the issue from this thread, so I opened a new one for us to discuss your problem: https://github.com/letsencrypt/boulder/issues/2436 Please comment there with the information requested.
I have set up my client (letsencrypt.sh) to run from cron once a month, so I don't know when this started, but here are the main points:
http://pastebin.com/jexxFNMe
"error": { "type": "urn:acme:error:connection", and indeed my server doesn't even see the request. My guess is something changed with the timeouts of the production server. My server is in Israel and on an 8-year-old Xeon machine. Latency may be a slight issue, but not one that warrants the server giving up too quickly. (Also, note my second point above: it still DOES work with the staging CA.)