Open sasa1977 opened 3 years ago
Did you save a copy of the chain that was returned by the server during renewal? I'd love to have a look
What kind of clients? I wonder if it had anything to do with the Let's Encrypt root CA changes that affected some HTTP clients on recent Erlang versions, depending on their SSL config.
Best explanation I've heard was on Thinking Elixir a few episodes back. It was a whole thing. End of September was the breakpoint.
We had the same with a current Safari on macOS and a service we are building; the webserver is Cowboy. We played around, and changing to full_chain fixed it. Didn't read Bram's two blog posts.
This was on a server not a client BTW
I've been looking through the code and I can't pinpoint any particular part where things might have broken, but I do suspect the issue wasn't Let's Encrypt returning an expired intermediate, but rather OTP selecting a remnant of a previous chain.
Keep in mind that Erlang/OTP until recently did not give you full control over the chain it would send to clients. Instead, you would give it the server certificate (`cert`) and a bunch of other certificates it might find interesting (`cacerts`), and it would build a chain of certificates itself (matching Issuer and Subject names, and recursing until the issuer cannot be found or appears to be a self-signed root CA). Only in recent OTP versions can `cert` be set to a list containing the server certificate and everything else you want to send, which it will send as-is.
With Let's Encrypt's great switch between May 4 and Sep 30 they swapped out some intermediate certificates for new ones, with the same Subject, but different Issuer and Expiry. If the earlier version of the R3 intermediate, signed by the DST root CA and expiring on Sep 30, was still reachable by OTP when it tried to construct the chain of certificates to send to clients, it might have selected it over the newer one returned in the chain of the latest certificate renewal.
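To make the theory concrete, here is a toy model of name-based chain building. It is only a sketch: certificates are plain maps, the module name `ChainSketch` and the `label` field are made up for illustration, and it does not reproduce how OTP's `ssl` application actually stores or selects certificates. It only shows why two intermediates with the same Subject are indistinguishable to a lookup that matches on names alone.

```elixir
defmodule ChainSketch do
  # Toy model, NOT OTP's implementation: walk from the server certificate
  # towards a root by matching each issuer name against candidate subjects.
  def build(cert, candidates, acc \\ [])

  # Subject equals issuer: treat as a self-signed root and stop.
  def build(%{subject: name, issuer: name} = cert, _candidates, acc) do
    Enum.reverse([cert | acc])
  end

  def build(cert, candidates, acc) do
    case Enum.find(candidates, &(&1.subject == cert.issuer)) do
      # Issuer not available: the chain ends here.
      nil -> Enum.reverse([cert | acc])
      # Issuer found: recurse, dropping it from the pool to avoid loops.
      issuer -> build(issuer, List.delete(candidates, issuer), [cert | acc])
    end
  end
end

leaf = %{label: :leaf, subject: "example.com", issuer: "R3"}
old_r3 = %{label: :old_r3, subject: "R3", issuer: "DST Root CA X3"}
new_r3 = %{label: :new_r3, subject: "R3", issuer: "ISRG Root X1"}

# Both intermediates match the leaf's issuer name; the stale one wins here
# simply because it is found first.
chain = ChainSketch.build(leaf, [old_r3, new_r3])
IO.inspect(Enum.map(chain, & &1.label))
# => [:leaf, :old_r3]
```

In this model the outcome depends purely on lookup order, which is why a leftover copy of the old intermediate would be enough to produce the symptoms described, if it were still reachable.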
Again, the code all looks ok, and the PEM cache does get purged during renewal. But this theory seems to me more plausible than Let's Encrypt sending an expired intermediate: others would have noticed that and reported it.
BTW, you said only some clients were reporting issues: a potential explanation for that would be the AIA extension. Many modern clients, in particular web browsers, will fetch intermediate CAs from the URLs specified in that extension, when the server is not sending them or (in this case) is sending outdated copies
> I've been looking through the code and I can't pinpoint any particular part where things might have broken, but I do suspect the issue wasn't Let's Encrypt returning an expired intermediate, but rather OTP selecting a remnant of a previous chain.
Yeah, I'm starting to suspect the same thing. I have the previous cert + chain, and it all looks good there. The funny thing is that after renewal the new cert has been used, but it seemed there was something wrong in the chain. So it looks like perhaps `:ssl.clear_pem_cache` doesn't clear intermediates?
I'm sorry I didn't try a plain restart before going for the full account recreate. As I said, I have the previous cert stored, so I could try restarting the system with that. If everything works fine it should prove that the problem was not in cert/chain provided by LetsEncrypt, but rather something that was cached in the BEAM instance. I don't have the time to do this immediately (probably won't make it today), but I'll give it a try and report back.
> BTW, you said only some clients were reporting issues: a potential explanation for that would be the AIA extension. Many modern clients, in particular web browsers, will fetch intermediate CAs from the URLs specified in that extension, when the server is not sending them or (in this case) is sending outdated copies
Interesting observation! Here's the original issue, though I can't tell which browser has been used. On my machine everything worked fine with Firefox and Chromium. I was able to reproduce the error with curl, and on Android with the Dolphin browser.
> The funny thing is that after renewal the new cert has been used
To clarify, I mean here that I forced a cert renewal without a restart, and the new cert was used immediately, but some clients still reported R3 expiry.
> So it looks like perhaps `:ssl.clear_pem_cache` doesn't clear intermediates?
It clears the cache that's used to avoid reading and parsing PEM files from disk every time they are referenced, but yes, I suspect there is another layer of caching for the server cert's chain. Which makes sense: you wouldn't want to have to reconstruct the chain again and again, even if the PEM file cache saves you the round-trip to disk and the DER parsing.
I think restarting the Phoenix Endpoint would have resolved the problem.
When I have some more time I will try to reproduce the problem with a small minimal setup, without all the cert renewal business, and then I'll see if I can trace it through `:ssl_certificate` and `:ssl_pkix_db`.
> On my machine everything worked fine with Firefox and Chromium. I was able to reproduce the error with curl, and on Android with the Dolphin browser.
That makes sense: curl wouldn't use AIA, and Chromium definitely does. Not sure about FF, but I guess it does too.
> When I have some more time I will try to reproduce the problem with a small minimal setup, without all the cert renewal business, and then I'll see if I can trace it through `:ssl_certificate` and `:ssl_pkix_db`.
There is already a test which verifies cert renewal. However it currently doesn't compare the intermediate cert. I think that this test could be adapted to reproduce the problem. If I find the time I'll try to do it myself.
> I think restarting the Phoenix Endpoint would have resolved the problem.
Yeah, that's what site_encrypt did initially, but I moved away from it because I wanted to avoid disruption on cert renewal. I guess I could return to restart, although I'm not completely happy. One possible mitigation would be to restart the endpoint only if the chain has been changed since the last time. WDYT?
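The "restart only if the chain changed" idea could be sketched roughly as follows. This assumes the intermediate chain is available as a PEM binary both before and after renewal; the module name `ChainFingerprint` is made up for illustration and nothing here is part of site_encrypt's API. It uses only `:public_key.pem_decode/1` and `:crypto.hash/2` from OTP (in a Mix project, `:public_key` and `:crypto` would need to be listed under `extra_applications`).

```elixir
defmodule ChainFingerprint do
  # SHA-256 over the DER bytes of every certificate in a PEM bundle, in order.
  # Non-certificate PEM entries are ignored.
  def digest(chain_pem) when is_binary(chain_pem) do
    ders = for {:Certificate, der, _} <- :public_key.pem_decode(chain_pem), do: der
    :crypto.hash(:sha256, ders)
  end

  # True when the two bundles contain different certificates (or a different order).
  def changed?(old_chain_pem, new_chain_pem) do
    digest(old_chain_pem) != digest(new_chain_pem)
  end
end

# Hypothetical use after a renewal (restart_endpoint/0 is a placeholder):
#
#   if ChainFingerprint.changed?(old_chain_pem, new_chain_pem), do: restart_endpoint()
```

Comparing DER bytes rather than the raw PEM text avoids treating whitespace or line-wrapping differences as a chain change.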
> There is already a test which verifies cert renewal. However it currently doesn't compare the intermediate cert. I think that this test could be adapted to reproduce the problem. If I find the time I'll try to do it myself.
We have until September 2025 to prepare for the next intermediate certificate change :)
> Yeah, that's what site_encrypt did initially, but I moved away from it because I wanted to avoid disruption on cert renewal. I guess I could return to restart, although I'm not completely happy. One possible mitigation would be to restart the endpoint only if the chain has been changed since the last time. WDYT?
That's one option. Other options include:
- … `plug_cowboy`, and it may require making assumptions about Plug/Cowboy internals
- … `cert` option; only works on OTP 24 or later
I want to try and figure out exactly how the chain caching works; maybe there is an easier way to force `ssl` to rebuild the chain...
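As a rough illustration of the `cert`-as-a-list approach mentioned earlier in the thread, the sketch below turns PEM data into `:ssl` options where `cert` holds the server certificate followed by the rest of the chain, so nothing is left for OTP to assemble from `cacerts`. The module name `FullChainOpts` and the three-PEM input layout are assumptions, not site_encrypt's API, and whether Plug/Cowboy would pass these options through unchanged is not verified here.

```elixir
defmodule FullChainOpts do
  @key_types [:RSAPrivateKey, :ECPrivateKey, :PrivateKeyInfo]

  # Builds a keyword list in the shape of :ssl server options.
  # cert_pem: server certificate, chain_pem: intermediates, key_pem: private key.
  def ssl_opts(cert_pem, chain_pem, key_pem) do
    # Server certificate first, then the intermediates, all as DER binaries.
    ders =
      for pem <- [cert_pem, chain_pem],
          {:Certificate, der, _} <- :public_key.pem_decode(pem),
          do: der

    # Pick the first private key entry (an EC key file may also hold parameters).
    {key_type, key_der, _} =
      key_pem
      |> :public_key.pem_decode()
      |> Enum.find(fn {type, _der, _enc} -> type in @key_types end)

    [cert: ders, key: {key_type, key_der}]
  end
end
```

Because the list is sent as-is, a stale intermediate can only end up on the wire if it is present in the PEM data handed to this function.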
Recently I got a report that when visiting theerlangelist.com some clients emit an error that the cert has expired. I force renewed the certificate but that didn't help. Apparently, even after renewal, the certificate had two chains, one of which contained the recently expired intermediate certificate.
I was able to solve this by creating a new ACME account from scratch, using the following procedure:
- … `db_folder` property in the `certification` endpoint callback)
It feels quite disruptive, and I can't say I understand why I had to recreate the account. Does anyone have a better idea (or can at least explain why this was needed)? Is site_encrypt missing something that would automatically handle this situation?