sasa1977 / site_encrypt

Integrated certification via Let's encrypt for Elixir-powered sites
MIT License

cert has expired #41

Open sasa1977 opened 2 years ago

sasa1977 commented 2 years ago

Recently I got a report that when visiting theerlangelist.com some clients emit an error that the cert has expired. I force renewed the certificate but that didn't help. Apparently, even after renewal, the certificate had two chains, one of which contained the recently expired intermediate certificate.

I was able to solve this by creating the new ACME account from scratch, using the following procedure:

  1. Stop the site
  2. Back up the site_encrypt database (as configured via the db_folder property in the certification endpoint callback)
  3. Remove the site_encrypt database
  4. Start the site

It feels quite disruptive, and I can't say I understand why I had to recreate the account. Does anyone have a better idea (or can at least explain why this was needed)? Is site_encrypt missing something that would automatically handle this situation?

voltone commented 2 years ago

Did you save a copy of the chain that was returned by the server during renewal? I'd love to have a look

lawik commented 2 years ago

What kind of clients? I wonder if it had anything to do with the Let's Encrypt root CA changes that affected some HTTP clients on recent Erlang versions, depending on their SSL config.

Best explanation I've heard was on Thinking Elixir a few episodes back. It was a whole thing. End of September was the breakpoint.

peerst commented 2 years ago

We had the same with a current Safari on macOS and a service we are building; the web server is Cowboy. We played around, and changing to full_chain fixed it. I didn't read Bram's two blog posts.

peerst commented 2 years ago

This was on a server not a client BTW

voltone commented 2 years ago

I've been looking through the code and I can't pinpoint any particular part where things might have broken, but I do suspect the issue wasn't Let's Encrypt returning an expired intermediate, but rather OTP selecting a remnant of a previous chain.

Keep in mind that Erlang/OTP until recently did not give you full control over the chain it would send to clients. Instead, you would give it the server certificate (cert) and a bunch of other certificates it might find interesting (cacerts), and it would build a chain of certificates itself (matching Issuer and Subject names, and recursing until the issuer cannot be found or appears to be a self-signed root CA). Only in recent OTP versions can cert be set to a list containing the server certificate and everything else you want to send, which it will send as-is.
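To make the two styles concrete, here is a minimal sketch of how the options might be assembled. The module and function names are hypothetical (not part of site_encrypt or OTP); only the cert, cacerts and key option names come from :ssl, and the binaries are placeholders for real DER data.

```elixir
defmodule ChainOptions do
  # Hypothetical helper, for illustration only.
  # `chain` is a list of DER-encoded certificates, server certificate first.

  # Older style: hand over the server cert plus "other" certs and let
  # OTP assemble the chain itself.
  def legacy_opts([server_cert | other_certs], key) do
    [cert: server_cert, cacerts: other_certs, key: key]
  end

  # Newer style: pass the whole chain as `cert`, so it is sent as given.
  def explicit_opts([_server_cert | _] = chain, key) do
    [cert: chain, key: key]
  end
end

# Placeholder binaries stand in for real DER data.
chain = ["leaf-der", "r3-der"]
key = {:PrivateKeyInfo, "key-der"}

IO.inspect(ChainOptions.legacy_opts(chain, key))
IO.inspect(ChainOptions.explicit_opts(chain, key))
```

With the explicit form there is no chain-building step left in which a stale intermediate could be picked up, assuming the OTP version in use supports a list for cert.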

With Let's Encrypt's great switch between May 4 and Sep 30 they swapped out some intermediate certificates for new ones, with the same Subject, but different Issuer and Expiry. If the earlier version of the R3 intermediate, signed by the DST root CA and expiring on Sep 30, was still reachable by OTP when it tried to construct the chain of certificates to send to clients, it might have selected it over the newer one returned in the chain of the latest certificate renewal.
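A toy model of that theory, to show how a same-Subject remnant could win. This is not OTP's actual code: certificates are plain maps, the "first match wins" rule is an assumption, and the names and dates are only illustrative.

```elixir
defmodule ChainSketch do
  # Toy model: a cert is a map with :subject, :issuer and :not_after.
  # Follows issuer -> subject links and takes the FIRST candidate whose
  # subject matches (an assumption about selection order, not OTP's code).
  def build(leaf, candidates), do: do_build(leaf, candidates, [leaf])

  # Stop at a self-signed certificate (subject equals issuer).
  defp do_build(%{subject: name, issuer: name}, _candidates, acc) do
    Enum.reverse(acc)
  end

  defp do_build(cert, candidates, acc) do
    case Enum.find(candidates, &(&1.subject == cert.issuer)) do
      # Issuer not found: the chain ends here.
      nil -> Enum.reverse(acc)
      issuer -> do_build(issuer, candidates -- [issuer], [issuer | acc])
    end
  end
end

# Illustrative data: two R3 intermediates with the same Subject.
leaf = %{subject: "theerlangelist.com", issuer: "R3", not_after: ~D[2021-12-31]}
old_r3 = %{subject: "R3", issuer: "DST Root CA X3", not_after: ~D[2021-09-30]}
new_r3 = %{subject: "R3", issuer: "ISRG Root X1", not_after: ~D[2025-09-15]}

# If the remnant is found first, the chain carries the expired R3.
stale = ChainSketch.build(leaf, [old_r3, new_r3])
fresh = ChainSketch.build(leaf, [new_r3, old_r3])

IO.inspect(Enum.map(stale, & &1.not_after))
IO.inspect(Enum.map(fresh, & &1.not_after))
```

In this model nothing but lookup order distinguishes the two R3 entries, which is why a leftover copy would be enough to produce the symptom.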

Again, the code all looks ok, and the PEM cache does get purged during renewal. But this theory seems to me more plausible than Let's Encrypt sending an expired intermediate: others would have noticed that and reported it.

voltone commented 2 years ago

BTW, you said only some clients were reporting issues: a potential explanation for that would be the AIA extension. Many modern clients, in particular web browsers, will fetch intermediate CAs from the URLs specified in that extension, when the server is not sending them or (in this case) is sending outdated copies

sasa1977 commented 2 years ago

I've been looking through the code and I can't pinpoint any particular part where things might have broken, but I do suspect the issue wasn't Let's Encrypt returning an expired intermediate, but rather OTP selecting a remnant of a previous chain.

Yeah, I'm starting to suspect the same thing. I have the previous cert + chain, and it all looks good there. The funny thing is that after renewal the new cert has been used, but it seemed there was something wrong in the chain. So it looks like perhaps :ssl.clear_pem_cache doesn't clear intermediates?

I'm sorry I didn't try a plain restart before going for the full account recreate. As I said, I have the previous cert stored, so I could try restarting the system with that. If everything works fine it should prove that the problem was not in cert/chain provided by LetsEncrypt, but rather something that was cached in the BEAM instance. I don't have the time to do this immediately (probably won't make it today), but I'll give it a try and report back.

BTW, you said only some clients were reporting issues: a potential explanation for that would be the AIA extension. Many modern clients, in particular web browsers, will fetch intermediate CAs from the URLs specified in that extension, when the server is not sending them or (in this case) is sending outdated copies

Interesting observation! Here's the original issue, though I can't tell which browser has been used. On my machine everything worked fine with Firefox and Chromium. I was able to reproduce the error with curl, and on Android with the Dolphin browser.

sasa1977 commented 2 years ago

after renewal the new cert has been used

To clarify, I mean here that I did a force cert renew without restart, and the new cert has been immediately used, but some clients still reported R3 expiry.

voltone commented 2 years ago

So it looks like perhaps :ssl.clear_pem_cache doesn't clear intermediates?

It clears the cache that's used to avoid reading and parsing PEM files from disk every time they are referenced, but yes, I suspect there is another layer of caching for the server cert's chain. Which makes sense: you wouldn't want to have to reconstruct the chain again and again, even if the PEM file cache saves you the round-trip to disk and the DER parsing.

I think restarting the Phoenix Endpoint would have resolved the problem.

When I have some more time I will try to reproduce the problem with a small minimal setup, without all the cert renewal business, and then I'll see if I can trace it through :ssl_certificate and :ssl_pkix_db.

On my machine everything worked fine with Firefox and Chromium. I was able to reproduce the error with curl, and on Android with the Dolphin browser.

That makes sense: curl wouldn't use AIA, and Chromium definitely does. Not sure about FF, but I guess it does too.

sasa1977 commented 2 years ago

When I have some more time I will try to reproduce the problem with a small minimal setup, without all the cert renewal business, and then I'll see if I can trace it through :ssl_certificate and :ssl_pkix_db.

There is already a test which verifies cert renewal. However it currently doesn't compare the intermediate cert. I think that this test could be adapted to reproduce the problem. If I find the time I'll try to do it myself.

I think restarting the Phoenix Endpoint would have resolved the problem.

Yeah, that's what site_encrypt did initially, but I moved away from it because I wanted to avoid disruption on cert renewal. I guess I could return to restart, although I'm not completely happy. One possible mitigation would be to restart the endpoint only if the chain has been changed since the last time. WDYT?
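A rough sketch of that mitigation, assuming the old and new chains are available as lists of DER binaries with the server certificate first. The module name and return atoms are made up for illustration and are not site_encrypt's API.

```elixir
defmodule RenewalPolicy do
  # Hypothetical decision helper. Each chain is [leaf | intermediates].
  # The leaf changes on every renewal, so only the rest is compared.
  def after_renewal([_old_leaf | old_rest], [_new_leaf | new_rest]) do
    if old_rest == new_rest do
      # Same intermediates: the non-disruptive path should be enough.
      :clear_pem_cache
    else
      # Intermediates changed: restart so nothing stale can linger.
      :restart_endpoint
    end
  end
end

# Placeholder binaries stand in for real DER data.
IO.inspect(RenewalPolicy.after_renewal(["leaf-1", "r3-a"], ["leaf-2", "r3-a"]))
IO.inspect(RenewalPolicy.after_renewal(["leaf-1", "r3-a"], ["leaf-2", "r3-b"]))
```

The caller would then either call :ssl.clear_pem_cache() or restart the endpoint, so the disruptive path is only taken on the rare renewals where the intermediates actually differ.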

voltone commented 2 years ago

There is already a test which verifies cert renewal. However it currently doesn't compare the intermediate cert. I think that this test could be adapted to reproduce the problem. If I find the time I'll try to do it myself.

We have until September 2025 to prepare for the next intermediate certificate change :)

Yeah, that's what site_encrypt did initially, but I moved away from it because I wanted to avoid disruption on cert renewal. I guess I could return to restart, although I'm not completely happy. One possible mitigation would be to restart the endpoint only if the chain has been changed since the last time. WDYT?

That's one option. Other options include:

I want to try and figure out exactly how the chain caching works, maybe there is an easier way to force ssl to rebuild the chain...