Closed joshuacwnewton closed 10 months ago
From what I can tell from the documentation, Discourse uses `acme.sh` for cert renewal (as opposed to, say, `certbot`). Plus, setting up Discourse should automatically enable a cron job that will run `acme.sh`:
At the same time, it adds a cron job that runs a daily cert renewal check. This will automatically renew your cert. Nothing happens if cert has not expired. If the certificate does expire, you’ll get an email about it from Let’s Encrypt at the email address you provided during setup.
I checked our `app.yml` config file, and everything appears to be set up correctly for LetsEncrypt/SSL:
## Uncomment these two lines if you wish to add Lets Encrypt (https)
- "templates/web.ssl.template.yml"
- "templates/web.letsencrypt.ssl.template.yml"
## which TCP/IP ports should this container expose?
## If you want Discourse to share a port with another webserver like Apache or nginx,
## see https://meta.discourse.org/t/17247 for details
expose:
- "80:80" # http
- "443:443" # https
[...]
## If you added the Lets Encrypt template, uncomment below to get a free SSL certificate
LETSENCRYPT_ACCOUNT_EMAIL: neuropoly-admin@liste.polymtl.ca
Yet, we've never received an email at "neuropoly-admin@liste.polymtl.ca" about a LetsEncrypt expiry. Very curious!
Digging deeper into the `acme.sh` logs:
root@forum:~# cd /var/discourse
root@forum:/var/discourse# ./launcher enter app
root@forum-app:/var/www/discourse# cd /shared/letsencrypt
root@forum-app:/shared/letsencrypt# cat acme.sh.log
[...]
[Wed Sep 6 00:20:12 UTC 2023] Skip, Next renewal time is: 2023-09-07T00:03:16Z
[Wed Sep 6 00:20:12 UTC 2023] Add '--force' to force to renew.
[Wed Sep 6 00:20:12 UTC 2023] Return code: 2
[Wed Sep 6 00:20:12 UTC 2023] Skipped forum.spinalcordmri.org
So, the cron job is running, and the renewal check is being skipped (as it should)...
If I grep for `renew`, then scroll back far enough, I see the following:
root@forum-app:/shared/letsencrypt# cat acme.sh.log | grep renew
[Sun Jul 9 00:03:07 UTC 2023] Skip, Next renewal time is: 2023-07-09T18:30:32Z
[Sun Jul 9 00:03:07 UTC 2023] Add '--force' to force to renew.
[Sun Jul 9 00:03:07 UTC 2023] _renewServer
[Sun Jul 9 00:03:07 UTC 2023] Skip, Next renewal time is: 2023-07-09T18:30:36Z
[Sun Jul 9 00:03:07 UTC 2023] Add '--force' to force to renew.
[Mon Jul 10 00:03:01 UTC 2023] _renewServer
[Mon Jul 10 00:03:06 UTC 2023] Error renew devforum.spinalcordmri.org.
[Mon Jul 10 00:03:06 UTC 2023] _renewServer
[Mon Jul 10 00:03:10 UTC 2023] Error renew devforum.spinalcordmri.org_ecc.
[Mon Jul 10 00:03:10 UTC 2023] _renewServer
[Mon Jul 10 00:03:16 UTC 2023] _renewServer
[Tue Jul 11 00:03:02 UTC 2023] _renewServer
[Tue Jul 11 00:03:06 UTC 2023] Error renew devforum.spinalcordmri.org.
[Tue Jul 11 00:03:06 UTC 2023] _renewServer
[Tue Jul 11 00:03:11 UTC 2023] Error renew devforum.spinalcordmri.org_ecc.
[Tue Jul 11 00:03:11 UTC 2023] _renewServer
[Tue Jul 11 00:03:11 UTC 2023] Skip, Next renewal time is: 2023-09-07T00:03:16Z
[Tue Jul 11 00:03:11 UTC 2023] Add '--force' to force to renew.
[Tue Jul 11 00:03:11 UTC 2023] _renewServer
[Tue Jul 11 00:03:11 UTC 2023] Skip, Next renewal time is: 2023-09-07T00:03:19Z
[Tue Jul 11 00:03:11 UTC 2023] Add '--force' to force to renew.
I think we can ignore the errors here (since they relate to the temporarily-named `devforum` certs that were added when first creating the forum instance in March).
Besides that, the other certificates seem to have successfully renewed on July 9th, given that the renewal times were updated to September 7th.
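To make the "renewal times were updated" observation easier to repeat, here is a minimal sketch that pulls the distinct "Next renewal time" values out of log lines. It assumes the line format seen in the excerpt above; the function name `next_renewal_times` and the `sample` list are illustrative, not part of `acme.sh`.

```python
import re
from datetime import datetime, timezone

# Matches lines such as:
# [Tue Jul 11 00:03:11 UTC 2023] Skip, Next renewal time is: 2023-09-07T00:03:16Z
SKIP_LINE = re.compile(
    r"^\[(?P<logged>[^\]]+)\] Skip, Next renewal time is: (?P<next>\S+)\s*$"
)


def next_renewal_times(log_lines):
    """Return the distinct 'Next renewal time' values, in first-seen order."""
    seen = []
    for line in log_lines:
        match = SKIP_LINE.match(line)
        if not match:
            continue  # not a "Skip" line (e.g. an error or _renewServer entry)
        when = datetime.strptime(match["next"], "%Y-%m-%dT%H:%M:%SZ")
        when = when.replace(tzinfo=timezone.utc)  # trailing "Z" means UTC
        if when not in seen:
            seen.append(when)
    return seen


# Three lines copied from the log excerpt above.
sample = [
    "[Sun Jul 9 00:03:07 UTC 2023] Skip, Next renewal time is: 2023-07-09T18:30:32Z",
    "[Mon Jul 10 00:03:06 UTC 2023] Error renew devforum.spinalcordmri.org.",
    "[Tue Jul 11 00:03:11 UTC 2023] Skip, Next renewal time is: 2023-09-07T00:03:16Z",
]

for when in next_renewal_times(sample):
    print(when.isoformat())
```

A jump between consecutive values (here, from July 9th to September 7th) is what suggests that a renewal happened in between.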
That said, these dates don't line up at all with the expiry dates and forum outages in the past (July+September vs. May+August+October).
Here is my theory: what if the certificates are renewing just fine (every 60 days, i.e. 30 days before the 90-day expiry date), but the renewed certs aren't being loaded by OpenSMTPD? As far as the timeline goes, the upcoming renewal date (September 7th) is 30 days before the upcoming expiry date (October 7th). Both of these dates were set on July 9th, which is well before the most recent outage occurred.
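The date arithmetic behind this theory can be checked with a short sketch. It assumes the 90-day validity and 30-days-before-expiry renewal described above, and works at whole-day granularity; the `schedule` helper is a hypothetical name used only for illustration.

```python
from datetime import date, timedelta

VALIDITY_DAYS = 90      # certificate lifetime, as discussed above
RENEW_BEFORE_DAYS = 30  # renewal is attempted this many days before expiry


def schedule(issued: date):
    """Return (renewal_date, expiry_date) for a cert issued on `issued`."""
    expiry = issued + timedelta(days=VALIDITY_DAYS)
    renewal = expiry - timedelta(days=RENEW_BEFORE_DAYS)
    return renewal, expiry


renewal, expiry = schedule(date(2023, 7, 9))
print("renewal:", renewal)
print("expiry: ", expiry)
```

For a cert issued on July 9th this gives a renewal date of September 7th and an expiry date of October 7th, matching the dates in the log.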
My thinking here is:
This doesn't completely explain the mismatched timeline in https://github.com/spinalcordmri/spinalcordmri.github.io/issues/83#issue-1884543083, but it's a hypothesis we can test, at least:
- Given that the cert is set to renew tomorrow, I can check in on how the renewal goes (and whether the dates on the cert change on disk).
Yep! I checked `acme.sh.log` and the certs were downloaded successfully. Then, I checked the new expiry dates, and got:
root@forum:/var/discourse# cd /var/discourse/shared/standalone/ssl
root@forum:/var/discourse/shared/standalone/ssl# openssl x509 -enddate -noout -in forum.spinalcordmri.org.cer
notAfter=Dec 5 23:20:15 2023 GMT
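As a rough cross-check, the `notAfter` line can be parsed and walked back 90 days to estimate when the cert on disk was issued. This is only a sketch: it assumes a 90-day lifetime and the `notAfter=...` format shown above, and `parse_not_after` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def parse_not_after(line: str) -> datetime:
    """Parse a line like 'notAfter=Dec 5 23:20:15 2023 GMT' into a UTC datetime."""
    value = line.strip().split("=", 1)[1]
    parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


not_after = parse_not_after("notAfter=Dec 5 23:20:15 2023 GMT")
approx_issued = not_after - timedelta(days=90)  # assumes a 90-day lifetime

print("expires:          ", not_after.isoformat())
print("issued (approx.): ", approx_issued.isoformat())
```

The estimated issue time falls late on September 6th (UTC), right next to the scheduled renewal time of 2023-09-07T00:03:16Z from the log, which is consistent with the cert on disk being the freshly renewed one.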
I expect the forum to experience an outage sometime around the old expiry date (October 7th). I'm going to set a calendar event for this, and watch the site like a hawk. Then, if/when the outage occurs, I'm going to simply restart the OpenSMTPD service. If that fixes the issue, then I'll automate the service-restarting, and the outages should go away entirely.
Well, what do you know! The emails have begun failing at the predicted time:
I tried my predicted solution of `sudo systemctl restart opensmtpd`, and hey, what do you know! Emails are sending again.
Now we just need to automate this and we should be good to go. :)
amazing! thank you @joshuacwnewton 😊
I've set up a `cron` job to reload the certificates on the 1st of every month. This should fix the problem, but I'll be watching the forum like a hawk when the next expiry date is coming up (Dec 5 23:20:15 2023 GMT) to make sure.
crontab -e # nb: ssh logins to the forum server use `root`, so sudo not necessary here
# then, enter in: '01 01 01 * * systemctl restart opensmtpd'
One quirk that I tried to reason out was: Cron jobs work on a monthly basis ("Day of month"), while automatic cert renewal happens 30 days before the expiry date. This means that the new expiry date drifts a little bit each time (Oct 7 -> Dec 5 -> Feb 3).
So, I was thinking, "if we're reloading the certs on the first of each month, could this ever line up in a way that wouldn't work?" But, because the renewed expiry dates differ by 60 days (the 90-day validity period minus the 30-days-before-expiry renewal), reloading certs once a month should always keep things up to date.
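One way to sanity-check this reasoning is to ask, for a given expiry date, whether a 1st-of-the-month restart falls inside the 30-day window between renewal and the old cert's expiry. The sketch below works at whole-day granularity and makes two assumptions that are not confirmed by the thread: a restart on the renewal day itself counts, and a restart on the expiry day comes too late. It ignores the time of day and the server's time zone. The helper name `restart_before_expiry` is hypothetical.

```python
from datetime import date, timedelta

RENEW_BEFORE_DAYS = 30  # renewal happens this many days before the old cert expires


def restart_before_expiry(old_expiry: date):
    """Return the first 1st-of-month that is on or after the renewal day and
    strictly before `old_expiry`, or None if the window contains no such day."""
    renewal = old_expiry - timedelta(days=RENEW_BEFORE_DAYS)
    for offset in range(RENEW_BEFORE_DAYS):
        day = renewal + timedelta(days=offset)
        if day.day == 1:
            return day
    return None


# Expiry dates mentioned in this thread (Oct 7 -> Dec 5 -> Feb 3),
# plus one hypothetical date (Feb 1) chosen to probe the edge of the window.
for old_expiry in (date(2023, 10, 7), date(2023, 12, 5), date(2024, 2, 3), date(2024, 2, 1)):
    print(old_expiry, "->", restart_before_expiry(old_expiry))
```

Under these assumptions, each of the three expiry dates from the thread gets a restart in time. The hypothetical February 1st expiry does not: its window runs from January 2nd to January 31st, which contains no 1st of the month. So the monthly schedule looks safe for the dates at hand, but "always" may have a narrow edge case when a renewal lands just after the 1st of a 31-day month.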
This issue is tangential to https://github.com/spinalcordmri/spinalcordmri.github.io/issues/79. That issue is about monitoring email outages of any kind. However, to mitigate outages specifically related to SSL certificate renewal, we should get to the bottom of why the SSL certificate failed to auto-renew in the first place.
1st outage
The expiry date for the very first cert (first email outage) was:
This seems to line up with when emails began getting dropped:
This outage was caught and fixed on May 11th, 2023.
2nd outage
After renewal, the expiry date for the new cert (i.e. second email outage) was:
This doesn't quite make sense to me, as LetsEncrypt SSL certs should last for 90 days. May 11th + 90 days = August 9th, which is also when the forum began dropping emails a second time:
The second outage was caught and fixed on August 18th, 2023.
Next outage
If we check the current cert, we see:
This is again a bit strange, since August 18th + 90 days = November 16th instead. Still, even if the auto-renewal issues are fixed, we should keep an eye on these various dates, and perhaps set up some sort of reminder.