Monitor `forum.spinalcordmri.org` email notification service and better detect email failures

spinalcordmri / spinalcordmri.github.io

Web site of spinalcordmri organization.

https://spinalcordmri.github.io/

Monitor `forum.spinalcordmri.org` email notification service and better detect email failures #79

Open joshuacwnewton opened 1 year ago

joshuacwnewton commented 1 year ago

We use Discourse as our forum software, and set up our own custom mail server within the DigitalOcean VM.

Starting on March 19th, 2023, we experienced an outage of outbound emails that wasn't noticed until May 11th.

Thankfully, the outage was easy to fix, as it was caused by an expired SSL cert. The concern here is how long it took to catch the outage. Here are some of the factors involved:

- Both Julien and I have the "mailing list" mode turned on, which alerts us to every new post on the forum.
- These "mailing list" notifications are mirrored to the Slack channel #sct_forum_updates.
- We receive ~0-2 new forum posts weekly. Receiving 0 post notifications over 2 months should have raised suspicions. (But I imagine it's difficult to notice an absence of something. To me, it felt as though the forum was just in a quiet period.)
- The admin dashboard at the time listed a warning about the email job failures. (But we have no protocol for regularly checking the admin dashboard. I tend to check it only when I visit the forum, and with notifications down, I had no reason to visit the forum!)

So, to better address this in the future, we would need to find some way to monitor emails and detect failures:

- Automated: e.g. have some sort of canary email that gets sent out daily, so that when emails fail, it's more noticeable/expected.
- Manual: e.g. have some sort of assigned task/automated reminder for an admin to check the admin dashboard on a daily/weekly schedule.

namgo commented 1 year ago

I'm following up on this as part of my task list. I'm hesitant to suggest the manual option, since it adds workload that IMO should be automatable.

Is there a reason you're sending outbound emails directly from the droplet? In the past when I've set up Discourse, I used a bulk mail provider and this would get around the possibility of mail getting dropped.

Regardless, I can set up a weekly canary email so it doesn't get overwhelming via cronjob with sendmail, would this help?
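A minimal sketch of what such a canary could look like, assuming a sendmail-compatible binary is on the PATH and that the message is piped to `sendmail -t` (which reads recipients from the headers). The function name `build_canary`, the script path, and both addresses are placeholders, not anything from the actual server.

```shell
#!/bin/sh
# Build a minimal canary message on stdout.
# $1 = recipient address, $2 = sender address (both chosen by the caller).
build_canary() {
    printf 'To: %s\n' "$1"
    printf 'From: %s\n' "$2"
    printf 'Subject: [canary] forum mail check %s\n' "$(date -u +%Y-%m-%d)"
    printf '\n'   # blank line separates headers from body
    printf 'Weekly canary from the forum mail server.\n'
    printf 'If this message stops arriving, check the outbound mail logs.\n'
}

# Hypothetical usage, e.g. as the last line of /usr/local/bin/canary.sh:
#   build_canary admin@example.org forum@example.org | sendmail -t
#
# Hypothetical root crontab entry (Mondays at 09:00):
#   0 9 * * 1  /usr/local/bin/canary.sh
```

Note that a canary sent this way only exercises the local MTA's outbound path, so it may not catch a failure between Discourse and the mail server.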

joshuacwnewton commented 1 year ago

> Is there a reason you're sending outbound emails directly from the droplet? In the past when I've set up Discourse, I used a bulk mail provider and this would get around the possibility of mail getting dropped.

cc'ing @kousu who I believe set up the mail server originally.

> Regardless, I can set up a weekly canary email so it doesn't get overwhelming via cronjob with sendmail, would this help?

That would be lovely! :)

kousu commented 1 year ago

> Is there a reason you're sending outbound emails directly from the droplet?

We were using SendGrid (it was SendGrid, right?) and they started dropping mail -- I think we forgot to renew the credit card. I figured DigitalOcean allows outgoing SMTP and we're already paying for it, so I used that. It's one less point of failure. Plus then we're not rewarding protection rackets.

I watched the mail logs for a few weeks after and only a few mails were getting delayed or dropped and I was able to fix up each problem.

SendGrid doesn't guarantee delivery! Email can still get flagged as spam. The sending IP is just one signal into the spam filters. At this point, our IP reputation should be as good as SendGrid or any other SMTP hoster -- though I don't know what dashboard I could look at to confirm that.

> weekly canary email

that's a good idea!

here's another: a cronjob to grep the maillog for codes 250 and 5*, and make a histogram of them (`| sort | uniq -c | sort -n`). If the 500s spike or the 200s drop, we investigate.
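A rough sketch of that histogram, assuming the mail log records each delivery's SMTP reply in a `stat="NNN ..."` field (as OpenSMTPD's delivery lines usually do; adjust the pattern if the log format differs). The function name `smtp_code_histogram` and the log path in the usage comment are placeholders.

```shell
#!/bin/sh
# Print "count code" pairs for every SMTP reply code found in the given
# log files, least frequent first.
smtp_code_histogram() {
    # Keep only the three-digit code that follows stat=" on matching lines.
    sed -n 's/.*stat="\([0-9][0-9][0-9]\).*/\1/p' "$@" \
        | sort | uniq -c | sort -n
}

# Hypothetical usage:
#   smtp_code_histogram /var/log/maillog
```

`sed` is used here rather than `grep` so that only the code itself is extracted and lines without a `stat=` field are skipped.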

namgo commented 1 year ago

@kousu How would we access that histogram?

kousu commented 1 year ago

If it's in a cronjob run by root, and you have an email address in `~root/.forward`, then the stdout of that cronjob will be mailed to that address.
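Since cron only sends mail when a job produces output, one option is a check that stays silent unless something looks wrong. The sketch below is an assumption-laden illustration: it again assumes a `stat="NNN ..."` field in the log, and the function name `mail_health_report`, the threshold, and the log path are all hypothetical.

```shell
#!/bin/sh
# Print warnings only when the log looks unhealthy; print nothing otherwise.
# $1 = maximum tolerated number of 5xx replies, remaining args = log files.
mail_health_report() {
    max_5xx=$1
    shift
    # Count successful (250) and permanently failed (5xx) replies.
    ok=$(sed -n '/stat="250 /p' "$@" | wc -l | tr -d ' ')
    bad=$(sed -n '/stat="5[0-9][0-9] /p' "$@" | wc -l | tr -d ' ')
    if [ "$ok" -eq 0 ]; then
        echo "WARNING: no 250 replies logged"
    fi
    if [ "$bad" -gt "$max_5xx" ]; then
        echo "WARNING: $bad SMTP 5xx replies (threshold $max_5xx)"
    fi
    return 0
}

# Hypothetical usage from root's crontab:
#   mail_health_report 5 /var/log/maillog
```

Combined with a weekly canary, a week with zero 250 replies would itself be a signal, even on a quiet forum.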

joshuacwnewton commented 1 year ago

Emails have started failing once more...

The issue is SSL certificate related once more:

The steps that I used to fix this problem previously are:

- Clear out space prior to rebuilding the Discourse app:
  - `sudo journalctl --vacuum-size=100M`
  - Delete the 3 oldest backups in the Discourse UI.
  - Run `/var/discourse/launcher cleanup`, answering `y` to both prompts.
  - Ensure `df -h` shows >7GB of free space on the 25GB `/dev/vda1` drive.
- Follow the steps at https://meta.discourse.org/t/how-to-force-letsencrypt-cert-renewal/143223
- Restart the OpenSMTPD process (`sudo systemctl restart opensmtpd`).
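The free-space check in the steps above could be scripted roughly as follows, so a rebuild is not started on a nearly full disk. This is only a sketch: the function names `free_gb` and `has_free_gb` are hypothetical, and the value is truncated to whole gigabytes, so the comparison is approximate.

```shell
#!/bin/sh
# Print the free space, in whole GB, of the filesystem containing $1.
free_gb() {
    # -P forces one line per filesystem; column 4 is available 1K blocks.
    df -Pk "$1" | awk 'NR==2 {printf "%d\n", $4/1048576}'
}

# Succeed if the filesystem containing $1 has at least $2 GB free.
has_free_gb() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Hypothetical usage before a rebuild:
#   has_free_gb / 7 || echo "Not enough free space for a rebuild"
```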

Working on this now.

Still, it looks like we'll need to get to the bottom of why the SSL certificates aren't auto-renewing, when they absolutely should be.
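Until the auto-renewal problem is understood, a scheduled expiry check could at least give advance warning. The sketch below relies on `openssl x509 -checkend`, which exits non-zero when a certificate expires within the given number of seconds. The function name `cert_expiring_soon` and the certificate path are placeholders; where the certificate actually lives on the server is not stated in this thread.

```shell
#!/bin/sh
# Succeed (exit 0) if the certificate in $1 expires within $2 days.
# A certificate that cannot be read is also reported as expiring, which
# is the safer default for a monitoring check.
cert_expiring_soon() {
    cert=$1
    days=$2
    if openssl x509 -checkend "$((days * 86400))" -noout -in "$cert" \
        >/dev/null 2>&1; then
        return 1   # still valid beyond the window
    fi
    return 0       # expires within the window, or could not be checked
}

# Hypothetical usage from cron (output is mailed only when a warning prints):
#   cert_expiring_soon /path/to/cert.pem 14 \
#       && echo "WARNING: certificate expires within 14 days"
```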

Potentially related:

joshuacwnewton commented 1 year ago

Email service has been restored. The steps above worked like a charm. :)