spinalcordmri / spinalcordmri.github.io

Web site of spinalcordmri organization.
https://spinalcordmri.github.io/

Monitor `forum.spinalcordmri.org` email notification service and better detect email failures #79

Open joshuacwnewton opened 1 year ago

joshuacwnewton commented 1 year ago

We use Discourse as our forum software, and set up our own custom mail server within the DigitalOcean VM.

Starting on March 19th, 2023, we experienced an outage of outbound emails that wasn't noticed until May 11th.

Thankfully, the outage was easy to fix, as it was caused by an expired SSL cert. The concern here is how long it took for the outage to be caught: nearly two months. Here are some of the factors involved in the outage:

So, to better address this in the future, we would need to find some way to monitor emails and detect failures:

namgo commented 1 year ago

I'm following up on this as part of my task list. I'm hesitant to suggest the manual option, since it adds workload that IMO should be automatable.

Is there a reason you're sending outbound emails directly from the droplet? In the past when I've set up Discourse, I used a bulk mail provider and this would get around the possibility of mail getting dropped.

Regardless, I can set up a canary email via a cronjob with sendmail, sent weekly so it doesn't get overwhelming. Would this help?
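For reference, a minimal sketch of what such a canary could look like in Python. It uses the standard library's `smtplib` in place of the `sendmail` CLI, and assumes the droplet's mail server accepts local submissions on `localhost:25`. The addresses and the function names (`build_canary`, `send_canary`) are hypothetical.

```python
import smtplib
from datetime import datetime, timezone
from email.message import EmailMessage
from typing import Optional


def build_canary(sender: str, recipient: str,
                 now: Optional[datetime] = None) -> EmailMessage:
    """Build a dated canary message so a missing week is easy to spot."""
    now = now or datetime.now(timezone.utc)
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[canary] forum mail check {now:%Y-%m-%d}"
    msg.set_content(
        "Weekly canary. If this stops arriving, outbound mail may be down."
    )
    return msg


def send_canary(msg: EmailMessage, host: str = "localhost",
                port: int = 25) -> None:
    """Hand the message to the local mail server (assumed host and port)."""
    with smtplib.SMTP(host, port, timeout=30) as smtp:
        smtp.send_message(msg)
```

A weekly cron entry could then run a small script that calls `send_canary(build_canary("forum@example.org", "admin@example.org"))`. Note that a canary only helps if someone notices when it stops arriving.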

joshuacwnewton commented 1 year ago

> Is there a reason you're sending outbound emails directly from the droplet? In the past when I've set up Discourse, I used a bulk mail provider and this would get around the possibility of mail getting dropped.

cc'ing @kousu who I believe set up the mail server originally.

> Regardless, I can set up a canary email via a cronjob with sendmail, sent weekly so it doesn't get overwhelming. Would this help?

That would be lovely! :)

kousu commented 1 year ago

> Is there a reason you're sending outbound emails directly from the droplet?

We were using SendGrid (it was SendGrid, right?) and they started dropping mail -- I think we forgot to renew the credit card. I figured DigitalOcean allows outgoing SMTP and we're already paying for it, so I used that. It's one less point of failure. Plus, then we're not rewarding protection rackets.

I watched the mail logs for a few weeks after and only a few mails were getting delayed or dropped and I was able to fix up each problem.

SendGrid doesn't guarantee delivery! Email can still get flagged as spam. The sending IP is just one signal into the spam filters. At this point, our IP reputation should be as good as SendGrid's or any other SMTP host's -- though I don't know what dashboard I could look at to confirm that.

> weekly canary email

that's a good idea!

here's another: a cronjob to grep the maillog for codes 250 and 5*, and make a histogram of them (`| sort | uniq -c | sort -n`). If the 500s spike or the 200s drop we investigate.
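The same idea could be sketched in Python rather than a shell pipeline. This assumes Postfix-style log lines where the remote reply appears as a three-digit SMTP code followed by an enhanced status code (e.g. `250 2.0.0` or `550 5.1.1`); the actual log format on the droplet may differ, and `reply_code_histogram` is a hypothetical name.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed pattern: SMTP reply code, then an enhanced status code, e.g. "550 5.1.1".
REPLY = re.compile(r"\b([245]\d{2})[ -][245]\.\d{1,3}\.\d{1,3}\b")


def reply_code_histogram(lines: Iterable[str]) -> Counter:
    """Count the first SMTP reply code found on each log line."""
    counts: Counter = Counter()
    for line in lines:
        match = REPLY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def format_histogram(counts: Counter) -> str:
    """Render counts in ascending order, like `sort | uniq -c | sort -n`."""
    rows = sorted(counts.items(), key=lambda item: item[1])
    return "\n".join(f"{n:7d} {code}" for code, n in rows)
```

Printing `format_histogram(reply_code_histogram(open(path)))` from a cronjob would give one line per code, which makes a spike in 5xx codes or a drop in 250s easy to eyeball.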

namgo commented 1 year ago

@kousu How would we access that histogram?

kousu commented 1 year ago

If it's in a cronjob run by root, and you have an email address in `~root/.forward`, then the stdout of that cronjob will go to that address.

joshuacwnewton commented 1 year ago

Emails have started failing once more...

[image]

The issue is SSL certificate related once more:

[image]

The steps that I used to fix this problem previously are:

Working on this now.


Still, it looks like we'll need to get to the bottom of why the SSL certificates aren't auto-renewing, when they absolutely should be.
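Since both outages were certificate related, one option is a scheduled check that warns before a certificate expires. A rough sketch, assuming the certificate is served on an implicit-TLS port; the host, port, and any alert threshold are placeholders rather than the actual server config, and `fetch_not_after` and `days_until_expiry` are hypothetical helper names.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the certificate's notAfter string, e.g. 'May 11 00:00:00 2023 GMT'.

    With the default context, an already-expired certificate raises
    ssl.SSLCertVerificationError during the handshake, which is itself
    a signal worth alerting on.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, default: current time) and expiry."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0
```

A cronjob could call `days_until_expiry(fetch_not_after(host, port))` and print a warning when the result falls below some threshold (say, 14 days), so the output lands in root's mail well before the certificate lapses.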

Potentially related:

joshuacwnewton commented 1 year ago

Email service has been restored. The steps above worked like a charm. :)