jezzsantos / saastack

A comprehensive codebase template for starting your real-world, fully featured SaaS web products. On the .NET platform
The Unlicense

Email Delivery Reliability #21

Open jezzsantos opened 6 months ago

jezzsantos commented 6 months ago

Present Day

At the moment the EmailNotificationService packages up the request to deliver an email and drops it on the "emails" queue, to be dealt with asynchronously.

An AzureFunction/Lambda is triggered, picks up the message, and sends it to the Ancillary API. The message is then sent to a 3rd party service (e.g. MailGun, SendGrid, etc.). During this delivery step, delivery and network problems are handled with retries and backoffs (3 retries with exponential jittered backoff). If delivery still fails after those retries, the AzureFunction/Lambda will retry several times over the course of the next few minutes (5 times is the default). If the message is still not delivered (i.e. the API call does not return HTTP-200), then the AzureFunction/Lambda will place the message reliably on the poison queue. Alerts should be raised, and a manual process must be used to resolve the failure.
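To make the "3 retries with exponential jittered backoff" step concrete, here is a minimal sketch in Python. It is not the SaaStack implementation: the function names, the base delay, the cap, and the choice of "full jitter" (a random delay between zero and the exponential ceiling) are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, List, Optional


def backoff_delays(retries: int = 3, base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one delay per retry, using full jitter.

    The ceiling doubles with each attempt (base, 2*base, 4*base, ...) up to
    `cap`, and the actual delay is drawn uniformly between 0 and that ceiling.
    `base` and `cap` are assumed values, not taken from the codebase.
    """
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** attempt))
            for attempt in range(retries)]


def send_with_retries(send: Callable[[], bool], retries: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `send` once, then up to `retries` more times with backoff.

    `send` is a hypothetical callable that returns True when the 3rd party
    accepted the email. Returns False when every attempt failed, leaving the
    caller (e.g. the queue trigger) to decide whether to retry later.
    """
    if send():
        return True
    for delay in backoff_delays(retries):
        sleep(delay)
        if send():
            return True
    return False
```

Injecting `sleep` and `rng` keeps the sketch testable without real waiting. The outer layer of retries described above (the trigger re-running the message before moving it to the poison queue) would sit around a call like `send_with_retries`.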

Problem

Email delivery is a business-critical function, and even though we have a reliable asynchronous mechanism in place right now, there is little data tracking the whole process. It is possible that the queued message is lost in the process (e.g. deleted from the queue by an operator, or when limits are inadvertently reached). When this happens the system has no record of the email, and in a production support scenario the loss will be hard (though not impossible) to detect or resolve.

It is possible to track the email from its inception, and through the asynchronous delivery process, given its unique MessageId.

To do this better, we would need to capture the following events:

  1. When the email was scheduled for delivery, before it appears on the queue.
  2. When the email was picked off the queue and an attempt was made to deliver it.
  3. Whether each delivery attempt failed, and when.
  4. When the delivery succeeds.
  5. Later, when we hear back (via webhook) from the 3rd party about the status of the email delivery, as it can still fail at the 3rd party (e.g. blocked email domains).

All these events should be captured in the backend API in the Ancillary domain.
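As a rough illustration of the five events listed above, the following Python sketch keeps a per-message history keyed by MessageId. Everything in it is hypothetical (the `DeliveryEvent` names, the `DeliveryLog` class, and the in-memory dictionary standing in for real persistence); it only shows how recording the "scheduled" event first makes lost messages detectable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List, Optional, Tuple


class DeliveryEvent(Enum):
    SCHEDULED = "scheduled"            # 1. recorded before the message is queued
    ATTEMPTED = "attempted"            # 2. picked off the queue, delivery attempted
    ATTEMPT_FAILED = "attempt_failed"  # 3. a delivery attempt failed
    SENT = "sent"                      # 4. the 3rd party accepted the message
    CONFIRMED = "confirmed"            # 5. webhook reported successful delivery
    REJECTED = "rejected"              # 5. webhook reported a failure


@dataclass
class EmailDeliveryRecord:
    message_id: str
    events: List[Tuple[DeliveryEvent, datetime]] = field(default_factory=list)

    @property
    def status(self) -> Optional[DeliveryEvent]:
        """The most recent event, or None if nothing was recorded yet."""
        return self.events[-1][0] if self.events else None


class DeliveryLog:
    """In-memory stand-in for whatever store the Ancillary domain would use."""

    def __init__(self) -> None:
        self._records: Dict[str, EmailDeliveryRecord] = {}

    def record(self, message_id: str, event: DeliveryEvent) -> EmailDeliveryRecord:
        rec = self._records.setdefault(message_id, EmailDeliveryRecord(message_id))
        rec.events.append((event, datetime.now(timezone.utc)))
        return rec

    def never_attempted(self) -> List[str]:
        """Message IDs that were scheduled but never picked off the queue."""
        return [r.message_id for r in self._records.values()
                if r.status is DeliveryEvent.SCHEDULED]
```

A support query such as `never_attempted()` (in practice combined with an age threshold) is what would surface emails that vanished from the queue, since their history stops at the scheduled event.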

jezzsantos commented 6 months ago

What's left to do is:

  1. Consider adding a pre-step (scheduling), before the message is put on the queue, in which we already track the MessageId. It is possible that messages are lost on the queue in production scenarios.
  2. Demonstrating a real-world delivery adapter (e.g. MailGun, SendGrid).
  3. Demonstrating a webhook callback (e.g. from MailGun) that changes the state of the delivery. Ideally demonstrating some new kind of authorization (such as HMAC).
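For the HMAC idea in the last item, a minimal sketch of verifying a signed webhook callback might look like the following Python. The scheme is an assumption: the provider is taken to send a `timestamp`, a random `token`, and a hex `signature` computed as HMAC-SHA256 over the timestamp and token with a shared signing key. The field names, the five-minute freshness window, and the function names are illustrative; the real provider's documentation would need to be checked before relying on any of it.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(signing_key: str, timestamp: str, token: str) -> str:
    """Compute the hex HMAC-SHA256 of timestamp + token (assumed scheme)."""
    message = (timestamp + token).encode("utf-8")
    return hmac.new(signing_key.encode("utf-8"), message, hashlib.sha256).hexdigest()


def is_authentic(signing_key: str, timestamp: str, token: str, signature: str,
                 max_age_seconds: int = 300, now: Optional[float] = None) -> bool:
    """Return True only for a fresh callback with a matching signature."""
    current = time.time() if now is None else now
    try:
        age = abs(current - int(timestamp))
    except ValueError:
        return False  # timestamp is not a number
    if age > max_age_seconds:
        return False  # too old (or too far in the future): possible replay
    expected = sign(signing_key, timestamp, token)
    # Constant-time comparison, on bytes so non-ASCII input cannot raise.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Only after `is_authentic` returns True would the handler update the delivery state for the MessageId carried in the callback. Remembering recently seen tokens would add further protection against replays within the freshness window.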