Background context
On 17th April, an issue with the appointments API caused the webhook event handler to fail to fetch appointments for about an hour. The affected events correctly went to the dead letter queue (DLQ), but the abnormal volume of events hitting the queue did not trip any alarms, so there was no automatic notification of the problem. We should improve this mechanism, and also consider a DLQ redrive policy to automatically replay failed events once any problems have been resolved. As part of this, we should also consider whether it is worth dead lettering 404s at all.
We will likely need to look at the structure of the dead lettered message, as in its current form we won't be able to simply initiate a queue redrive.
Specification
Project: Reapit.Lambdas.DatabaseEventHandler
Review the content of the message that goes to the dead letter queue. The structure currently contains a rawDataObject, which should really be the message body, with any additional information (status codes, ids etc.) carried as attributes on the message. This means we'll be able to simply retry the events in that queue.
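To illustrate the proposed shape, here is a minimal sketch in Python (the project itself may well use a different language). It assumes the dead letter queue is SQS and that the handler sends the message itself. The function name, the attribute names StatusCode and EntityId, and the example queue URL are hypothetical; only the QueueUrl / MessageBody / MessageAttributes layout follows the SQS SendMessage request format.

```python
import json
from typing import Any, Dict, Optional


def build_dead_letter_message(
    queue_url: str,
    raw_event: Dict[str, Any],
    status_code: int,
    entity_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build SendMessage parameters where the body is the original event.

    The original event becomes the message body unchanged, so a replayed
    message looks the same to the handler as a first delivery. Diagnostic
    details travel as message attributes instead of wrapping the event.
    """
    # Attribute names are illustrative assumptions, not an agreed schema.
    attributes: Dict[str, Dict[str, str]] = {
        "StatusCode": {"DataType": "Number", "StringValue": str(status_code)},
    }
    if entity_id is not None:
        attributes["EntityId"] = {"DataType": "String", "StringValue": entity_id}

    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps(raw_event),
        "MessageAttributes": attributes,
    }
```

With boto3 the result could be passed as `sqs_client.send_message(**params)`. Keeping the body identical to the original event is what would allow a plain redrive to work without a transformation step.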
Consider whether there's any benefit in keeping hold of events that 404'd. Other status codes, such as 400/401/403, are the ones we'd normally want to retry.
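If we do decide to stop dead lettering 404s, the decision could be isolated in one small function. This is a sketch only: the function name and the default set of discarded codes are assumptions pending the outcome of this ticket, and it treats every other non-2xx status as worth keeping for retry.

```python
from typing import FrozenSet

# Assumption: 404 is the only status we would choose to drop. The set is a
# parameter so the policy can change without touching the handler logic.
DEFAULT_DISCARDED_STATUS_CODES: FrozenSet[int] = frozenset({404})


def should_dead_letter(
    status_code: int,
    discarded: FrozenSet[int] = DEFAULT_DISCARDED_STATUS_CODES,
) -> bool:
    """Return True if a failed fetch with this status should go to the DLQ."""
    if 200 <= status_code < 300:
        # Successful responses are never dead lettered.
        return False
    return status_code not in discarded
```

For example, `should_dead_letter(404)` returns False, while `should_dead_letter(401)` and `should_dead_letter(503)` return True.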
A CloudWatch alarm has now been manually configured for the dead letter queue metric "Number of messages sent". We should pull this into the existing IaC stack and look at any other alarms that would be useful here.
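As a starting point for codifying the alarm, the sketch below expresses it as parameters for the CloudWatch PutMetricAlarm API, since this ticket does not say which IaC tool the stack uses. The same fields should map onto the equivalent alarm resource in that tool. The function name, alarm name pattern, threshold, period and notification topic are all assumptions; the values of the manually configured alarm should be copied in place of these defaults.

```python
from typing import Any, Dict


def build_dlq_alarm(
    queue_name: str,
    alarm_topic_arn: str,
    threshold: float = 1,
    period_seconds: int = 300,
) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters for messages sent to the DLQ."""
    return {
        "AlarmName": f"{queue_name}-messages-sent",  # hypothetical naming
        "Namespace": "AWS/SQS",
        "MetricName": "NumberOfMessagesSent",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Quiet periods publish no data points; do not treat that as a breach.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [alarm_topic_arn],
    }
```

With boto3 this could be applied as `cloudwatch_client.put_metric_alarm(**params)`. One candidate for an additional alarm is ApproximateNumberOfMessagesVisible on the same queue, which reflects messages still waiting in the DLQ rather than the rate at which they arrive.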