rabblerouser / core

Pluggable, extensible membership database for community organising
GNU Affero General Public License v3.0
17 stars 10 forks

Deal with "poisoned" events #132

Open camjackson opened 7 years ago

camjackson commented 7 years ago

Right now whenever a service fails to process an event from the stream, that failure goes all the way back to the kinesis subscription, which causes AWS to send you the same event again. It will keep retrying (with exponential backoff) until the event is processed successfully.

This is great for consistency and resilience, but it presents a problem: what if an event somehow ends up in the stream that can never be processed successfully? For example, a registration event with a malformed email address will be retried forever, because processing it can never succeed. The event "poisons" the stream, and no new data can be processed until the event falls off the stream (probably 24 hours later).

The exact solution is not clear, but we need a way for these poisoned events to not cause event processing to grind to a halt.

camjackson commented 7 years ago

Suggestion: there are three different cases:

  1. The service responds to the lambda with a 2xx status code. This means the event has been processed successfully and the lambda should report a success to AWS.
  2. The service responds to the lambda with a 5xx status code. This means that it's the service's own fault for not processing that event (maybe it ran out of memory), and the lambda should report an error to AWS, so that the event is retried.
  3. The service responds to the lambda with a 4xx status code. This means that the event itself is bad, and should not be sent again (because it will always be bad). In this case, even though a failure occurred, the lambda should report a success to AWS, so that the event does not get retried.
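The three cases above could be sketched as a small status-code mapping in the lambda. This is just an illustration of the suggestion, not code from the repo; the function name `outcomeForStatus` and the `'success'`/`'retry'` return values are hypothetical:

```javascript
// Decide what the lambda should report back to AWS based on the
// service's HTTP status code. Reporting success acknowledges the
// Kinesis record (no retry); reporting an error makes AWS redeliver it.
function outcomeForStatus(statusCode) {
  if (statusCode >= 200 && statusCode < 300) {
    // Case 1: processed successfully, acknowledge the event.
    return 'success';
  }
  if (statusCode >= 400 && statusCode < 500) {
    // Case 3: the event itself is bad ("poisoned"). Retrying can never
    // succeed, so acknowledge it anyway to keep the stream moving.
    return 'success';
  }
  // Case 2 (5xx or anything unexpected): the service's fault, so report
  // an error and let AWS retry with backoff.
  return 'retry';
}

module.exports = { outcomeForStatus };
```

Note the deliberate asymmetry: a 4xx failure is reported to AWS as a success, because the goal is to unblock the stream, not to pretend the event worked. That's presumably where the open questions below come in (e.g. how the dropped event gets recorded).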

A few things are still unclear: