Right now, whenever a service fails to process an event from the stream, that failure propagates all the way back to the Kinesis subscription, which causes AWS to send the same event again. It will keep retrying (with exponential backoff) until the event is processed successfully.
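For context, the current flow looks roughly like the sketch below. The service URL and handler wiring are made up, but the key point is that any failure thrown from the lambda is what triggers the Kinesis retry:

```typescript
// Minimal sketch of the current flow (service URL and payload shape are illustrative).
import { KinesisStreamEvent } from 'aws-lambda';

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');

    const response = await fetch('https://consuming-service.internal/events', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: payload,
    });

    if (!response.ok) {
      // Throwing here reports a failure to AWS, so the Kinesis subscription
      // delivers the same records again (with exponential backoff).
      throw new Error(`Service rejected event: ${response.status}`);
    }
  }
};
```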
This is great for consistency and resilience, but it presents a problem: what if an event ends up in the stream which can never succeed? For example, a registration event with a malformed email address will be retried forever, because it can never be processed successfully. The event "poisons" the stream, and no new data can be processed until it falls off the stream (probably 24 hours later).
The exact solution isn't clear, but we need a way to stop these poisoned events from grinding event processing to a halt. One option is for the lambda to treat the service's response differently depending on its status code:
The service responds to the lambda with a 2xx status code. This means the event has been processed successfully and the lambda should report a success to AWS.
The service responds to the lambda with a 5xx status code. This means it's the service's own fault for not processing that event (maybe it ran out of memory), and the lambda should report an error to AWS so that the event is retried.
The service responds to the lambda with a 4xx status code. This means that the event itself is bad, and should not be sent again (because it will always be bad). In this case, even though a failure occurred, the lambda should report a success to AWS, so that the event does not get retried.
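A minimal sketch of that mapping, with illustrative names (nothing here is an existing API):

```typescript
// How the lambda could translate the service's status code into an outcome.
type Outcome = 'success' | 'retry' | 'drop';

const outcomeForStatus = (status: number): Outcome => {
  if (status >= 200 && status < 300) return 'success'; // processed fine
  if (status >= 400 && status < 500) return 'drop';    // event is bad and will never succeed
  return 'retry';                                       // service's fault; let Kinesis resend it
};

// Only 'retry' is surfaced to AWS as a failure.
// 'success' and 'drop' both look like success, so the event is not redelivered.
const reportToAws = (outcome: Outcome): void => {
  if (outcome === 'retry') {
    throw new Error('Processing failed; let Kinesis redeliver this event');
  }
};
```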
A few things are still unclear:
The consuming service (the one that's ultimately handling the event) shouldn't have to care about HTTP status codes, lambda errors, etc. Right now all you do is return a resolved or rejected promise, and the stream-client library handles the mapping for you. It would be good to keep that pattern, but we might need some special way to distinguish between the two different types of failure that can occur (retry vs. don't retry). Or alternatively, perhaps the consuming service should just return success if the event is invalid, though that seems a bit wrong.
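One hypothetical way to keep that pattern is a dedicated error type the consumer can reject with, which stream-client would translate into a "don't retry" response. This is purely a sketch; neither UnprocessableEventError nor this mapping exists today:

```typescript
// Hypothetical error type the consuming service could reject with.
class UnprocessableEventError extends Error {}

// Inside the consuming service's handler:
const handleRegistration = async (event: { email: string }): Promise<void> => {
  if (!event.email.includes('@')) {
    // This event can never succeed, so ask not to be sent it again.
    throw new UnprocessableEventError(`Malformed email: ${event.email}`);
  }
  // ... normal processing; any other rejection still means "retry me".
};

// Inside stream-client (sketch): translate the rejection into a status code.
const statusForError = (err: unknown): number =>
  err instanceof UnprocessableEventError ? 422 : 500;
```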
In the second case, the 5xx might be the result of a bug that will always cause that particular event to fail. If that happens, the original problem still occurs: the event will be retried forever (or at least until it falls off the end of the stream after 24 hours). Not sure what the solution to this is, but it's probably a separate issue.
What do we do with events in the third case? Do we just assume that they're dodgy and forget about them? Or do we store them in a dead letter queue to be looked at 'later'? Again, this could be a separate issue if we think it's worth solving.
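If we did go the dead letter queue route, a minimal sketch might be to push the raw event onto an SQS queue provisioned for this purpose (the queue URL below is made up):

```typescript
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Park an unprocessable event somewhere it can be inspected 'later'.
const sendToDeadLetterQueue = async (rawEvent: string, reason: string): Promise<void> => {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: 'https://sqs.ap-southeast-2.amazonaws.com/123456789012/poisoned-events',
      MessageBody: JSON.stringify({ rawEvent, reason, receivedAt: new Date().toISOString() }),
    }),
  );
};
```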