kaihendry commented 5 years ago

Currently our Microservice architecture looks like:

1. Enterprise API sends the payload asynchronously via the lambda interface
2. lambda2sns uses request/response to execute the payload whilst giving feedback to the unee_t_enterprise database
3. MEFE registers the payload and triggers, over HTTP, the
4. {Unit,Invite} lambdas

The problem is that if there is a bulk action from the Enterprise API, lambda2sns may receive 100s of payloads via the lambda interface, which it will by default process concurrently.

This overwhelms MEFE in two ways:

a. MEFE runs out of memory with such a high request count
b. When MEFE triggers the lambda in step 4, the database can get blocked since there are so many incoming connections

MEFE currently fails by returning a 5xx, which results in lambda2sns retrying the payload 3 times and then putting the message into a DLQ.

Introducing a SQS queue

SQS is complex to orchestrate
Aurora can only trigger a lambda, so we need to create an initial "enqueue lambda" just to place a payload on SQS, which would probably mean updating the database schema
lambda2sns uses a lambda interface, so it would probably need to be changed to handle the SQS event/message style
the "enqueue lambda" should probably update the unee_t_enterprise database, so that it knows whether messages are in the queue, being processed, or in the DLQ
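
To make the "SQS event/message style" point concrete, here is a minimal sketch in Python of a handler that accepts both invocation styles. It assumes the payload is a JSON object; `extract_payloads`, `process` and the example field names are hypothetical and not taken from lambda2sns. The only thing relied on from AWS is the shape of the SQS trigger event: a `Records` list whose entries carry `eventSource` set to `aws:sqs` and the message as a string in `body`.

```python
import json
from typing import Any, Dict, List


def extract_payloads(event: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the payloads carried by either invocation style."""
    records = event.get("Records")
    if isinstance(records, list) and records and all(
        r.get("eventSource") == "aws:sqs" for r in records
    ):
        # SQS trigger: one event can batch several messages, and each
        # record's "body" is the message body as a JSON string.
        return [json.loads(r["body"]) for r in records]
    # Direct (lambda interface) invocation: the event is the payload itself.
    return [event]


def process(payload: Dict[str, Any]) -> Any:
    """Hypothetical stand-in for the existing payload handling."""
    return payload.get("id")


def handler(event: Dict[str, Any], context: Any = None) -> List[Any]:
    # Same processing path regardless of how the lambda was invoked.
    return [process(p) for p in extract_payloads(event)]
```

With an adapter like this the existing processing code would not need to know where the payload came from, though error handling per record (so one bad message does not fail a whole batch) would still need thought.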

A queue would solve our spikes and allow us to specify longer gaps between retries
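
As a rough sketch of where the "longer gaps between retries" would come from: on an SQS queue a message that is received but not deleted becomes visible again after the visibility timeout, and a redrive policy moves it to a DLQ after a set number of receives. The helper below only builds the attribute map; the function name, the ARN and the numbers are made up for illustration and are not our actual configuration.

```python
import json
from typing import Dict


def build_queue_attributes(
    dlq_arn: str, retry_gap_seconds: int, max_receives: int
) -> Dict[str, str]:
    """Build SQS queue attributes giving a fixed gap between retries."""
    return {
        # A message that was received but not deleted reappears after this
        # many seconds, which acts as the gap between attempts.
        "VisibilityTimeout": str(retry_gap_seconds),
        # After max_receives receives without a delete, SQS moves the
        # message to the dead-letter queue identified by dlq_arn.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receives)}
        ),
    }


# Hypothetical values: 5 minute gap, 3 attempts, placeholder DLQ ARN.
attrs = build_queue_attributes(
    "arn:aws:sqs:us-east-1:123456789012:example-dlq", 300, 3
)
# These could then be passed to boto3, e.g.
#   boto3.client("sqs").create_queue(QueueName="example", Attributes=attrs)
```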

Right now we only have lambda concurrency limits and the default async retry behaviour to fall back on. This is too naive to cover all the cases.
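
To illustrate the difference a queue with a bounded consumer makes, here is a small local simulation (no AWS involved). It is only a sketch: `drain`, `simulate` and `fake_backend` are hypothetical names, and `fake_backend` stands in for the HTTP call to MEFE. A fixed number of workers pull from a queue, so the number of in-flight calls never exceeds the worker count however large the burst is.

```python
import asyncio


async def drain(payloads, max_in_flight, call_backend):
    """Process payloads from a queue with at most max_in_flight concurrent calls."""
    queue = asyncio.Queue()
    for p in payloads:
        queue.put_nowait(p)

    async def worker():
        # Each worker handles one payload at a time until the queue is empty.
        while True:
            try:
                payload = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await call_backend(payload)

    await asyncio.gather(*(worker() for _ in range(max_in_flight)))


def simulate(n_payloads=200, max_in_flight=5):
    """Return (peak concurrent calls, completed calls) for a burst of payloads."""
    in_flight = 0
    peak = 0
    done = 0

    async def fake_backend(payload):
        nonlocal in_flight, peak, done
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)  # yield control, standing in for the round trip
        in_flight -= 1
        done += 1

    asyncio.run(drain(range(n_payloads), max_in_flight, fake_backend))
    return peak, done
```

Calling `simulate()` returns the peak number of concurrent backend calls and the number of completed payloads; the peak stays at or below `max_in_flight` while every payload is still processed.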

kaihendry commented 5 years ago

Just noticed a simpler way of putting the DLQ messages from alambda back on SQS. Then we can make alambda consume the SQS messages, hopefully without any interface changes.

This has the advantage of being simpler to implement than the changes described above; however, MEFE would likely still be overloaded.

kaihendry commented 5 years ago

In the dev environment, I've configured alambda_simple's DLQ to relay to an SQS queue, which should then relay back to alambda_simple via the configured Lambda trigger.

Now this needs testing @franck-boullier

kaihendry commented 5 years ago

This change has been rolled out to the dev environment. We need some more testing in dev. I was testing with old DLQ messages, but we should probably now move on to "mass deassignment/reassignment via the UNTE interface". @franck-boullier

Furthermore, we need to create the queue configurations in the other environments.

kaihendry commented 5 years ago

dev/demo now have a queue as described in https://blog.deleu.dev/leveraging-aws-sqs-retry-mechanism-lambda/

Once prod is switched over, I'll close this. I want to do some more testing over the weekend.

unee-t / lambda2sqs

Reasons for introducing a queue #21