unee-t / lambda2sqs

Relays SQL triggered payloads to MEFE via a queue
https://ap-southeast-1.console.aws.amazon.com/lambda/home?region=ap-southeast-1#/applications/lambda2sqs
GNU Affero General Public License v3.0

Reasons for introducing a queue #21

Closed by kaihendry 5 years ago

kaihendry commented 5 years ago

Currently our Microservice architecture looks like:

  1. Enterprise API sends the payload asynchronously via the lambda interface
  2. lambda2sns uses request-response to execute the payload whilst giving feedback to the unee_t_enterprise database
  3. MEFE registers the payload and triggers the lambdas in step 4 over HTTP
  4. {Unit,Invite} lambdas

The problem is that if there is a bulk action from the Enterprise API, lambda2sns may receive hundreds of payloads via the lambda interface, which by default it processes concurrently.

This overwhelms MEFE in two ways:

  a. MEFE runs out of memory with such a high request count
  b. When MEFE triggers the lambda in step 4, the database can get blocked since there are so many incoming connections

MEFE currently fails by returning a 5xx, which results in lambda2sns retrying the payload 3 times and then putting the message into a DLQ.
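To make the failure path concrete, here is a minimal sketch of it in Python. It is a simulation, not the actual lambda2sns code: `deliver`, `overloaded_mefe` and the `{"id": n}` payload shape are hypothetical names, and `max_attempts=3` is an assumed reading of "retrying 3 times".

```python
from typing import Callable, List


def deliver(payload: dict, send: Callable[[dict], int], dlq: List[dict],
            max_attempts: int = 3) -> bool:
    """Send a payload; after max_attempts 5xx responses, park it in the DLQ.

    `send` stands in for the HTTP call to MEFE and returns a status code.
    """
    for _ in range(max_attempts):
        if send(payload) < 500:
            return True
    dlq.append(payload)
    return False


# Hypothetical stand-in for an overloaded MEFE that always answers 503.
calls = []


def overloaded_mefe(payload: dict) -> int:
    calls.append(payload)
    return 503


dlq: List[dict] = []
for n in range(100):  # a bulk action producing 100 payloads
    deliver({"id": n}, overloaded_mefe, dlq)

print(len(calls), len(dlq))
```

Under these assumptions, a bulk action of 100 payloads against an overloaded MEFE produces 300 requests and leaves all 100 payloads in the DLQ, so immediate retries add load without improving the outcome.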

Introducing an SQS queue

A queue would absorb our spikes and allow us to specify longer gaps between retries.

Right now we only have lambda concurrency limits and the default async retry behaviour to fall back on. This is too naive to cover all the cases.
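As a rough illustration of what a queue buys us, the sketch below simulates a consumer that takes a bounded number of messages per tick, makes failed messages visible again only after a gap, and dead-letters them after a maximum number of receives. It is a toy model loosely patterned on an SQS visibility timeout and redrive policy, not AWS code; `drain`, `per_tick`, `retry_gap` and `max_receives` are hypothetical names and values.

```python
import heapq
from typing import Any, Callable, List, Tuple


def drain(payloads: List[Any], handle: Callable[[Any], bool],
          per_tick: int = 2, retry_gap: int = 5,
          max_receives: int = 3) -> Tuple[List[Any], List[Any], int]:
    """Simulate draining a queue; returns (done, dead, ticks_elapsed)."""
    # Heap entries are (visible_at, seq, receives, payload); seq is unique,
    # so payloads themselves are never compared.
    heap = [(0, i, 0, p) for i, p in enumerate(payloads)]
    heapq.heapify(heap)
    seq = len(heap)
    done: List[Any] = []
    dead: List[Any] = []
    tick = 0
    while heap:
        # Skip ahead to the next tick at which a message is visible.
        tick = max(tick, heap[0][0])
        taken = 0
        while heap and heap[0][0] <= tick and taken < per_tick:
            _, _, receives, payload = heapq.heappop(heap)
            taken += 1
            receives += 1
            if handle(payload):
                done.append(payload)
            elif receives >= max_receives:
                dead.append(payload)  # give up: move to the DLQ
            else:
                # Failed: becomes visible again only after retry_gap ticks.
                heapq.heappush(heap, (tick + retry_gap, seq, receives, payload))
                seq += 1
        tick += 1
    return done, dead, tick
```

With `per_tick` capping how many payloads reach the downstream service at once and `retry_gap` spacing out the retries, a spike is spread over time instead of arriving all at once.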

kaihendry commented 5 years ago
DLQ to SQS

Just noticed a simpler way of putting the DLQ messages from alambda back onto an SQS queue. Then we can make alambda consume the SQS messages, hopefully without any interface changes.

This has the advantage of being simpler to implement than the changes described above. However, it would still mean that MEFE is likely to be overloaded.
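One way to keep the interface unchanged is a thin adapter that unwraps SQS-delivered events before calling the existing logic. The sketch below is in Python for illustration and is not the alambda code: `make_handler` and `process` are hypothetical names, and it assumes each SQS message body is the original payload as JSON. It relies on the standard SQS event shape, where each record carries `eventSource` set to `aws:sqs` and a string `body`.

```python
import json
from typing import Any, Callable, List


def unwrap_sqs(event: dict) -> List[Any]:
    """Return the payloads carried by an SQS event, or [event] if invoked directly."""
    records = event.get("Records")
    if (isinstance(records, list) and records
            and all(r.get("eventSource") == "aws:sqs" for r in records)):
        # Assumption: each message body is the original payload as JSON.
        return [json.loads(r["body"]) for r in records]
    return [event]


def make_handler(process: Callable[[Any], Any]) -> Callable[..., Any]:
    """Wrap the existing payload logic so it accepts both invocation styles."""
    def handler(event: dict, context: Any = None) -> Any:
        if "Records" not in event:
            return process(event)  # direct invoke: behaviour unchanged
        for payload in unwrap_sqs(event):
            # An exception here fails the batch, so SQS redelivers it later.
            process(payload)
        return None
    return handler
```

Raising on failure rather than swallowing the error is deliberate in this sketch: it leaves the retry timing to the queue's configuration instead of the function.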

kaihendry commented 5 years ago

In the dev environment, I've configured alambda_simple's DLQ to relay to an SQS queue, which should then relay back to alambda_simple via the configured Lambda trigger.

Now this needs testing @franck-boullier

kaihendry commented 5 years ago

This change has rolled out to the dev environment. We need some more testing in dev. I was testing with old DLQ messages, but we should probably now move to "mass deassignment/reassignment via the UNTE interface". @franck-boullier

Furthermore, we need to create the queue configurations in the other environments.

kaihendry commented 5 years ago

dev/demo now have a queue as described at https://blog.deleu.dev/leveraging-aws-sqs-retry-mechanism-lambda/

Once prod is switched over, I'll close this. I want to do some more testing over the weekend.