unee-t / lambda2sqs

Relays SQL triggered payloads to MEFE via a queue
https://ap-southeast-1.console.aws.amazon.com/lambda/home?region=ap-southeast-1#/applications/lambda2sqs
GNU Affero General Public License v3.0

RDS keeps disappearing when trying to do operations in batch #18

Closed franck-boullier closed 5 years ago

franck-boullier commented 5 years ago

The problem:

A few days ago, when I was doing a mass assignment of a user to several units via the Unee-T Enterprise interface, things were working as intended (in the DEV/Staging):

Now, when I'm trying to do a mass assignment of a user to several units via the Unee-T Enterprise interface, this is NOT working as it should:

More information:

This problem started to appear after this commit was rolled out.

See slack conversation.

kaihendry commented 5 years ago

The main change I made was to optimise the DB connection. So now there is no explicit close and open of the connection, ~2s between function invocations.

Currently I'm struggling with https://enterprise.dev.unee-t.com/enterprise/menu.php

franck-boullier commented 5 years ago

Re "Currently I'm struggling with https://enterprise.dev.unee-t.com/enterprise/menu.php": that page is accessible for me. What seems to be the issue?

kaihendry commented 5 years ago
[2019-05-20T11:47:58+08:00] (ecs/bugzilla/bf7758eb-b991-4b29-9463-a7ee1ce515d8) [Mon May 20 03:47:58.193438 2019] [:error] [pid 32] \nCan't connect to the database.\nError: Can't connect to MySQL server on 'auroradb.dev.unee-t.com' (111)\n  Is your database installed and up and running?\n  Do you have the correct username and password selected in localconfig?\n\n

I'm not sure it makes sense to roll back this change: it is admittedly faster than it was before, though functionally the same. It has just exposed other issues in the system.

https://media.dev.unee-t.com/2019-05-20/processlist.txt

$ grep sort processlist.txt  | wc -l
78

I am worried about what Bugzilla is doing against the database, as well as invites failing with: Error 1213: Deadlock found when trying to get lock; try restarting transaction [Invite API Lambda error]

To make the RDS more available whilst expensive queries and sort indexes are being generated, I suggest we can:

kaihendry commented 5 years ago

Prod has been rolled back to version=41 which maps to https://github.com/unee-t/lambda2sns/commit/49a8c341fad220b213e6c495839c40086fb3bb36

That version is not as fast, due to the setup/teardown of the SQL connection, and in some cases it doesn't return the error properly, so the operation is not retried.

kaihendry commented 5 years ago

The lambda2sns refactor in dev has shown, at least to me, that the changes are fine.

Regarding pressure on the database: we need to tweak concurrency so that we don't overload it.

The real reason why the RDS "keeps disappearing" is probably the queries in the "Creating sort index" state, as seen in https://media.dev.unee-t.com/2019-05-20/processlist.txt