Closed nbiton closed 5 years ago
My own guess is that the lambda might not recognize a 201 response code as successful, and will cause this issue
Just to be clear, is this cachedResponsesCount
is counting the times the lambda called /api/db-change-message/process
?
https://github.com/unee-t/lambda2sns/blob/master/functions/simple/main.go#L108
cachedResponsesCount
is counting how many times a response was returned after the first "resolved" response for this payload (anything but 500+) was returned. All the cached responses would be the same as the first resolved response, as the response is stored once and then fetched from the db for each subsequent request
Looking at the production logs: https://ap-southeast-1.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#logEventViewer:group=/aws/lambda/alambda_simple;filter=%2224fcf535-9035-11e9-b93f-068ed32d0d10%22;start=P2W
I can see the error "error": "Error 1062: Duplicate entry 'BCX8yuTV7WNUZHsks' for key 'unique_mefe_unit_id'",
for the payload:
"sql": "SET @unit_creation_request_id = 4064;\nSET @mefe_unit_id = 'BCX8yuTV7WNUZHsks';\nSET @creation_datetime = '2019-06-16 13:14:55';\nSET @is_created_by_me = 1;\nSET @mefe_api_error_message = '';\nCALL ut_creation_unit_mefe_api_reply;",
Happening again and again. Since the SQL failed, lambda2sns just retries again and again upto 3 times IIRC.
I can see the error "error": "Error 1062: Duplicate entry 'BCX8yuTV7WNUZHsks' for key 'unique_mefe_unit_id'".
@kaihendry in this instance, the SQL is not "failing": the SQL routine identifies that the upstream mechanism (the lambda) is trying to create the same unit over and over again. The SQL side then correctly refuses to process the duplicate requests <--- this is the expected behavior.
What's unexpected here is that the same request is sent multiple time after that request was correctly processed by the SQL side.
@kaihendry my suggestion would be to make sure that the golang script correctly interprets the error code Error 1062: Duplicate entry
and responds accordingly ---> there no need to insert a new record for that MEFE unit_id in the SQL again since it exists already.
So in this specific case, I check the error message for Error 1062
, log the error and stop.
I.e. I do not return the error like I do currently.
So in this specific case, I check the error message for Error 1062, log the error and stop. I.e. I do not return the error like I do currently.
That's probably a good first step yes.
A payload to create a unit yielded the following audit record:
This means the request took ~3.3s and did not fail once (no failed 500 responses). However, while it was still processing, the initiating lambda retried 3 times (
blockedResponsesCount
) and after it returned a successful response, it was retried another 318 times! :flushed: (cachedResponesCount
). Each of those 318 requests received a successful response (201). And after all of that, it still didn't update the enterprise database that the unit creation was completed successfully.This was on PROD.