unee-t / lambda2sqs

Relays SQL triggered payloads to MEFE via a queue
https://ap-southeast-1.console.aws.amazon.com/lambda/home?region=ap-southeast-1#/applications/lambda2sqs
GNU Affero General Public License v3.0
0 stars 4 forks source link

Retry internal API logic doesn't work as expected #20

Closed nbiton closed 5 years ago

nbiton commented 5 years ago

A payload to create a unit yielded the following audit record:

{
    "_id" : "24fcf535-9035-11e9-b93f-068ed32d0d10_CREATE_UNIT",
    "payload" : {
        ...[redacted]
        "mefeAPIRequestId" : "24fcf535-9035-11e9-b93f-068ed32d0d10",
        "unitCreationRequestId" : 1325
    },
    "acceptedAt" : ISODate("2019-06-16T13:14:52.178Z"),
    "blockedResponsesCount" : 3,
    "done" : true,
    "respondedAt" : ISODate("2019-06-16T13:14:55.536Z"),
    "response" : {
        "code" : 201,
        "data" : {
            "unitMongoId" : "BCX8yuTV7WNUZHsks",
            "timestamp" : "2019-06-16T13:14:55.536Z"
        },
        "headers" : {

        }
    },
    "cachedResponsesCount" : 318
}

This means the request took ~3.3s and did not fail once (no failed 500 responses). However, while it was still processing, the initiating lambda retried 3 times (blockedResponsesCount) and after it returned a successful response, it was retried another 318 times! :flushed: (cachedResponesCount). Each of those 318 requests received a successful response (201). And after all of that, it still didn't update the enterprise database that the unit creation was completed successfully.

This was on PROD.

nbiton commented 5 years ago

My own guess is that the lambda might not recognize a 201 response code as successful, and will cause this issue

kaihendry commented 5 years ago

Just to be clear, is this cachedResponsesCount is counting the times the lambda called /api/db-change-message/process?

https://github.com/unee-t/lambda2sns/blob/master/functions/simple/main.go#L108

nbiton commented 5 years ago

cachedResponsesCount is counting how many times a response was returned after the first "resolved" response for this payload (anything but 500+) was returned. All the cached responses would be the same as the first resolved response, as the response is stored once and then fetched from the db for each subsequent request

kaihendry commented 5 years ago

Looking at the production logs: https://ap-southeast-1.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#logEventViewer:group=/aws/lambda/alambda_simple;filter=%2224fcf535-9035-11e9-b93f-068ed32d0d10%22;start=P2W

I can see the error "error": "Error 1062: Duplicate entry 'BCX8yuTV7WNUZHsks' for key 'unique_mefe_unit_id'", for the payload:

"sql": "SET @unit_creation_request_id = 4064;\nSET @mefe_unit_id = 'BCX8yuTV7WNUZHsks';\nSET @creation_datetime = '2019-06-16 13:14:55';\nSET @is_created_by_me = 1;\nSET @mefe_api_error_message = '';\nCALL ut_creation_unit_mefe_api_reply;",

Happening again and again. Since the SQL failed, lambda2sns just retries again and again upto 3 times IIRC.

franck-boullier commented 5 years ago

I can see the error "error": "Error 1062: Duplicate entry 'BCX8yuTV7WNUZHsks' for key 'unique_mefe_unit_id'".

@kaihendry in this instance, the SQL is not "failing": the SQL routine identifies that the upstream mechanism (the lambda) is trying to create the same unit over and over again. The SQL side then correctly refuses to process the duplicate requests <--- this is the expected behavior.

What's unexpected here is that the same request is sent multiple time after that request was correctly processed by the SQL side.

franck-boullier commented 5 years ago

@kaihendry my suggestion would be to make sure that the golang script correctly interprets the error code Error 1062: Duplicate entry and responds accordingly ---> there no need to insert a new record for that MEFE unit_id in the SQL again since it exists already.

kaihendry commented 5 years ago

So in this specific case, I check the error message for Error 1062, log the error and stop.

I.e. I do not return the error like I do currently.

franck-boullier commented 5 years ago

So in this specific case, I check the error message for Error 1062, log the error and stop. I.e. I do not return the error like I do currently.

That's probably a good first step yes.