jovotech / jovo-framework

🔈 The React for Voice and Chat: Build Apps for Alexa, Messenger, Instagram, the Web, and more
https://www.jovo.tech
Apache License 2.0
1.67k stars 310 forks source link

Lambda Timeouts #573

Closed toreyjs closed 2 years ago

toreyjs commented 4 years ago

I'm submitting a...

Expected Behavior

Builds should finish in ~1s without erroring/timing out

Current Behavior

Somewhat randomly (though often in groups) error spikes will happen, accompanied by time outs.

Average error rate seems to be ~2-3% per day according to Alexa Dev Console session metrics (although it has gone above it by a lot for a few days, higher lambda timeouts seem to help a bit)

What I've Tried

I narrowed down what external services are used in my skill to: external assets, dynamoDB, analytics.

Based on logs, it does seem to save (Saved user: [id]), analytics return, it sends a response (Response JSON (jovo-framework) / Response (framework)) and ends with -- middleware 'platform.output' done - however at this point it eventually times out (based on how long my lambda timeout is, currently 8s).

Our working guess is that despite a response being sent the lambda never closes, which eventually sends and END RequestId: [id], causing an error despite everything going alright. However, nothing other than the lambda seem to be an issue when tested (everything seems to return, DB metrics look fine, mp3 files are only used on device side).

Small side note is that we have two jovo skills, and interestingly the large error / duration spikes happen at almost exactly the same time / amount (looking at cloudwatch metrics). Another skill the client has also uses Jovo and while it does have timeouts, they don't mirror the results like my two skills do. Note: My two skills use separate databases / lambdas, but do use the same server for assets (though nothing fetched server side) and same analytics platform (though again it always returns way before timeout); nothing else should be shared.

Logs

Logs look something like this (with DEBUG level on):

@timestamp,@message
(NOTE: the line below is a log I have at the top of app.js, right before 'new App()')
18:00:35.004,"2019-08-21T18:00:35.003Z  [id]    process.env.DEBUG = false, global.gAppMode = production
18:00:34.421,"END RequestId: [id]
18:00:34.421,"REPORT RequestId: [id]    Duration: 8005.38 ms    Billed Duration: 8000 ms    Memory Size: 128 MB Max Memory Used: 53 MB  
18:00:34.421,"2019-08-21T18:00:34.420Z [id] Task timed out after 8.01 seconds

18:00:27.724,"[id]  -- middleware 'platform.output' done (38 ms)
18:00:27.724,"[id]  
18:00:27.724,"[id]  Response  (framework) -------------------------------------------------
18:00:27.724,"[id]  
18:00:27.724,"[id]  
18:00:27.724,"[id]    -- Response JSON (jovo-framework) ----------------------------------------
18:00:27.724,"[id]  
18:00:27.724,"[id]  {
18:00:27.724,"[id]      ""version"": ""1.0"",
18:00:27.724,"[id]      ""response"": {
18:00:27.724,"[id]          ""shouldEndSession"": false,
18:00:27.724,"[id]          ""outputSpeech"": { ... },
18:00:27.724,"[id]          ""reprompt"": { ... },
18:00:27.724,"[id]          ""directives"": [ ... ],
18:00:27.724,"[id]          ""card"": { ... }
18:00:27.724,"[id]      },
18:00:27.724,"[id]      ""sessionAttributes"": {
18:00:27.724,"[id]          ""_JOVO_STATE_"": ""PlayerCountSelectState""
18:00:27.724,"[id]      }
18:00:27.724,"[id]  }
18:00:27.724,"[id]  -- middleware 'response' done (1 ms)
18:00:27.647,"[id]   this.$user.$metaData
18:00:27.647,"[id]  -- middleware 'user.save' done (260 ms)
18:00:27.647,"[id]  
18:00:27.647,"[id]  Output object: this.$output (framework) -------------------------------
18:00:27.647,"[id]  
18:00:27.647,"[id]  {
18:00:27.647,"[id]      ""Alexa"": {
18:00:27.647,"[id]          ""DisplayTemplate"": { ... },
18:00:27.647,"[id]          ""StandardCard"": { ... },
18:00:27.647,"[id]          ""Directives"": [ ... ]
18:00:27.647,"[id]      },
18:00:27.647,"[id]      ""ask"": { ... }
18:00:27.647,"[id]  }
18:00:27.646,"[id]  
18:00:27.646,"[id]  Jovo user: (save)  (framework) ----------------------------------------
18:00:27.646,"[id]  
18:00:27.646,"[id]   Saved user: [account]
18:00:27.646,"[id]   this.$user.$context
18:00:27.646,"[id]  {
18:00:27.646,"[id]      ""prev"": [ ... ]
18:00:27.646,"[id]  }
18:00:27.646,"[id]   this.$user.$data
18:00:27.646,"[id]  { ... }
18:00:27.387,"[id]  
18:00:27.387,"[id]    -- ON_REQUEST ------------------------------------------------------------
18:00:27.387,"[id]  
18:00:27.387,"[id]   ON_REQUEST (0 ms)
18:00:27.387,"[id]  
18:00:27.387,"[id]  Handle  (framework) ---------------------------------------------------
18:00:27.387,"[id]  
18:00:27.387,"[id]  
18:00:27.387,"[id]  -- middleware 'handler' done (160 ms)
18:00:27.386,"2019-08-21T18:00:27.385Z  [id]    startState: PlayerCountSelectState
18:00:27.386,"[id]   toIntent: Start
18:00:27.386,"[id]   NEW_SESSION (159 ms)
18:00:27.328,"2019-08-21T18:00:27.327Z  [id]    startState: StartGame
18:00:27.328,"[id]   toIntent: Start
18:00:27.268,"[id]   toIntent: Start
18:00:27.228,"2019-08-21T18:00:27.228Z  [id]    startState: Greeting
18:00:27.227,"[id]  {
18:00:27.227,"[id]      ""type"": ""LAUNCH"",
18:00:27.227,"[id]      ""path"": ""LAUNCH""
18:00:27.227,"[id]  }
18:00:27.227,"[id]  -- middleware 'router' done (2 ms)
18:00:27.227,"[id]  
18:00:27.227,"[id]  Jovo handler  (framework) ---------------------------------------------
18:00:27.227,"[id]  
18:00:27.227,"[id]  
18:00:27.227,"[id]    -- NEW_SESSION -----------------------------------------------------------
18:00:27.227,"[id]  
18:00:27.226,"[id]  
18:00:27.226,"[id]  Jovo router  (framework) ----------------------------------------------
18:00:27.226,"[id]  
18:00:27.226,"[id]  Route object:
18:00:27.225,"[id]  this.$user.new = false
18:00:27.225,"[id]  {
18:00:27.225,"[id]      ""prev"": [ ... ]
18:00:27.225,"[id]  }
18:00:27.225,"[id]  this.$user.$data
18:00:27.225,"[id]  { ... }
18:00:27.225,"[id]  this.$user.$metaData
18:00:27.225,"[id]  -- middleware 'user.load' done (640 ms)
18:00:27.225,"[id]  -- middleware 'before.router' done (0 ms)
18:00:27.168,"[id]  
18:00:27.168,"[id]  Jovo user (load) (framework) ------------------------------------------
18:00:27.168,"[id]  
18:00:27.168,"[id]  this.$user.getId(): [id]
18:00:26.588,"[id]  -- middleware 'platform.nlu' done (37 ms)
18:00:26.588,"[id]  
18:00:26.588,"[id]  After nlu  (framework) ------------------------------------------------
18:00:26.588,"[id]  
18:00:26.588,"[id]  this.$nlu : undefined
18:00:26.588,"[id]  this.$inputs : {}
18:00:26.548,"[id]  -- middleware 'after.platform.init' done (1 ms)
18:00:26.548,"[id]  -- middleware 'platform.init' done (3 ms)
18:00:26.548,"[id]  
18:00:26.548,"[id]  After init  (framework) -----------------------------------------------
18:00:26.548,"[id]  
18:00:26.548,"[id]  this.$alexaSkill initialized
18:00:26.548,"[id]  this.$type: {""type"":""LAUNCH""}
18:00:26.548,"[id]  this.$session.$data : {}
18:00:26.545,"[id]  
18:00:26.545,"[id]    -- Request JSON (jovo-framework) -----------------------------------------
18:00:26.545,"[id]  
18:00:26.545,"[id]  {
18:00:26.545,"[id]      ""version"": ""1.0"",
18:00:26.545,"[id]      ""session"": { ... },
18:00:26.545,"[id]      ""context"": { ... },
18:00:26.545,"[id]      ""request"": { ... }
18:00:26.545,"[id]  }
18:00:26.545,"[id]  -- middleware 'request' done (57 ms)
18:00:26.488,"[id]  -- middleware 'setup' done (69 ms)
18:00:26.424,"[id]  Adding resources to $cms object:
18:00:26.424,"[id]  {}
18:00:26.418,"[id]  
18:00:26.418,"[id]  Start request (framework) ---------------------------------------------
18:00:26.418,"[id]  
18:00:26.415,"START RequestId: [id] Version: $LATEST

==Conclusion== I'm not sure if I am perhaps missing something obvious from the logs or if it's some internal Jovo thing I may not have accounted for, but after working with amazon a bit on trying to narrow down the issue we decided it would be best to reach out to you. Any help would of course be appreciated.

Your Environment

"jovo-db-dynamodb": "^2.2.0",
"jovo-framework": "^2.1.1",
"jovo-platform-alexa": "^2.1.2",
"jovo-plugin-debugger": "^2.1.1",
aswetlow commented 4 years ago

Thanks for reaching out. Sorry for the late reply (I was on vacation :))

I heard about the issue but haven't been able to reproduce it. Will run some tests at the weekend.

It's really weird that it occurs in two independent lambdas at the same time.

toreyjs commented 4 years ago

Thanks, let me know if you need any additional details.

aswetlow commented 4 years ago

Any news on this? I just found out that I had several timeouts in one of my skills. It was an erroneous API call that took longer than expected.

Do you call any external APIs that may take longer?

toreyjs commented 4 years ago

So far no solution; increasing the lambda's memory seems to have helped, but we've also had a decrease in users around the same time, so it's hard to know for sure which caused a decrease in errors. Only APIs we used were analytics, and were turned off to no noticeable effect.

After looking at x-ray logs (which mostly error out at AWS:Lambda), we're looking into database stuff again (though I'm not overly hopeful)

gshenar commented 4 years ago

I have been having these issues as well. The error rate is low, but its quite noticeable given that I'm handling millions of calls a month. I tried removing dashbot - doesn't seem to have been the fix. There aren't really any external calls I'm doing besides dynamo through the framework

jankoenig commented 4 years ago

Hey @gshenar, could you take a look at your DynamoDb monitoring? We've had issues in the past where read/write capacity was reached during peak times

gshenar commented 4 years ago

Yeah here's a screenshot. I have lots of alarms set up and it hasn't been going over. image

klapper42 commented 4 years ago

I had a similar problem with jovo-db-mysql (reproducible). Not sure what config triggered the issue, but it looked like the connection pooling held the lambda function from terminating. What helped was:

# src/index.js
exports.handler = async (event, context, callback) => {
    context.callbackWaitsForEmptyEventLoop = false;
    await app.handle(new Lambda(event, context, callback));
};
r-pai commented 4 years ago

I had a similar problem with jovo-db-mysql (reproducible). Not sure what config triggered the issue, but it looked like the connection pooling held the lambda function from terminating. What helped was:

# src/index.js
exports.handler = async (event, context, callback) => {
    context.callbackWaitsForEmptyEventLoop = false;
    await app.handle(new Lambda(event, context, callback));
};

Thanks @klapper42 , the solution you provided solved my issue.

My use case was a basic one, I wanted my hello world jovo project to run in AWS lambda and for session data instead of filedDB chose Mysql DB. When added to run in lambda ,it timed out. But when I run in my local with MYSQL database there was no issues.

Steps: 1) Create new jovo project with language as typescript 2) Followed steps in https://www.jovo.tech/docs/databases/mysql 3) Ran in local environment - success 4) Ran using AWS lambda, it timed out. Note 15 sec is the max

{ "errorMessage": "2020-04-25T15:08:59.734Z e7b73234-1977-4b7f-a3cf-939f55da2532 Task timed out after 15.02 seconds" }

aswetlow commented 2 years ago

I close this issue due to incativity.