slackapi / bolt-js

A framework to build Slack apps using JavaScript
https://tools.slack.dev/bolt-js/
MIT License
2.75k stars 395 forks source link

Lambda Custom Function triggered by workflow gets token_revoked error when API calls take longer than 3 seconds #2288

Open mihirkothari25 opened 1 month ago

mihirkothari25 commented 1 month ago

I created a sample custom function to demonstrate the issue we are seeing - https://github.com/mihirkothari25/bolt-js-getting-started-app. It's adapted from the sample starter app. I run ngrok and serverless-offline to run the app locally and point the event subscription URL in the slack app to the ngrok URL. I see that even after responding immediately from the function if the next interaction with Slack takes more than 3 seconds, the token_revoked error is thrown. If an API call takes longer than 3 seconds after the function has responded the first time, there will be a token_revoked error on the next complete or whatever slack function is being called.

In the real world scenario -

  1. Slack user invokes the workflow.
  2. Workflow sends an event to the /slack/events endpoint configured in API Gateway.
  3. That invokes the lambda function.
  4. The lambda function starts execution, but the API calls take over 3 seconds.
  5. The lambda function executes the complete, and it results in a token_revoked error.

To mitigate this issue, we tried to immediately respond after the lambda function starts execution. As soon as the lambda function starts executing, it would post an ephemeral message back to the user however, when the AWS Lambda cold starts, that takes too long, and the same error is encountered. If the AWS Lambda is warm, the error is less frequent.

My sample app impersonates the API call by just doing a sleep timeout. It also impersonates the AWS infrastructure by using serverless-offline.

@slack/bolt version

3.22.0

Your App and Receiver Configuration

Link to Application code - https://github.com/mihirkothari25/bolt-js-getting-started-app/blob/main/app.js

Using the AwsLambdaReceiver

Node.js runtime version

v21.7.3

Steps to reproduce:

https://github.com/mihirkothari25/bolt-js-getting-started-app/tree/main - forked and adapted starter app. Run npm install to install dependencies.

  1. Run ngrok on port 3000 ngrok http 3000
  2. Run serverless-offline with the command npm run serverless
  3. Update the Slack app event subscription URL with the URL generated by the ngrok command in step 1.
  4. Run the workflow from Slack.
  5. The Slack app will send the event to the ngrok URL from step 1.
  6. Ngrok will tunnel that event to localhost:3000 where serverless is running.
  7. Serverless will simulate invoking the lambda and execute the lambda function and respond to slack.

Expected result:

Slack should respond initially that it is executing the request and then respond with the result once it is done.

Actual result:

If the sleep is longer than 3 seconds, the function ends with the token_revoked error. This is despite responding immediately from the function and then sleeping for longer than 3 seconds.

Is the expectation that the function should keep pushing heartbeats to slack while it is finishing the execution of underlying API calls?

These are the errors -

[ERROR]  bolt-app Error: An API error occurred: token_revoked
    at Jae (/var/task/index.js:13:32210)
    at t.apiCall (/var/task/index.js:15:6190)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /var/task/index.js:239:426

Unhandled Promise Rejection
{
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "Error: An API error occurred: token_revoked",
    "reason":
    {
        "errorType": "Error",
        "errorMessage": "An API error occurred: token_revoked",
        "code": "slack_webapi_platform_error",
        "data":
        {
            "ok": false,
            "error": "token_revoked",
            "response_metadata":
            {}
        },
        "stack":
        [
            "Error: An API error occurred: token_revoked",
            "    at Jae (/var/task/index.js:13:32210)",
            "    at t.apiCall (/var/task/index.js:15:6190)",
            "    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"
        ]
    },
    "promise":
    {},
    "stack":
    [
        "Runtime.UnhandledPromiseRejection: Error: An API error occurred: token_revoked",
        "    at process.<anonymous> (file:///var/runtime/index.mjs:1276:17)",
        "    at process.emit (node:events:517:28)",
        "    at emit (node:internal/process/promises:149:20)",
        "    at processPromiseRejections (node:internal/process/promises:283:27)",
        "    at process.processTicksAndRejections (node:internal/process/task_queues:96:32)"
    ]
}
[ERROR]  bolt-app Error: An API error occurred: token_revoked
    at Jae (/var/task/index.js:13:32210)
    at t.apiCall (/var/task/index.js:15:6190)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /var/task/index.js:239:426
    at async RM (/var/task/index.js:83:23240)
    at async Array.<anonymous> (/var/task/index.js:80:445651)
    at async Array.<anonymous> (/var/task/index.js:80:441542)
    at async z9.processEvent (/var/task/index.js:83:35901)
    at async /var/task/index.js:121:12324 {
  code: 'slack_webapi_platform_error',
  data: { ok: false, error: 'token_revoked', response_metadata: {} }
}
Error:{"code":"slack_webapi_platform_error","data":{"ok":false,"error":"token_revoked","response_metadata":{}}}
misscoded commented 1 month ago

Hi @mihirkothari25! Thanks for raising this. A quick search internally reveals that this is not the first time that we've heard this issue from folks, but it's not immediately clear to me what the resolution is (if it even exists). I'll do a bit more digging and get back to you with more information (and hopefully a solution) next week.

filmaj commented 1 month ago

I looked into this a bit and found an interesting difference in behaviour when using the example app @mihirkothari25 provided vs. a socket-mode powered app (but same handler code: post a message, sleep for 3+ seconds, call complete). If you flip the log level to debug you may be able to also see this difference.

Here are logs for the socket mode powered app:

[DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"envelope_id":"25484494-c354-43e9-ba12-444858037c09","payload":{"team_id":"T029V6468RL","api_app_id":"A07T2726RFW","event":{"type":"function_executed","function":{"id":"Fn07TNFP4RJ4","callback_id":"sample_step","title":"Sample step","description":"Runs sample step","type":"app","input_parameters":[{"type":"slack#\/types\/user_id","name":"user_id","description":"Message recipient","title":"User","is_required":true,"hint":"Select a user in the workspace"},{"type":"slack#\/types\/interactivity","name":"interactivity","title":"Interactivity","is_required":true}],"output_parameters":[{"type":"slack#\/types\/user_id","name":"user_id","description":"User that completed the step","title":"User","is_required":true}],"category_id":"A07T2726RFW","category_label":"Custom Function Modal Test","app_id":"A07T2726RFW","date_created":1729605175,"date_released":0,"date_updated":1729605175,"date_deleted":0,"form_enabled":false},"inputs":{"user_id":"U02AEHE4KG9","interactivity":{"interactor":{"id":"U02AEHE4KG9","secret":"Nzk0OTY1Njc5MjA4MS4yMzM1MjA4MjEyODcwLjE4YjlkNjIxMTU1M2NhODQ3ZjIwNGY0ZTRkMTk3ZWUx"},"interactivity_pointer":"7949656792081.2335208212870.18b9d6211553ca847f204f4e4d197ee1"},"interactivity.interactor":{"id":"U02AEHE4KG9","secret":"Nzk0OTY1Njc5MjA4MS4yMzM1MjA4MjEyODcwLjE4YjlkNjIxMTU1M2NhODQ3ZjIwNGY0ZTRkMTk3ZWUx"},"interactivity.interactor.id":"U02AEHE4KG9","interactivity.interactor.secret":"Nzk0OTY1Njc5MjA4MS4yMzM1MjA4MjEyODcwLjE4YjlkNjIxMTU1M2NhODQ3ZjIwNGY0ZTRkMTk3ZWUx","interactivity.interactivity_pointer":"7949656792081.2335208212870.18b9d6211553ca847f204f4e4d197ee1"},"function_execution_id":"Fx07TJUPC3T4","workflow_execution_id":"Wx07TG3F4Z61","event_ts":"1729871460.565880"},"type":"event_callback","event_id":"Ev07TJUPCJ58","event_time":1729871460},"type":"events_api","accepts_response_payload":false,"retry_attempt":0,"retry_reason":""}
[DEBUG]  socket-mode:SocketModeClient:0 Calling ack() - type: events_api, envelope_id: 25484494-c354-43e9-ba12-444858037c09, data: undefined
[DEBUG]  socket-mode:SocketModeClient:0 send() method was called (WebSocket state: OPEN)
[DEBUG]  socket-mode:SlackWebSocket:1 isActive(): websocket ready state is OPEN
[DEBUG]  socket-mode:SocketModeClient:0 Sending a WebSocket message: {"envelope_id":"25484494-c354-43e9-ba12-444858037c09","payload":{}}
[DEBUG]  web-api:WebClient:2 initialized
[DEBUG]  bolt-app No conversation ID for incoming event
[DEBUG]  web-api:WebClient:3 initialized
[DEBUG]  web-api:WebClient:3 apiCall('chat.postMessage') start
[DEBUG]  web-api:WebClient:3 http request url: https://slack.com/api/chat.postMessage
[DEBUG]  web-api:WebClient:3 http request body: {"channel":"U02AEHE4KG9","text":"custom function has started yo"}
[DEBUG]  web-api:WebClient:3 http request headers: {"Accept":"application/json, text/plain, */*","User-Agent":"@slack:socket-mode/2.0.2 @slack:bolt/4.0.1 @slack:web-api/7.6.0 node/18.15.0 darwin/23.6.0","Authorization":"[[REDACTED]]"}
[DEBUG]  web-api:WebClient:3 http response received
[DEBUG]  web-api:WebClient:3 http request result: {"ok":true,"channel":"D07SWRZHNQM","ts":"1729871460.767479","message":{"user":"U07SZK61TFV","type":"message","ts":"1729871460.767479","bot_id":"B07TCCFMA6M","app_id":"A07T2726RFW","text":"custom function has started yo","team":"T029V6468RL","bot_profile":{"id":"B07TCCFMA6M","app_id":"A07T2726RFW","name":"Custom Function Modal Test","icons":{"image_36":"https://a.slack-edge.com/80588/img/plugins/app/bot_36.png","image_48":"https://a.slack-edge.com/80588/img/plugins/app/bot_48.png","image_72":"https://a.slack-edge.com/80588/img/plugins/app/service_72.png"},"deleted":false,"updated":1729605216,"team_id":"T029V6468RL"},"blocks":[{"type":"rich_text","block_id":"pUbD","elements":[{"type":"rich_text_section","elements":[{"type":"text","text":"custom function has started yo"}]}]}]},"response_metadata":{"scopes":["chat:write"],"acceptedScopes":["chat:write"]}}
[DEBUG]  web-api:WebClient:3 apiCall('chat.postMessage') end
[DEBUG]  web-api:WebClient:3 apiCall('functions.completeSuccess') start
[DEBUG]  web-api:WebClient:3 http request url: https://slack.com/api/functions.completeSuccess
[DEBUG]  web-api:WebClient:3 http request body: {"token":"[[REDACTED]]","outputs":"{\"user_id\":\"U02AEHE4KG9\"}","function_execution_id":"Fx07TJUPC3T4"}
[DEBUG]  web-api:WebClient:3 http request headers: {"Accept":"application/json, text/plain, */*","User-Agent":"@slack:socket-mode/2.0.2 @slack:bolt/4.0.1 @slack:web-api/7.6.0 node/18.15.0 darwin/23.6.0","Authorization":"[[REDACTED]]"}
[DEBUG]  web-api:WebClient:3 http response received
[DEBUG]  web-api:WebClient:3 http request result: {"ok":true,"response_metadata":{"scopes":["chat:write"]}}
[DEBUG]  web-api:WebClient:3 apiCall('functions.completeSuccess') end

And here are the logs for the AwsLambdaReceiver-powered sample that the OP provided:

POST /slack/events (λ: slack)
[DEBUG]  web-api:WebClient:1 initialized
[DEBUG]  bolt-app No conversation ID for incoming event
[DEBUG]  web-api:WebClient:2 initialized
[DEBUG]  web-api:WebClient:2 apiCall('chat.postMessage') start
[DEBUG]  web-api:WebClient:2 http request url: https://slack.com/api/chat.postMessage
[DEBUG]  web-api:WebClient:2 http request body: {"channel":"U02AEHE4KG9","text":"custom function has started yo"}
[DEBUG]  web-api:WebClient:2 http request headers: {}
[DEBUG]  web-api:WebClient:2 http response received
[DEBUG]  web-api:WebClient:2 http request result: {"ok":true,"channel":"D07SWRZHNQM","ts":"1729872308.257559","message":{"user":"U07SZK61TFV","type":"message","ts":"1729872308.257559","bot_id":"B07TCCFMA6M","app_id":"A07T2726RFW","text":"custom function has started yo","team":"T029V6468RL","bot_profile":{"id":"B07TCCFMA6M","app_id":"A07T2726RFW","name":"Custom Function Modal Test","icons":{"image_36":"https://a.slack-edge.com/80588/img/plugins/app/bot_36.png","image_48":"https://a.slack-edge.com/80588/img/plugins/app/bot_48.png","image_72":"https://a.slack-edge.com/80588/img/plugins/app/service_72.png"},"deleted":false,"updated":1729605216,"team_id":"T029V6468RL"},"blocks":[{"type":"rich_text","block_id":"zm0e","elements":[{"type":"rich_text_section","elements":[{"type":"text","text":"custom function has started yo"}]}]}]},"response_metadata":{"scopes":["chat:write"],"acceptedScopes":["chat:write"]}}
[DEBUG]  web-api:WebClient:2 apiCall('chat.postMessage') end
[DEBUG]  web-api:WebClient:2 apiCall('functions.completeSuccess') start
[DEBUG]  web-api:WebClient:2 http request url: https://slack.com/api/functions.completeSuccess
[DEBUG]  web-api:WebClient:2 http request body: {"token":"[[REDACTED]]","outputs":"{\"user_id\":\"U02AEHE4KG9\"}","function_execution_id":"Fx07TMGPUZ5J"}
[DEBUG]  web-api:WebClient:2 http request headers: {"Authorization":"[[REDACTED]]"}
[DEBUG]  web-api:WebClient:2 http response received
Error: An API error occurred: token_revoked
    at platformErrorFromResult (/Users/fmaj/src/bolt-js-custom-step-template/node_modules/@slack/web-api/dist/errors.js:62:33)
    at WebClient.apiCall (/Users/fmaj/src/bolt-js-custom-step-template/node_modules/@slack/web-api/dist/WebClient.js:181:56)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async /Users/fmaj/src/bolt-js-custom-step-template/app.js:30:5
    at async processFunctionMiddleware (/Users/fmaj/src/bolt-js-custom-step-template/node_modules/@slack/bolt/dist/CustomFunction.js:107:9)
    at async Array.<anonymous> (/Users/fmaj/src/bolt-js-custom-step-template/node_modules/@slack/bolt/dist/conversation-store.js:67:9)
    at async Array.<anonymous> (/Users/fmaj/src/bolt-js-custom-step-template/node_modules/@slack/bolt/dist/middleware/builtin.js:282:9)
    at async App.processEvent (/Users/fmaj/src/bolt-js-custom-step-template/node_modules/@slack/bolt/dist/App.js:598:13)
    at async /Users/fmaj/src/bolt-js-custom-step-template/node_modules/@slack/bolt/dist/receivers/AwsLambdaReceiver.js:122:17
    at async MessagePort.<anonymous> (file:///Users/fmaj/src/bolt-js-custom-step-template/node_modules/serverless-offline/src/lambda/handler-runner/worker-thread-runner/workerThreadHelper.js:24:14) {
  code: 'slack_webapi_platform_error',
  data: { ok: false, error: 'token_revoked', response_metadata: {} }
}

Even if I remove the call complete in the Lambda receiver case, I still get a message from Slackbot telling me the function did not execute. I think this may be a bug... I still need to investigate some more, but I think the main difference between the Lambda receiver code vs. the socket mode code is that the socket mode code immediately acknowledges the event (you can see this in the socket mode debug logs in the first few lines). The Lambda receiver code, however, does not. This may be a bug in the custom step + Lambda receiver case. Will continue looking into it and report back.

filmaj commented 1 month ago

Actually now that I think of it, NOT acknowledging the HTTP-based event in the Lambda receiver is by design. This is because acknowledging the HTTP event means returning an HTTP response to Slack's HTTP POST to the app. Doing so in the context of a Lambda means that this terminates the Lambda execution - effectively the lifecycle of the Lambda function is complete as soon as we send the HTTP response.

This is not a problem with Bolt so much as a standard challenge when building applications in a serverless environment. Such an application needs to change how it is built and use a different mental model and architecture. There are two constraints your app needs to work with:

Given these two constraints, you need to separate your application logic in a way that is different from e.g. a single long-running process powering Bolt and able to handle multiple incoming HTTP requests in parallel. My suggestion would be split up your functionality into discrete, separate Lambdas, each with their own lifecycle and execution. The following architecture should work:

  1. Your main app.function handler has a small scope and few responsibilities in order to be able to acknowledge the incoming HTTP request from Slack within 3 seconds. For example, perhaps all it does is extract the inputs from the function event payload and invokes a different Lambda in an asynchronous/decoupled way that has no constraints of execution time, passing the event data to the 'worker' Lambda. Having a small set of responsibilities like this would almost guarantee that your main function-handler Lambda always executes within 3 seconds (satisfying the Slack event HTTP dispatch response time constraint).
  2. The worker lambda, once invoked with the function data, can then take its time to do whatever long-running API calls it needs to do. At the end of its handling, it can call the functions.completeSuccess HTTP API to complete the custom step and have the housing workflow move on to the next step (this is the API that the complete handler within Bolt function handlers call; see this Bolt code for how the complete utility method is created).

Hope this helps.

mihirkothari25 commented 3 weeks ago

@filmaj Hey, thank you for responding. Based on my reading of your response, are these assumptions valid -

  1. The lambda function first invoked by Slack (via API Gateway) has to end its execution in 3 seconds. If the time from when the request started until the first complete is called in the lambda function, it has to be less than or equal to 3 seconds.
  2. The app.function callback function doesn't have an ack passed to it, so even if the lambda responds with a post_message within 3 seconds, the complete has to occur within 3 seconds. Is that correct?

Do you know how the handler lambda pass the slack client to the worker lambda? I am concerned about going down that route of breaking up the lambda into a handler lambda and a worker lambda, and still encountering the token_revoked error because the same token would be passed from the handler lambda to the worker lambda.

filmaj commented 3 weeks ago

The lambda function first invoked by Slack (via API Gateway) has to end its execution in 3 seconds.

This is correct, but it's important to understand why: because the HTTP request spawned by Slack is tied to the lambda invocation life. When your lambda finishes, you will be sending a response back to the Slack HTTP Request. Slack requires you to send your HTTP response within 3 seconds.

If the time from when the request started until the first complete is called in the lambda function, it has to be less than or equal to 3 seconds.

This is incorrect. complete, which is the same as the functions.completeSuccess HTTP API (complete within Bolt is nothing more than a wrapper for the functions.completeSuccess HTTP API), has no requirement on when it must be called. complete / functions.completeSuccess can be called days, weeks, in theory months after the function executes. The use case we use internally as a thought experiment to help understand the workflow system is: consider a custom function that requires some kind of manager approval, that DMs your manager with approve/deny buttons in a message. Clicking either of the buttons completes the function. If the manager is off on PTO, it could be weeks before they click the button and complete the function.

The app.function callback function doesn't have an ack passed to it, so even if the lambda responds with a post_message within 3 seconds, the complete has to occur within 3 seconds. Is that correct?

Incorrect, due to the same difference I mentioned in the previous point. The event payload delivered from Slack needs to be responded to within 3 seconds. That response to the Slack HTTP request is different / a completely separate consideration from calling the functions.completeSuccess API signaling that your custom function has completed and can now provide outputs to the next step in the housing workflow.