Closed RyanSchaefer closed 3 years ago
Thanks for asking the question. I can check the actual behavior by deploying a Lambda function with a similar structure later. Before that, let me share some general thoughts on this:
slack_bolt depends only on the slack_sdk package. slack_sdk does not have any additional dependencies(!). So, loading only these modules should not take so long.
[...] SSMParameter costs, but if something similar is happening to your app, that makes sense. If this assumption is correct, you can move the initialization to the lazy function (this means moving heavy module imports and/or initialization to the inside of a lazy function).
[...] with some say() in your repro steps. Do you have these API calls in the ack() listener function or in one of the lazy functions? If you have them in the ack function, you can move them to lazy functions. The response time from the Slack API server is consistent, but performing multiple Slack API calls in the ack function is not recommended.
If you already have a minimal working example that reproduces the issue, sharing the source is helpful to understand your issue.
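One way to check which imports dominate a cold start is to time them individually. The following is a minimal sketch using only the standard library; the helper names time_import and report are made up for illustration, and the module names in the demo are stand-ins for whatever the app really imports.

```python
import importlib
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing module_name in this process.

    A module that is already loaded is served from Python's module cache,
    so the result is close to zero; that roughly mirrors a warm container.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


def report(module_names):
    """Time each import and return (name, seconds) pairs, slowest first."""
    timings = [(name, time_import(name)) for name in module_names]
    return sorted(timings, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Stand-in module names; replace them with the app's real imports.
    for name, seconds in report(["json", "decimal", "email.mime.text"]):
        print(f"{name}: {seconds * 1000:.1f} ms")
```

Only the first import in a fresh process is meaningful, so this kind of measurement is most useful when run once at the top of a new interpreter (or a freshly deployed function).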
We are using SAM rather than chalice. So here is a minimal template to deploy:
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  QueueProcessor:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ./src
      Events:
        HttpApiEvent:
          Type: HttpApi
      Handler: # ... lambda_handler
      Policies:
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
            # function must be able to call itself in order to 1st handle acking within 3 seconds then 2nd
            # process the actual message
            - Effect: Allow
              Action: "lambda:InvokeFunction"
              Resource: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:${AWS::StackName}-*"
The ack() is separate from the response. I tried refactoring to this:
app = slack_bolt.App(
    process_before_response=True,
    signing_secret=SSMParameter(os.environ["SIGNING_SECRET_LOCATION"], max_age=3600).value,
    token=SSMParameter(os.environ["SLACK_API_LOCATION"], 3600).value,
)
app.event(re.compile(".*"))(
    ack=lambda body, ack: ack(),
    lazy=[
        lambda respond, body: SlackRequestHandler(
            app=BasicBot(os.environ["COMMAND_ENTRYPOINT"]).app
        ).handle(event, context)
    ],
)
to basically ack() the message as soon as possible, but I think that, again, fetching the SSMParameter is taking too long and pushing the ack() back far enough to cause a timeout. Enough is cached that on subsequent calls there are no issues.
It's just tricky because I essentially need an App without secrets to handle the initial ack(), then once that is processed I can do as much processing/loading as I want.
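The general idea of deferring a slow lookup until it is first needed, and then reusing the result for a while, can be sketched as below. This is an illustration only: LazyValue and fake_fetch are hypothetical names, the fetch function stands in for a slow parameter-store call, and nothing here shows that Bolt accepts such a deferred value. It also does not help when the secret is required before the ack (for example, to verify the request signature); it only moves the cost off the import path.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyValue(Generic[T]):
    """Fetch a value on first access and reuse it until max_age seconds pass."""

    def __init__(self, fetch: Callable[[], T], max_age: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._value: Optional[T] = None
        self._fetched_at: Optional[float] = None

    @property
    def value(self) -> T:
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._max_age)
        if expired:
            # The slow call happens here, on first use, not at import time.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Hypothetical stand-in for a slow parameter-store lookup.
calls = []


def fake_fetch() -> str:
    calls.append(1)
    return "xoxb-placeholder"


token = LazyValue(fake_fetch, max_age=3600)
# Nothing has been fetched yet; the first token.value access triggers the fetch.
```

The clock parameter exists only so the expiry logic can be exercised without waiting; in normal use the default monotonic clock is enough.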
With regard to the last point: I actually use respond() in a function separate from the ack. As you can see in the initially posted code, my ack is a minimal function which just calls the ack provided to it:
ack=lambda body, ack: ack()
@RyanSchaefer I solved this by setting up the API Gateway to respond back to Slack for me if the Lambda function is about to time out. Here's the relevant part of my SAM template:
ApiGatewayApi:
  Type: AWS::Serverless::Api
  Properties:
    DefinitionBody:
      openapi: 3.0.1
      info:
        title: API Title
      servers:
        - url: /v1
      paths:
        /slack:
          post:
            summary: Main Slack posting api resource.
            description: Main Slack posting resource. This is where you point the slack endpoint to receive events.
            responses:
              "200":
                description: Request Received Okay!
            x-amazon-apigateway-integration:
              type: aws_proxy
              uri: arn-to-lambda-function
              httpMethod: POST
              timeoutInMillis: 2300
              responses:
                default:
                  statusCode: 200
    GatewayResponses:
      INTEGRATION_TIMEOUT:
        StatusCode: 200
Specifically, the timeoutInMillis and GatewayResponses sections.
Oh wow, that is super handy! I am sure I could also tweak that to always respond within 3 seconds, then use respond() or say() to send the actual message later.
@mew1033 Thanks for sharing the knowledge! This is interesting. If your app handles only Events API requests and does not use any interactive features (e.g., buttons, modals), this workaround should work.
@RyanSchaefer Thanks for sharing the details.
"the fetching the SSMParameter is taking too long"
So, this is the root cause of this issue. In this case, API Gateway + AWS Lambda may not be a great fit for running your Slack apps (as long as you use SSM for loading credentials). This will affect not only cold-start timeouts but also the quality of the user experience (due to occasional slow responses). Also, I don't think there is anything that Bolt and its underlying Python SDK can do to improve the situation.
Only things I can suggest or recommend for this are:
"@mew1033 Thanks for sharing the knowledge! This is interesting. If your app handles only Events API requests and does not use any interactive features (e.g., buttons, modals), this workaround should work."
I don't see why this wouldn't work for interactive features as well. Just send back the 200 from the API Gateway itself, then let the function keep working and send its response later. Maybe I'm missing something though, would this model not work?
@mew1033 When you handle modal submissions with response_action (errors, update, push, clear; see https://api.slack.com/surfaces/modals/using#displaying_errors), the only way to do that is to include response_action in your HTTP response by calling ack({"response_action": "errors", "errors": {"your-block-id": "The value in this field must be longer than 5 characters"}}) or similar.
So, the essential solution to avoid timeouts (or retries by the Events API) is to ensure your app returns an HTTP response within 3 seconds in all cases.
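To make the response_action point concrete, here is a small sketch of building that response body. The function name build_ack_payload and the idea of passing in a block_id-to-text mapping are assumptions for illustration; only the shape of the returned dict comes from the comment above.

```python
from typing import Dict, Optional


def build_ack_payload(values: Dict[str, str],
                      min_length: int = 6) -> Optional[dict]:
    """Return the response body for a modal submission, or None if valid.

    values maps each block_id to the text the user entered (assumed to be
    already extracted from the submission payload).
    """
    errors = {
        block_id: (
            "The value in this field must be longer than "
            f"{min_length - 1} characters"
        )
        for block_id, text in values.items()
        if len(text) < min_length
    }
    if errors:
        # This dict has to travel in the HTTP response itself, which is why
        # a gateway-generated 200 cannot replace it.
        return {"response_action": "errors", "errors": errors}
    return None
```

When the result is not None, it is the dict that would be handed to ack(); otherwise a plain ack() closes the modal. Either way the answer has to be in the HTTP response, so it cannot be sent later.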
Ah, got it. Thanks!
Amazon only supports gateway responses on its REST API endpoints, but I am using an HttpApi endpoint. It appears this issue is still valid for this reason.
As I just came up with another idea, let me share it with you.
If it's totally fine for you to just accept any incoming requests (this is what the gateway response approach does), you can have a quite simple internet-facing lambda function that just enqueues the event data to an AWS queue service (say, SQS or Kinesis Data Streams). You can serialize the messages in the queue as JSON or any other text format. Then, when dequeueing the messages in another lambda function, you can still use Bolt for processing them. In this scenario, returning 200 OK to Slack within 3 seconds should not be hard.
You can verify the request signature as long as you serialize all the data in event, and reject unexpected requests. The downsides of this approach are 1) you cannot use ack() in the async Bolt code, and 2) you need to operate a queue system.
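A rough sketch of such a front function follows, using only the standard library. It checks Slack's v0 request signature (HMAC-SHA256 over "v0:timestamp:body") and hands the serialized request to a queue. The names is_valid_slack_request and front_handler, the shape of event, and the enqueue callable are assumptions for illustration; a real function would pass something that sends to SQS or Kinesis, and would also need to handle base64-encoded bodies and the Events API URL verification request, which are left out here.

```python
import hashlib
import hmac
import json
import time
from typing import Callable, Mapping, Optional


def is_valid_slack_request(signing_secret: str, timestamp: str, body: str,
                           signature: str, now: Optional[float] = None,
                           tolerance: int = 300) -> bool:
    """Check the v0 signature that Slack sends with each request."""
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > tolerance:
            return False  # too old or too far ahead: possible replay
    except ValueError:
        return False  # missing or malformed timestamp header
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    expected = "v0=" + hmac.new(signing_secret.encode("utf-8"), base,
                                hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature.encode("utf-8"))


def front_handler(event: Mapping, signing_secret: str,
                  enqueue: Callable[[str], None],
                  now: Optional[float] = None) -> dict:
    """Verify the request, enqueue it as JSON, and answer right away."""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    body = event.get("body") or ""
    ok = is_valid_slack_request(
        signing_secret,
        headers.get("x-slack-request-timestamp", ""),
        body,
        headers.get("x-slack-signature", ""),
        now,
    )
    if not ok:
        return {"statusCode": 401, "body": "invalid signature"}
    # Keep headers and raw body together so the consumer can rebuild the request.
    enqueue(json.dumps({"headers": headers, "body": body}))
    return {"statusCode": 200, "body": ""}
```

Passing enqueue in as a callable keeps the sketch free of any AWS SDK dependency and makes the handler easy to exercise locally with a plain list's append method.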
Hope this is helpful to you.
@RyanSchaefer I know the issue you described here may not be resolved yet, but ...
"the fetching the SSMParameter is taking too long" "So, this is the root cause of this issue."
the root cause of your issue is related to the initialization time for the SSM SDK and is not specific to Bolt for Python. I do understand the issue is related to the Slack Platform's 3-second timeout requirement. However, in the meantime, we would like to use this issue tracker mainly for managing this repository's issues.
For this reason, if you don't mind, may I close this topic now?
If you would like to have another open discussion about FaaS deployments in general, I am happy to have another issue for it. But I recommend going to our community Slack workspace for that type of discussion. You can join community.slack.com from here to find other folks working on Slack apps.
"For this reason, if you don't mind, may I close this topic now?"
Let us close this issue now. Feel free to write in as necessary and/or open a new issue if you encounter an issue specific to this library.
I am experiencing an issue when running slack bolt on AWS Lambda: specifically, the module does not load fast enough for the ack() to be sent before 3 seconds on first load. Whenever the loaded modules are no longer cached, this issue happens again. We moved to slack bolt from a previous method where we had an SQS queue -> Lambda -> Lambda, where the first lambda would return a 200 and the second would do the actual processing, because slack bolt promises to be simpler and more feature rich. We have some special wrapping to support creating a Class wrapper around the app to prevent issues with caching global variables in Lambda, as well as support for a single entrypoint /command from which a subcommand can be supplied with arguments, in order to support more dynamic additions of commands to the bot without needing to create extra /commands on api.slack.com.
My main question is: Are we using this module incorrectly, or is there some sort of tuning we can perform to eliminate this error?
Reproducible in:
MCVE
The slack_bolt version
Latest version
Python runtime version
3.8
OS info
AWS Lambda's python environment
Steps to reproduce:
(Share the commands to run, source code, and project settings (e.g., setup.py))
[...] /entrypoint with some say()
[...] say() message being sent
[...] /entrypoint again
Expected result:
No "Operation Timeout Error" on first call
Actual result:
Timeout when module is not cached then no timeout when Lambda caches it. Message is sent in both cases but the first degrades user experience.
Requirements
Please read the Contributing guidelines and Code of Conduct before creating this issue or pull request. By submitting, you are agreeing to those rules.