"Operation Timeout Error" on first load for AWS Lambda with SSM

RyanSchaefer commented 3 years ago

I am experiencing an issue when running slack bolt on AWS Lambda: the module does not load fast enough for the ack() to be sent within 3 seconds on first load. Whenever the loaded modules are no longer cached, the issue happens again. We moved to slack bolt from a previous setup where we had an SQS queue -> Lambda -> Lambda, in which the first Lambda would return a 200 and the second would do the actual processing, because slack bolt promises to be simpler and more feature-rich. We have some special wrapping: a class wrapper around the app to prevent issues with caching global variables in Lambda, and a single entrypoint /command that takes a subcommand with arguments, so that commands can be added to the bot dynamically without creating extra /commands on api.slack.com.

My main question is: Are we using this module incorrectly, or is there some sort of tuning we can perform to eliminate this error?

Reproducible in:

MCVE

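# Imports inferred from the names used below; SSMParameter is assumed to come from the ssm-cache package
import logging
import os

import aws_lambda_logging
import slack_bolt
from slack_bolt.adapter.aws_lambda import SlackRequestHandler
from ssm_cache import SSMParameter


def supply_self(func, this):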
    """
    Adapts from a three argument function to a two argument function
    :param func: the function to supply self to
    :param this: renamed self to prevent conflicts with how bolt wires its functions
    :return: a function which can be called by bolt
    """
    return lambda respond, body: func(this, respond, body)

class BasicBot:
    """
    Instead of making app global (which could cause caching issues), we want to make it a field of an object we can create
    """

    def __init__(self, entrypoint_name: str, *args):
        self.app = slack_bolt.App(process_before_response=True,
                                  signing_secret=SSMParameter(os.environ["SIGNING_SECRET_LOCATION"], max_age=3600).value,
                                  token=SSMParameter(os.environ["SLACK_API_LOCATION"], 3600).value)
        self.app.command(entrypoint_name)(ack=lambda body, ack: ack(),
                                                  # unable to call self.func here because slack bolt assumes
                                                  # procedural code, by supplying self as a different argument we can
                                                  # circumvent this
                                                  lazy=[supply_self(func, self)])
        # the above `func` comes from a decorated method within the class
        # ...

def lambda_handler(event, context):
    aws_lambda_logging.setup(aws_request_id=context.aws_request_id, level="INFO")
    logging.info(event)
    bot = BasicBot(os.environ["COMMAND_ENTRYPOINT"])
    return SlackRequestHandler(app=bot.app).handle(event, context)

The `slack_bolt` version

Latest version

Python runtime version

3.8

OS info

AWS Lambda's python environment

Steps to reproduce:

Deploy Lambda
call /entrypoint with some say()
observe operation timeout
observe say() message being sent
call /entrypoint again
observe no timeout

Expected result:

No "Operation Timeout Error" on first call

Actual result:

Timeout when module is not cached then no timeout when Lambda caches it. Message is sent in both cases but the first degrades user experience.

seratch commented 3 years ago

Thanks for asking the question. I can check the actual behavior by deploying a Lambda function with a similar structure later. Before that, let me share some general thoughts on this:

I have been operating a Bolt for Python app on AWS Lambda for a while, and the app has never encountered cold-start issues. slack_bolt depends only on the slack_sdk package, and slack_sdk does not have any additional dependencies(!). So, loading only these modules should not take long.
JFYI: This is not specific to Bolt. When I tried the AWS Chalice library for Slack apps, I often encountered cold starts and Slack's 3-second timeouts with it. I think Chalice is an amazing library and a great fit for many use cases. However, specifically for Slack apps, it may not be a better solution than a thin Lambda function.
Similarly, I think some other dependencies may take a while to initialize. In my past experience, I needed to warm up AWS S3 client instances (from the AWS Java SDK) because they take a long time to initialize. I don't think using SSMParameter is costly, but if something similar is happening in your app, that would explain it. If this assumption is correct, you can move the initialization into the lazy function (that is, move heavy module imports and/or initialization inside a lazy function).
I'm curious about the "with some say()" in your repro steps. Do you have these API calls in the ack() listener function or in one of the lazy functions? If you have them in the ack function, you can move them to lazy functions. The response time from the Slack API server is consistent, but performing multiple Slack API calls in the ack function is not recommended.

If you already have a minimal working example that reproduces the issue, sharing the source is helpful to understand your issue.
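
The suggestion above to move heavy initialization into a lazy function can be sketched roughly as follows. This is a minimal illustration using only the standard library, not Bolt itself: get_heavy_client, ack_listener and lazy_listener are hypothetical names, and the sleep stands in for whatever is slow to initialize. It does not help when the slow step is needed to construct the App itself (for example, fetching the signing secret).

```python
import functools
import time


@functools.lru_cache(maxsize=None)
def get_heavy_client():
    """Hypothetical stand-in for a dependency that is slow to initialize."""
    time.sleep(0.05)  # simulates slow SDK setup or a remote fetch
    return {"ready": True}


def ack_listener(ack):
    # Touches nothing heavy, so the acknowledgement is not delayed.
    ack()


def lazy_listener(respond):
    # The first call pays the initialization cost; later calls in the same
    # warm process reuse the cached object.
    client = get_heavy_client()
    respond(f"client ready: {client['ready']}")
```

With Bolt, the two functions would be registered as the ack and lazy listeners of a command, in the same way as in the MCVE.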

RyanSchaefer commented 3 years ago

We are using SAM rather than Chalice. Here is a minimal template to deploy:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  QueueProcessor:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ./src
      Events:
        HttpApiEvent:
          Type: HttpApi
      Handler: # ... lambda_handler
      Policies:
        - Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action:
                - logs:CreateLogGroup
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: !Sub "arn:aws:logs:${AWS::Region}:${AWS::AccountId}:*"
            # function must be able to call itself in order to 1st handle acking within 3 seconds then 2nd
            # process the actual message
            - Effect: Allow
              Action: "lambda:InvokeFunction"
              Resource: !Sub "arn:${AWS::Partition}:lambda:${AWS::Region}:${AWS::AccountId}:function:${AWS::StackName}-*"

The ack() is separate from the response. I tried refactoring to this:

app = slack_bolt.App(process_before_response=True,
                     signing_secret=SSMParameter(os.environ["SIGNING_SECRET_LOCATION"], max_age=3600).value,
                     token=SSMParameter(os.environ["SLACK_API_LOCATION"], 3600).value)
app.event(re.compile(".*"))(
    ack=lambda body, ack: ack(),
    lazy=[lambda respond, body: SlackRequestHandler(app=BasicBot(os.environ["COMMAND_ENTRYPOINT"]).app).handle(event, context)])

to basically ack() the message as soon as possible, but I think that, again, the fetching the SSMParameter is taking too long and pushing the ack() back enough to cause a timeout. Enough is cached that on subsequent calls there are no issues.

It's just tricky because I essentially need an App without secrets to handle the initial ack(), then once that is processed I can do as much processing/loading as I want.
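
One way to confirm which step is eating the 3-second budget is to time each initialization step and log the result. The following is a minimal sketch using only the standard library; timed and timings are hypothetical names, and the sleep is a placeholder for the real call being measured (for example, the SSM lookup).

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)
timings = {}  # step name -> elapsed seconds, kept for inspection


@contextmanager
def timed(step):
    """Record and log how long the wrapped block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = time.perf_counter() - start
        logger.info("%s took %.3fs", step, timings[step])


# Usage: replace the sleep with the real call being measured.
with timed("load_secrets"):
    time.sleep(0.01)  # placeholder for the slow step
```

Comparing the logged durations between a cold start and a warm invocation shows whether the secret lookup, the imports, or something else dominates.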

With regard to the last point: I actually use respond() in a function separate from the ack. As you can see in the initially posted code, my ack is a minimal function which just calls the ack provided to it:

ack=lambda body, ack: ack()

mew1033 commented 3 years ago

@RyanSchaefer I solved this by setting up the API Gateway to respond to Slack for me if the Lambda function is about to time out. Here's the relevant part of my SAM template:

  ApiGatewayApi:
    Type: AWS::Serverless::Api
    Properties:
      DefinitionBody:
        openapi: 3.0.1
        info:
          title: API Title
        servers:
        - url: /v1
        paths:
          /slack:
            post:
              summary: Main Slack posting api resource.
              description: Main Slack posting resource. This is where you point the slack endpoint to receive events.
              responses:
                "200":
                  description: Request Received Okay!
              x-amazon-apigateway-integration:
                type: aws_proxy
                uri: arn-to-lambda-function
                httpMethod: POST
                timeoutInMillis: 2300
                responses:
                  default:
                    statusCode: 200
      GatewayResponses:
        INTEGRATION_TIMEOUT:
          StatusCode: 200

Specifically the timeoutInMillis and GatewayResponses sections.

RyanSchaefer commented 3 years ago

Oh wow, that is super handy! I am sure I could also tweak that to always respond within 3 seconds then use respond or say to send the actual message later

seratch commented 3 years ago

@mew1033 Thanks for sharing the knowledge! This is interesting. If your app handles only Events API requests and does not use any interactive features (e.g., buttons, modals), this workaround should work.

@RyanSchaefer Thanks for sharing the details.

> the fetching the SSMParameter is taking too long

So, this is the root cause of this issue. In this case, API Gateway + AWS Lambda may not be a great fit for running your Slack apps (as long as you use SSM for loading credentials). This will affect not only cold-start timeouts but also the quality of the user experience (due to occasional slow responses). Also, I don't think there is anything that Bolt and its underlying Python SDK can do to improve the situation.

The only things I can suggest or recommend for this are:

Consider not using SSM for credentials (probably, this is not acceptable)
Consider other AWS services like ECS, Lightsail Containers, EC2, or anything else that does not have cold-start issues

mew1033 commented 3 years ago

> @mew1033 Thanks for sharing the knowledge! This is interesting. If your app handles only Events API requests and does not use any interactive features (e.g., buttons, modals), this workaround should work.

I don't see why this wouldn't work for interactive features as well. Just send back the 200 from the API Gateway itself, then let the function keep working and send its response later. Maybe I'm missing something though, would this model not work?

seratch commented 3 years ago

@mew1033 When you handle modal submissions with response_action (errors, update, push, clear), the only way to do that is to include response_action in your HTTP response by calling ack({"response_action": "errors", "errors": {"your-block-id": "The value in this field must be longer than 5 characters"}}) or similar. See https://api.slack.com/surfaces/modals/using#displaying_errors

So, the essential solution to avoid timeouts (or retries by the Events API) is to ensure your app returns an HTTP response within 3 seconds in all cases.
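
To make the point above concrete, here is a small sketch of building the body that has to travel in the HTTP response itself. build_error_response is a hypothetical helper, not part of Bolt; the payload shape follows the ack() call quoted above.

```python
def build_error_response(field_errors):
    """Build the body passed to ack() to show inline errors in a modal.

    field_errors maps a block_id to the message shown under that block.
    """
    if not field_errors:
        raise ValueError("at least one field error is required")
    return {"response_action": "errors", "errors": dict(field_errors)}
```

Because this dictionary is the HTTP response to the view submission, a gateway that has already answered with an empty 200 leaves no way to deliver it afterwards.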

mew1033 commented 3 years ago

Ah, got it. Thanks!

RyanSchaefer commented 3 years ago

Amazon only supports Gateway responses on its REST API endpoints, but I am using an HttpApi endpoint. For this reason, it appears this issue is still valid.

seratch commented 3 years ago

I just came up with another idea, so let me share it with you.

If it's totally fine for you to just accept any incoming requests (this is what the gateway response approach does), you can have a quite simple internet-facing Lambda function that just enqueues the event data to an AWS queue service (say, SQS or Kinesis Data Streams). For messages in the queue, you can serialize the data as JSON or any text format. Then, when dequeueing the messages in another Lambda function, you can still use Bolt for processing them. In this scenario, returning 200 OK to Slack within 3 seconds should not be hard.

You can verify the request signature as long as you serialize all the data in event and reject unexpected requests. The downsides of this approach are 1) you cannot use ack() in the async Bolt code, and 2) you need to operate a queue system.
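
The front function described above could look roughly like this. It is a sketch under stated assumptions, not a definitive implementation: is_valid_slack_signature and enqueue_handler are hypothetical names, enqueue is any callable that pushes a string to your queue (for example, a thin wrapper around an SQS client), and the event body is assumed to arrive as plain text rather than base64. The check follows Slack's v0 signing scheme (HMAC-SHA256 over "v0:timestamp:body").

```python
import hashlib
import hmac
import json
import time


def is_valid_slack_signature(signing_secret, timestamp, body, signature, now=None):
    """Check Slack's v0 request signature and reject stale timestamps."""
    now = time.time() if now is None else now
    try:
        if abs(now - int(timestamp)) > 60 * 5:
            return False
    except (TypeError, ValueError):
        return False
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = ("v0=" + digest).encode("utf-8")
    return hmac.compare_digest(expected, (signature or "").encode("utf-8"))


def enqueue_handler(event, signing_secret, enqueue):
    """Verify the request, enqueue the raw event as JSON, and return 200."""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    ok = is_valid_slack_signature(
        signing_secret,
        headers.get("x-slack-request-timestamp"),
        event.get("body") or "",
        headers.get("x-slack-signature"),
    )
    if not ok:
        return {"statusCode": 401, "body": "invalid signature"}
    # The whole event is serialized so the consumer can replay it later.
    enqueue(json.dumps(event))
    return {"statusCode": 200, "body": ""}
```

The second function would read the JSON back from the queue and do the slow work there; the signing secret still has to be available to the front function, so a slow secret lookup remains on the request path in this sketch.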

Hope this is helpful to you.

seratch commented 3 years ago

@RyanSchaefer I know the issue you described here may not be resolved yet, but ...

> > the fetching the SSMParameter is taking too long
>
> So, this is the root cause of this issue.

the root cause of your issue is related to the initialization time of the SSM SDK and is not specific to Bolt for Python. I do understand the issue is related to the Slack Platform's 3-second timeout requirement. However, we would like to use this issue tracker mainly for managing this repository's issues.

For this reason, if you don't mind, may I close this topic now?

If you would like to have another open discussion about FaaS deployments in general, I am happy to have another issue for it. But I recommend going to our community Slack workspace for that type of discussion. You can join community.slack.com from here to find other folks working on Slack apps.

seratch commented 3 years ago

> For this reason, if you don't mind, may I close this topic now?

Let us close this issue now. Feel free to write in as necessary and/or open a new issue if you encounter an issue specific to this library.

slackapi / bolt-python