slackapi / bolt-python

A framework to build Slack apps using Python
https://tools.slack.dev/bolt-python/
MIT License
1.07k stars 245 forks source link

Intermittent Shortcut Launch Errors from app on AWS Lambda #749

Closed JustinNewport closed 1 year ago

JustinNewport commented 2 years ago

Appears that either slack fails to respond to auth.test call from App object during initialization, or response is lost on AWS side. Issue is intermittent as appears to follow no pattern.

Reproducible in:

AWS Lambda slack_bolt 1.15.1 (applied as layer, also fails when zipped and packaged) slack_sdk 3.19.1 (applied as layer, also fails when zipped and packaged)

Python runtime version

Python 3.7

Expected result:

Receive a response from https://www.slack.com/api/auth.test via App() instantiation.

Actual result:

  1. Launch shortcut from Slack
  2. Modal fails to appear in slack. Lambda function spins (even with timeout over 10 seconds), waiting for above response.

Output of CloudWatch log when failure occurs below. This would be followed by a task timeout message:

2022-10-24 13:15:26,362 Sending a request - url: https://www.slack.com/api/auth.test, query_params: {}, body_params: {}, files: {}, json_body: None, headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Authorization': '(redacted)', 'User-Agent': 'Bolt/1.15.1 Python/3.7.13 slackclient/3.19.1 Linux/4.14.255-276-224.499.amzn2.x86_64'}
  1. Other times, no issue occurs. Pattern appears to be random. Successful response minutes later:
2022-10-24 13:21:00,647 Sending a request - url: https://www.slack.com/api/auth.test, query_params: {}, body_params: {}, files: {}, json_body: None, headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Authorization': '(redacted)', 'User-Agent': 'Bolt/1.15.1 Python/3.7.13 slackclient/3.19.1 Linux/4.14.255-276-224.499.amzn2.x86_64'}

2022-10-24 13:21:00,701 Received the following response - status: 200, headers: {'date': 'Mon, 24 Oct 2022 13:21:00 GMT', 'server': 'Apache', 'x-powered-by': 'HHVM/4.153.1', 'access-control-allow-origin': '*', 'referrer-policy': 'no-referrer', 'x-slack-backend': 'r', 'x-slack-unique-id': 'Y1aRPMgQRAv8zYfVGPJykAAAAAc', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'access-control-allow-headers': 'slack-route, x-slack-version-ts, x-b3-traceid, x-b3-spanid, x-b3-parentspanid, x-b3-sampled, x-b3-flags', 'access-control-expose-headers': 'x-slack-req-id, retry-after', 'x-oauth-scopes': 'commands', 'expires': 'Mon, 26 Jul 1997 05:00:00 GMT', 'cache-control': 'private, no-cache, no-store, must-revalidate', 'pragma': 'no-cache', 'x-xss-protection': '0', 'x-content-type-options': 'nosniff', 'x-slack-req-id': '48d1efcba29405abd432832d35463266', 'vary': 'Accept-Encoding', 'content-type': 'application/json; charset=utf-8', 'x-envoy-upstream-service-time': '18', 'x-backend': 'main_normal main_bedrock_normal_with_overflow main_canary_with_overflow main_bedrock_canary_with_overflow main_control_with_overflow main_bedrock_control_with_overflow', 'x-server': 'slack-www-hhvm-main-iad-thgm', 'x-slack-shared-secret-outcome': 'no-match', 'via': 'envoy-www-iad-3jcf, envoy-edge-iad-sv5m', 'x-edge-backend': 'envoy-www', 'x-slack-edge-shared-secret-outcome': 'no-match', 'connection': 'close', 'transfer-encoding': 'chunked'}, body: 
{
    "ok": true,
    "url": |removed|,
    "team": |removed|, 
    "user": |removed|,
    "team_id": |removed|,
    "user_id": |removed|,
    "bot_id": |removed|,
    "is_enterprise_install": false
}

But then a few more minutes later, nothing happens. So this is not a lambda cold start issue.

Abbreviated code for simplicity, but is representative of structure

import logging
from slack_bolt import App
from slack_bolt.adapter.aws_lambda import SlackRequestHandler

SlackRequestHandler.clear_all_log_handlers()
logging.basicConfig(format="%(asctime)s %(message)s", level=logging.DEBUG)

app = App(process_before_response=True)

@app.shortcut("our-callback-id")
def launch_app(ack, client, shortcut):
    ack()
    client.views_open(
        trigger_id=shortcut['trigger_id'],
        view= {}
    )

def handler(event, context):
    slack_handler = SlackRequestHandler(app=app)
    resp = slack_handler.handle(event, context)
    return resp

Requirements

Please read the Contributing guidelines and Code of Conduct before creating this issue or pull request. By submitting, you are agreeing to those rules.

eddyg commented 2 years ago

This sounds like it could be related to the "cold start" delays in AWS Lambda, causing your application to respond slowly after some amount of time. Check out this blog post for some details around how and when cold starts can occur.

There are several (closed) issues in this repo that talk about possible work-arounds; searching for these issues may provide some suggestions.

JustinNewport commented 2 years ago

This sounds like it could be related to the "cold start" delays in AWS Lambda, causing your application to respond slowly after some amount of time. Check out this blog post for some details around how and when cold starts can occur.

There are several (closed) issues in this repo that talk about possible work-arounds; searching for these issues may provide some suggestions.

Does not appear to be similar to these issues. It can successfully process a number of successive calls and then suddenly stop working. Additionally, all modules are loaded without issue. The app hangs waiting for a response from Slack

seratch commented 2 years ago

Hi @JustinNewport, with given information, I cannot tell the direct cause of the auth.test API call issue but one thing you may want to try would be changing the timeout setting for API calls. With that, you may observe some behavior changes of your app.

app = App(
    # The default timeout is 30 seconds
    client=WebClient(timeout=5),
    process_before_response=True,
)

Also, when the scenario occurs, what do the lambda function logs show for you? Especially, I am curious about the execution timing logs on the AWS Lambda side, not your bolt-python app execution.

I hope you will figure out the cause soon!

JustinNewport commented 2 years ago

Hi @seratch, I tried the above approach today and saw no difference in results. It would work now and then, but when it did not, the only thing that showed up in the lambda logs was the same output as above. See also below. This failed run followed a successful run just 6 minutes prior lambdaLogs_withWebClientObject_failure.txt

seratch commented 2 years ago

@JustinNewport With given information, I don't think that this is an SDK issue. Since auth.test API is pretty stable, I cannot think of such frequent outage of the endpoint.

If you haven't yet contacted our customer support agents, I would recommend asking them to do some Slack server-side investigation. You can submit an inquiry from either /feedback in your Slack workspace or https://my.slack.com/help/requests/new.

Once you share your workspace ID (team_id) and App ID (api_app_id / https://api.slack.com/apps/{api_app_id}) with them, they can check the records on your auth.test API calls. If some of auth.test requests from your app do not reach to Slack, no server-side logs should exist. If your app is struggling to receive responses from Slack for some reason (e.g., network, AWS-related issues), Slack server-side should have logs on the API calls for sure.

JustinNewport commented 2 years ago

Hi @seratch, it turns out that our Lambda timeout was too short, cutting off the timeout for the sdk. Once we set the Lambda to >5 minutes, the sdk timeout (appears to be 4 min, 30 seconds) was reached which allowed a stack trace to be generated.

This of course reveals that while the Lambda logs showed

Sending a request - url: https://www.slack.com/api/auth.test, query_params: {}, body_params: {}, files: {}, json_body: None, headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Authorization': '(redacted)', 'User-Agent': 'Bolt/1.15.1 Python/3.7.13 slackclient/3.19.1 Linux/4.14.255-276-224.499.amzn2.x86_64'}

it was in fact not sending that request. The sdk's logging shows that URLError is occurring

[ERROR] URLError: <urlopen error timed out>
Traceback (most recent call last):
  File "/var/lang/lib/python3.7/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/var/lang/lib/python3.7/imp.py", line 171, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 696, in _load
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/var/task/storeCredit.py", line 13, in <module>
    app = App(process_before_response=True)
  File "/opt/python/slack_bolt/app/app.py", line 329, in __init__
    url_verification_enabled=url_verification_enabled,
  File "/opt/python/slack_bolt/app/app.py", line 359, in _init_middleware_list
    auth_test_result = self._client.auth_test(token=self._token)
  File "/opt/python/slack_sdk/web/client.py", line 1532, in auth_test
    return self.api_call("auth.test", params=kwargs)
  File "/opt/python/slack_sdk/web/base_client.py", line 156, in api_call
    return self._sync_send(api_url=api_url, req_args=req_args)
  File "/opt/python/slack_sdk/web/base_client.py", line 194, in _sync_send
    additional_headers=headers,
  File "/opt/python/slack_sdk/web/base_client.py", line 294, in _urllib_api_call
    response = self._perform_urllib_http_request(url=url, args=request_args)
  File "/opt/python/slack_sdk/web/base_client.py", line 473, in _perform_urllib_http_request
    raise err
  File "/opt/python/slack_sdk/web/base_client.py", line 397, in _perform_urllib_http_request
    resp = self._perform_urllib_http_request_internal(url, req)
  File "/opt/python/slack_sdk/web/base_client.py", line 504, in _perform_urllib_http_request_internal
    resp = urlopen(req, context=self.ssl, timeout=self.timeout)  # skipcq: BAN-B310
  File "/var/lang/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/var/lang/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/var/lang/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/var/lang/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/var/lang/lib/python3.7/urllib/request.py", line 1393, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "/var/lang/lib/python3.7/urllib/request.py", line 1352, in do_open
    raise URLError(err)

Open question: has anyone experienced an issue with this on Lambda? Thanks in advance.

github-actions[bot] commented 1 year ago

👋 It looks like this issue has been open for 30 days with no activity. We'll mark this as stale for now, and wait 10 days for an update or for further comment before closing this issue out. If you think this issue needs to be prioritized, please comment to get the thread going again! Maintainers also review issues marked as stale on a regular basis and comment or adjust status if the issue needs to be reprioritized.

filmaj commented 1 year ago

Very interesting @JustinNewport . Are your Lambdas behind a VPC, by any chance? Unclear why opening a connection would timeout - perhaps DNS resolution? Maybe Slack experienced an outage around the time of these requests?

While it may be separate from this particular issue regarding URL opening timeout, I would also caution, in general, to not discount cold starts when working in Lambda. Lambdas can experience cold starts after either 30 mins of inactivity or an increase in the number of concurrent executions of a particular Lambda. In the former case, the number of Lambda containers warm and active and available to execute drops as idle time increases. In the latter case, as the number of concurrent executions rises, the need to spin up more concurrent containers able to run Lambdas increases, leading to an increase in cold starts (linearly with the number of new concurrent executions needed).

JustinNewport commented 1 year ago

@filmaj after your comment we looked into our vpc/subnet arrangement and that was indeed the issue. we had 2 subnets attached; one could reach the internet and the other could not, hence the intermittent failures.

filmaj commented 1 year ago

Glad to hear it got resolved!