slackapi / bolt-js

A framework to build Slack apps using JavaScript
https://slack.dev/bolt-js
MIT License

Bolt App Methods Hanging with `ETIMEDOUT` Warning During `chat.postMessage` Execution #2212

Open dianexjeong opened 4 weeks ago

dianexjeong commented 4 weeks ago

@slack/bolt version

    "@slack/bolt": "^3.13.3",

Your App and Receiver Configuration

const receiver = new ExpressReceiver({
  signingSecret: process.env.SLACK_SIGNING_SECRET,
  port: process.env.SLACK_APP_PORT,
  clientId: process.env.SLACK_CLIENT_ID,
  clientSecret: process.env.SLACK_CLIENT_SECRET,
  stateSecret: "my-state-secret",
  scopes: ["chat:write", "chat:write.customize", "im:history", "im:write", "users:read", "users:read.email"],
  installerOptions: {
    userScopes: ["im:write", "chat:write"]
  }
...
})

const app = new App({
  receiver,
});

Node.js runtime version

v18.17.1

Expected result:

I'm experiencing VERY slow responses across all of my app's features. I would expect fast (if not immediate) responses for each of them: a) installing the app, b) opening modals, c) submitting modals, d) sending messages... etc.

One specific incident illustrates the slowness: a) I submitted a modal with select options and text inputs. b) The following code is triggered on submission:

// 1) inserting information into my DB (a parameterized sketch of this query follows below)
const relayQueueInsertQuery = `INSERT INTO relay_queue (relay_idx, team_id, user_id, triggered_messages_idx) VALUES (${relayIdx}, '${teamId}', '${relayUser}', ${messageInsertion.insertId});`;
await queryDBWithPool(pool, relayQueueInsertQuery);

// 2) sending a message to user A
await client.chat.postMessage({
  channel: relayUser,
...
})

// 3) sending a message to user B (result assigned here so step 4 can use it)
const postMsg = await client.chat.postMessage({
  channel: userId,
...
})

// 4) inserting information into my DB
await savePostToDB(postMsg.ts, postMsg.channel, postMsg.message.team, "RELAY_WHALE_CHANNEL");

c) I expect the above four steps (1-4) to complete immediately, or at least within a second or two.
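(Side note on step 1: interpolating values into the SQL string risks injection. A minimal parameterized sketch, assuming a mysql2-style pool behind the queryDBWithPool helper:)

// Sketch only: assumes a mysql2-compatible pool. Placeholder values are
// bound by the driver instead of being interpolated into the SQL string.
const relayQueueInsertQuery =
  "INSERT INTO relay_queue (relay_idx, team_id, user_id, triggered_messages_idx) VALUES (?, ?, ?, ?)";
await pool.query(relayQueueInsertQuery, [
  relayIdx,
  teamId,
  relayUser,
  messageInsertion.insertId,
]);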

Actual result:

In the above incident, I observed that step 1), inserting into the DB, completed immediately after the modal submission, but steps 2)-4) did not. It took over a minute for the remaining steps to complete, and the app appeared to hang at the second step, the postMessage method. After about a minute, the remaining steps executed almost simultaneously. The log showed a [WARN] bolt-app http request failed connect ETIMEDOUT 52.192.46.121:443 warning, which seems related to the incident. Here is the full log:

2024-08-16 21:53:55: [WARN]  bolt-app http request failed connect ETIMEDOUT 52.192.46.121:443

The timestamp on the left is when it occurred; I am in the Asia/Seoul time zone. There were no other error messages alongside the log above.

Our team recently raised an issue about the axios ETIMEDOUT errors (#2185). That was partially resolved by using the native fetch method instead of axios, as the error seemed to originate from axios itself. However, the warning message I saw seems related to the Bolt framework, not axios, since the warning does not carry an axios tag.
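(For context, the axios-to-fetch swap for a webhook POST could look roughly like this - a sketch; the env var name is illustrative, and Node 18+ provides a global fetch:)

// Sketch: posting to an incoming webhook with native fetch instead of axios.
async function postViaFetch(text) {
  const res = await fetch(process.env.SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) console.error(`webhook POST failed with status ${res.status}`);
}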

As I commented in that issue, I am experiencing VERY slow behavior from my app from time to time, even for modal opens and app installations - I am getting expired_trigger_id errors as well. Unfortunately, it doesn't appear in a fixed pattern, and I am not able to reproduce the error (issue) on demand.

Is this also related to how Bolt processes events with axios? It would be a huge help if you could look into this - it has been bugging our team for around a month already. :(

Thank you in advance!


zimeg commented 4 weeks ago

Hey @dianexjeong 👋 Thanks for sharing this! I caught up on #2185 and believe these errors are caused by slow or failing server responses rather than by app-related settings. It's super interesting that similar errors appear with both direct axios calls and client calls.

I do also find it interesting that this is showing most often with IPs associated with AWS Tokyo 🤔 We're poking into possible connectivity issues around this for both hooks.slack.com and regular API calls, but I don't have information to share right now.

Without a reliable way to reproduce this, debugging might be difficult, but I have a few questions and hopes:

- Is this app hosted on AWS Lambda? This https://github.com/slackapi/bolt-js/issues/1643#issuecomment-1302471011 might be useful in processing responses if so!
- Is it possible to test with shadowing fetch calls - meaning duplicating the client.chat.postMessage with a direct use of fetch? I'm hoping to find if these errors happen for both requests or if it's just axios specific.
- Does your app make additional HTTP requests? I'm wondering if this could be caused by general connection issues that are unrelated to the Slack API.

I understand how frustrating both slow requests and inconsistent errors are, but I'm also hoping we can find the cause, or at least a fix, for this! 🙏

zimeg commented 4 weeks ago

I'm also wondering if updating Bolt to ^3.21.1 is possible? The latest version includes axios@1.7.3, which can use the fetch adapter as an option. Related: slackapi/node-slack-sdk#1525.

Edit: I misread the axios release notes! fetch is an option for the adapter and not the default! A test replacing this in @slack/web-api@7.3.4 seems to work in an app but fails unit tests 🤔
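(For anyone wanting to experiment: since axios 1.7 the fetch adapter can be opted into per instance. A sketch; the channel ID is a placeholder:)

const axios = require("axios");

// Opt in to the fetch adapter; the default http adapter is unchanged.
const http = axios.create({ adapter: "fetch" });

async function postHello() {
  const res = await http.post(
    "https://slack.com/api/chat.postMessage",
    { channel: "C0123456789", text: "hello" }, // placeholder channel ID
    { headers: { Authorization: `Bearer ${process.env.SLACK_BOT_TOKEN}` } }
  );
  console.log(res.data.ok);
}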

dianexjeong commented 4 weeks ago

Hello @zimeg!

Thank you so much for the response! 😄 I updated Bolt to ^3.21.1. Could you please confirm if there are any additional steps required to trigger the use of the fetch adapter after this update, or is simply updating Bolt sufficient to make this change?

Also regarding the questions:

Is this app hosted on AWS Lambda? This https://github.com/slackapi/bolt-js/issues/1643#issuecomment-1302471011 might be useful in processing responses if so!

No, I believe the app is not hosted on AWS Lambda.

Is it possible to test with shadowing fetch calls - meaning duplicating the client.chat.postMessage with a direct use of fetch? I'm hoping to find if these errors happen for both requests or if it's just axios specific.

Sure, I'll test it out and bring back the results!
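(A shadow call along those lines might look like this - a sketch; shadowPostMessage and its parameters are illustrative:)

// Send the same message twice: once through the Bolt client, once through
// a direct fetch call, logging how long each request takes.
async function shadowPostMessage(client, token, channel, text) {
  const t0 = Date.now();
  await client.chat.postMessage({ channel, text });
  console.log(`client.chat.postMessage: ${Date.now() - t0}ms`);

  const t1 = Date.now();
  const res = await fetch("https://slack.com/api/chat.postMessage", {
    method: "POST",
    headers: {
      "Content-Type": "application/json; charset=utf-8",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ channel, text }),
  });
  const body = await res.json();
  console.log(`direct fetch: ${Date.now() - t1}ms, ok=${body.ok}`);
}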

Does your app make additional HTTP requests? I'm wondering if this could be caused by general connection issues that are unrelated to the Slack API.

I am calling the usageCC module mentioned in https://github.com/slackapi/bolt-js/issues/2185 - but it runs as the last step, after all the client calls have executed. Since postMessage was hanging, the HTTP request in usageCC had not yet been made, so I don't suspect that the additional HTTP request caused the warning.

zimeg commented 3 weeks ago

@dianexjeong For sure! Thanks for sharing more about how these requests are being processed 🙏

Could you please confirm if there are any additional steps required to trigger the use of the fetch adapter after this update, or is simply updating Bolt sufficient to make this change?

I was a bit eager in thinking this - the default didn't change, but fetch did become an option 😳 A few changes to @slack/web-api and Bolt could make this an experimental option to test, but I haven't looked into it yet. From this comment I'm not certain fetch will make a difference though 🤔

We're searching for server details around timeout issues, but I'm also curious if your app is running behind a proxy or VPN? Sometimes this can upset requests or responses without being super clear 👀

dianexjeong commented 3 weeks ago

No problem @zimeg! Please let me know if you need any additional information. I would be happy to provide as much as I can 😄

Yeah, I tried refactoring the HTTP POST request with fetch, but nothing changed, so I rolled back to the original axios method.

I'm also curious if your app is running behind a proxy or VPN?

Nope, we are not running behind a proxy or VPN. Requests are routed through AWS's Elastic Load Balancer (ELB) and then directed to an EC2 instance hosted in the Seoul region. The web server runs Nginx, and the application is managed directly by the PM2 process manager.

zimeg commented 3 weeks ago

@dianexjeong Some experimental changes are on the web-api-6.x-experimental-axios-adapter branch if you want to try this - a few build instructions are in the commit! I'm not certain how well fetch works in all cases, but in quick testing it seems fine.

~Somehow I was redirected to the actual chat.update page in browser with one attempt, but I was also running two connections to the app~ 🤔 ~Never seen this before!~ Edit: Realized I just clicked a link from the logs 😂

The additional details are super helpful and I can't think of immediate problems with the setup, but perhaps some logs show something strange in either inbound or outbound requests?

dianexjeong commented 3 weeks ago

The additional details are super helpful and I can't think of immediate problems with the setup, but perhaps some logs show something strange in either inbound or outbound requests?

@zimeg I'm not getting any logs other than the [WARN] warning. 😞 If there's a way to log information that would help break this problem down, please let me know!
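(One way to surface more detail, as a sketch using Bolt's logLevel option rather than anything prescribed in this thread:)

const { App, LogLevel } = require("@slack/bolt");

// DEBUG makes the underlying @slack/web-api client log each outgoing API
// call, including retries, which helps pinpoint which method hit the timeout.
const app = new App({
  receiver,
  logLevel: LogLevel.DEBUG,
});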

dianexjeong commented 3 weeks ago

2024-08-19 14:12:41: [WARN] bolt-app http request failed connect ETIMEDOUT 35.74.58.174:443

@zimeg A quick follow-up - I got this again. Is there a way to get notified whenever these warnings occur? I guess they are not categorized as 'errors', so they don't trigger the errorCC handler I created. (Right now, I am catching these by staring at the log display! 😅) Also, is there a way to include the full body of the error, or the full warning message, in the WARN log? It would help to know which method ran into the timeout, but I'm not seeing any useful information in the one-line log.
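(On getting notified: Bolt accepts a custom logger, so one sketch is wrapping the default ConsoleLogger and forwarding warnings. notifyTeam here is a hypothetical helper, e.g. a POST to an alerting webhook:)

const { App } = require("@slack/bolt");
const { ConsoleLogger } = require("@slack/logger");

// Forward WARN/ERROR output somewhere noisier than the console.
class AlertingLogger extends ConsoleLogger {
  warn(...msgs) {
    super.warn(...msgs);
    notifyTeam(`[WARN] ${msgs.join(" ")}`); // hypothetical alert hook
  }
  error(...msgs) {
    super.error(...msgs);
    notifyTeam(`[ERROR] ${msgs.join(" ")}`);
  }
}

const app = new App({ receiver, logger: new AlertingLogger() });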

dianexjeong commented 3 weeks ago

@zimeg New update! I ran into this error with another app that I am running. This time, it occurred during my axios call to chat.postMessage.

I noticed that the ETIMEDOUT error occurs across all kinds of requests my apps send to Slack's servers, including a) POSTs to webhooks, b) axios calls to chat.postMessage, and c) client method calls within the app. All of the error messages, regardless of request type and app, point to the following four IP addresses:

The above IPs belong to Slack, in the AWS Tokyo region - they match the result of an nslookup of hooks.slack.com.
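(To reproduce that comparison from inside the app's environment, a sketch using Node's dns module:)

const dns = require("node:dns/promises");

// Log the A records the host actually resolves for Slack endpoints,
// mirroring the nslookup comparison above.
async function checkSlackResolution() {
  for (const host of ["slack.com", "hooks.slack.com"]) {
    console.log(host, await dns.resolve4(host));
  }
}
checkSlackResolution();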

I'm assuming the ETIMEDOUT errors originate from Slack's servers. Could you look into this?

Really appreciate your help!

bonespiked commented 3 weeks ago

FWIW - I've been getting the ETIMEDOUT errors when using axios against the Slack APIs lately, but only for the past 3-4 weeks; prior to that, everything worked fine. I see about 13 IPs being hit, with some overlap with the ones listed above.

tomohiroyoshida commented 2 weeks ago

Hello!

My Slack app running on AWS ECS (region: ap-northeast-1) has also hit this error several times since the beginning of August 2024. 😢 It has been in operation for over a year, but this error had never occurred before.

filmaj commented 2 days ago

Hi, checking in here. Are you all still experiencing issues @dianexjeong @bonespiked @tomohiroyoshida? It seems to be regional, from what I can tell. Also worth double-checking whether any proxies are in play in your setups.

dianexjeong commented 1 day ago

Hello @filmaj! Thanks for checking in.

Yes, the timeouts are still present and nope, we are not running behind any proxies.

filmaj commented 1 day ago

@tomohiroyoshida mentioned their app runs in the AWS Tokyo region, and the IPs that @dianexjeong's app resolved the Slack API to also exist in the AWS Tokyo region.

@SEEWON @dianexjeong @NatsukiOka @bonespiked can you confirm which hosting provider and region your app runs in? The reports here seem to point to low-level infrastructure issues in or around the AWS Tokyo region - we want to confirm whether this is a regional problem, and knowing where your app is hosted can help confirm or disprove that theory. Thanks!

dianexjeong commented 1 day ago

@SEEWON and I are also hosting our app in the AWS Tokyo region!

bonespiked commented 1 day ago

For ours, it wasn't a slack issue - someone had dorked up all our routing tables so 1/3 of requests were correct, 1/3 were being routed out of EU, and 1/3 were dead. It was quite a tirade of profanity upon my discovery of this....

filmaj commented 1 day ago

Sorry to hear about that @bonespiked! Good news that y'all figured it out though. It also partially confirms my hypothesis that this has to do with low-level networking issues - misconfigured routing tables at least make sense in terms of ETIMEDOUT and other related network errors being raised in this situation.

For the record, we have had other customer reports come in that are very similar to this, and all are in or close to the AWS Tokyo region. We continue to work with our infrastructure team to try to get more visibility into this issue. It may even be an AWS problem - hard to say at this point. Any additional information you can provide (@dianexjeong) is helpful, such as:

dianexjeong commented 1 day ago

It's a relief that it wasn't just my problem :)

To answer your questions,

Please let me know if you need any other information! I'd be glad to provide as much as I can.