slackapi / java-slack-sdk

Slack Developer Kit (including Bolt for Java) for any JVM language
https://slack.dev/java-slack-sdk/
MIT License
570 stars 209 forks source link

busy to open/close a connection when running slack socket mode for days #1256

Closed oppokui closed 7 months ago

oppokui commented 8 months ago

I used java-slack-sdk v1.36.1 to create a socket client application. Most of time it works perfectly. But when it run for several days, it will jump to open/close session frequently and report rate limited 429 error code. Then the bolt app was hanged there, no response.

Here is some code structures:

App app = new App(AppConfig.builder().singleTeamBotToken(botToken).build());
app.event(AppMentionEvent.class, new BoltEventHandler<AppMentionEvent> () {
   ...
});
new SocketModeApp(SLACK_APP_TOKEN, app).start();

The log when the issue happened:

2023-12-25 01:49:03,909 INFO  [com.vitria.app.genai.service.StorageService] After finishing a batch commit (1), queue left: 0
2023-12-25 01:50:08,864 INFO  [com.vitria.app.genai.service.StorageService] Connection is not used for a while (60 s), close it.
2023-12-25 04:22:57,839 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: d7622816-e949-48cc-9aca-fa77e7dfed2b)
2023-12-25 04:22:57,940 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 0349f864-07ec-4cde-a92e-e8b5185bfe9f, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-25 06:54:44,628 INFO  [com.slack.api.socket_mode.SocketModeClient] The current session is no longer active. Going to reconnect to the Socket Mode server.
2023-12-25 06:54:44,769 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 6bbaf272-8758-406e-8bb5-f861d4b56306)
2023-12-25 06:54:47,877 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: d7622816-e949-48cc-9aca-fa77e7dfed2b, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-25 07:08:16,777 INFO  [com.slack.api.socket_mode.SocketModeClient] The current session is no longer active. Going to reconnect to the Socket Mode server.
2023-12-25 07:08:21,928 ERROR [com.slack.api.socket_mode.SocketModeClient] Failed to reconnect to Socket Mode server: javax.websocket.DeploymentException: Handshake error.
2023-12-25 07:08:30,035 INFO  [com.slack.api.socket_mode.SocketModeClient] The current session is no longer active. Going to reconnect to the Socket Mode server.
2023-12-25 07:08:30,127 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 1ffb916d-ba96-44e8-a0b1-230e85208207)
2023-12-25 07:08:33,241 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 6bbaf272-8758-406e-8bb5-f861d4b56306, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-25 07:08:33,242 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 1ffb916d-ba96-44e8-a0b1-230e85208207, reason: Closed abnormally.)
2023-12-25 07:08:33,305 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 9e9cba35-5857-454b-abf9-f4c37a64ae44)
2023-12-25 07:08:33,316 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 9e9cba35-5857-454b-abf9-f4c37a64ae44, reason: Closed abnormally.)
2023-12-25 07:08:33,342 WARN  [com.slack.api.socket_mode.SocketModeClient] Failed to send a ping message (session id: 9e9cba35-5857-454b-abf9-f4c37a64ae44, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@407dd627 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@4e08e25d[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4])
2023-12-25 07:08:33,363 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: c8acc397-cf96-4889-afd4-39c9d81a3064)
2023-12-25 07:08:33,375 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: c8acc397-cf96-4889-afd4-39c9d81a3064, reason: Closed abnormally.)
2023-12-25 07:08:33,382 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 25884208-430c-442c-9f4a-8dc1f24695e3)
2023-12-25 07:08:33,392 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 25884208-430c-442c-9f4a-8dc1f24695e3, reason: Closed abnormally.)
2023-12-25 07:08:33,412 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: cc95f81a-e8cd-488f-9fdb-349c833aed86)
2023-12-25 07:08:33,423 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: cc95f81a-e8cd-488f-9fdb-349c833aed86, reason: Closed abnormally.)
2023-12-25 07:08:33,428 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 25f04ce7-c3f8-47a8-8d5f-c3cdefd5bb7a)
2023-12-25 07:08:33,438 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 25f04ce7-c3f8-47a8-8d5f-c3cdefd5bb7a, reason: Closed abnormally.)
2023-12-25 07:08:33,481 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: c72abab9-aceb-4092-b068-3cee1a9a15eb)
2023-12-25 07:08:33,490 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: c72abab9-aceb-4092-b068-3cee1a9a15eb, reason: Closed abnormally.)
2023-12-25 07:08:33,494 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 0f882241-29bb-4086-acaf-4d035d700217)
2023-12-25 07:08:33,512 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: d5751d21-0920-47bd-a892-aeb9b6cdde51)
2023-12-25 07:08:33,542 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: f1b41a04-9fe8-435f-8440-97a0085eac81)
2023-12-25 07:08:33,572 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: f7d9e582-ba7a-4767-bcce-8a79ab759355)
2023-12-25 07:08:33,613 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 0f882241-29bb-4086-acaf-4d035d700217, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-25 07:08:33,613 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: d5751d21-0920-47bd-a892-aeb9b6cdde51, reason: Closed abnormally.)
2023-12-25 07:08:33,659 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: db084d14-5617-4d44-8145-b470ba4fc82a)
2023-12-25 07:08:33,660 WARN  [com.slack.api.socket_mode.SocketModeClient] Failed to send a ping message (session id: db084d14-5617-4d44-8145-b470ba4fc82a, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1fab81c5 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@39422152[Shutting down, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 3])
2023-12-25 07:08:33,708 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 5f2d86c7-d969-457d-9a54-b619ab98bdf5)

We mentioned this bolt app in slack desktop client and send the message, the bolt can receive the message and response it. The code worked for 2 or 3 days. Then at "2023-12-25 04:22:57", it stayed at idle for 1 and half hours and started to report open/close session issue. Then the system is going to the dead loop of opening a session but the slack server closing it immediately, then trying to open a new, then reach the rate limits of slack server. Then the bolt app was offline, and no response for any request in slack message.

New session is open (session id: d7622816-e949-48cc-9aca-fa77e7dfed2b)
onClose listener is called (session id: 0349f864-07ec-4cde-a92e-e8b5185bfe9f, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)

How can I add some flag to wait several seconds after a connection is closed so it won't be busy to open a new connection? What is the real cause of this problem?

oppokui commented 8 months ago

slack_threads.txt

Attach the thread dump.

seratch commented 8 months ago

Hi @oppokui, thanks for writing in.

In your situation, Bolt repeatedly tries to reconnect. This indicates that an established connection had been unexpectedly terminated by an external factor. This situation could result from some network connection instability between Slack and our app process. We haven't observed a large scale outage of the Slack API server side these days. Therefore, perhaps the cause could be something on your network infra side. As a solution, making the connectivity more stable in some help may resolve/mitigate the issue, but I cannot suggest anything specific to your situation.

seratch commented 8 months ago

Another possible solution could be to run multiple instances with health check monitoring. You can check if SocketModeApp#getClient()#verifyConnection() returns true. If it returns false, you monitoring system can terminate the process and spin up a new one. You can run up to 10 instances for a Socket Mode app.

oppokui commented 8 months ago

Cool! Thanks for so quick response. The suggestion makes sense. I will try to relaunch the bolt app once the client is dead.

By the way, the bolt app was running in amazon EC2 instance (us-east-1). Don't think the network has any problem, because when I restart the JVM process, it is restored.

I just restarted it, see the log:

2023-12-26  07:18:44,913    INFO    SocketModeClient    New session is open (session id: b72f6910-69a6-4403-860f-91c3de5203bf)
2023-12-26  07:18:44,914    INFO    SocketModeClient    onClose listener is called (session id: 755780b3-658e-4a38-bc14-07f1df5ebf3b, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-26  07:18:44,933    INFO    SocketModeClient    New session is open (session id: 365eaf44-37e1-4049-9c35-2f15d7546b1b)
2023-12-26  07:18:44,969    INFO    SocketModeClient    New session is open (session id: 5025e4de-9e51-435f-8069-9e0ea8ded54e)
2023-12-26  07:18:46,970    INFO    SocketModeClient    onClose listener is called (session id: af665f18-fcda-4662-9313-3da86b3b8e9b, reason: Closed abnormally.)
2023-12-26  07:18:47,071    INFO    SocketModeClient    onClose listener is called (session id: 69ff9cc1-0281-4393-a36b-b9e3ac19cb26, reason: Closed abnormally.)
2023-12-26  07:18:47,126    INFO    SocketModeClient    New session is open (session id: d43392a7-008b-49c5-9722-25b36b25a57c)
2023-12-26  07:18:47,137    INFO    SocketModeClient    onClose listener is called (session id: d43392a7-008b-49c5-9722-25b36b25a57c, reason: Closed abnormally.)
2023-12-26  07:18:47,159    INFO    SocketModeClient    New session is open (session id: 2938968e-36dc-4718-8a90-4be8f9998a80)
2023-12-26  07:18:47,169    INFO    SocketModeClient    onClose listener is called (session id: 2938968e-36dc-4718-8a90-4be8f9998a80, reason: Closed abnormally.)
2023-12-26  07:18:47,187    INFO    SocketModeClient    onClose listener is called (session id: bf9419c1-82a0-49b2-80e6-1af5f74a235e, reason: Closed abnormally.)
2023-12-26  07:18:47,198    INFO    SocketModeClient    onClose listener is called (session id: b3a7efd2-7220-4189-9beb-ecc098940183, reason: Closed abnormally.)
2023-12-26  07:18:47,240    INFO    SocketModeClient    New session is open (session id: 54af17aa-6e91-4d13-b39d-778be9efe208)
2023-12-26  07:18:47,240    INFO    SocketModeClient    New session is open (session id: afc3e2f5-87e6-43ce-8f45-af43180dfb50)
2023-12-26  07:18:47,249    INFO    SocketModeClient    onClose listener is called (session id: 54af17aa-6e91-4d13-b39d-778be9efe208, reason: Closed abnormally.)
2023-12-26  07:18:47,263    ERROR   TeamIdCache Failed to call auth.test API (error: status: 429, error: ratelimited, needed: null, provided: null, warning: null)
2023-12-26  07:18:47,285    INFO    SocketModeClient    New session is open (session id: f4fe5546-ec54-4a4a-9cd7-d8ec7938a646)
2023-12-26  07:18:47,287    INFO    SocketModeClient    onClose listener is called (session id: 4748acaa-4916-453c-8fbc-3d35e4f4eec5, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-26  07:18:47,287    INFO    SocketModeClient    onClose listener is called (session id: 2bca4ca0-c151-44fa-b682-c06d28f4bb0d, reason: Closed abnormally.)
2023-12-26  07:18:47,293    INFO    SocketModeClient    onClose listener is called (session id: 5d6a11e3-c43f-402a-8ec2-0b404cd37061, reason: Closed abnormally.)
2023-12-26  07:18:47,293    INFO    SocketModeClient    onClose listener is called (session id: c8bcc0c0-2e65-47db-93e8-96ec713de914, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-26  07:18:47,313    INFO    SocketModeClient    New session is open (session id: ec9cd756-612e-4cd8-990b-95efb1663b21)
2023-12-26  07:18:47,323    INFO    SocketModeClient    New session is open (session id: 82d89012-35de-439a-8abc-dfab2be78b33)
2023-12-26  07:18:47,341    WARN    SocketModeClient    Failed to send a ping message (session id: ec9cd756-612e-4cd8-990b-95efb1663b21, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@5b3e1260 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@3c7857d2[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6])
2023-12-26  07:18:47,341    INFO    SocketModeClient    onClose listener is called (session id: afc3e2f5-87e6-43ce-8f45-af43180dfb50, reason: Closed abnormally.)
2023-12-26  07:18:47,350    INFO    SocketModeClient    New session is open (session id: 016d93c8-32c2-468d-ae22-d4392049cb0e)
2023-12-26  07:18:47,363    INFO    SocketModeClient    onClose listener is called (session id: 016d93c8-32c2-468d-ae22-d4392049cb0e, reason: Closed abnormally.)
2023-12-26  07:18:47,367    INFO    SocketModeClient    New session is open (session id: 6b613cef-0b24-4024-a942-3c7e6e9d6e78)
2023-12-26  07:18:47,379    INFO    SocketModeClient    onClose listener is called (session id: 6b613cef-0b24-4024-a942-3c7e6e9d6e78, reason: Closed abnormally.)
2023-12-26  07:18:47,388    WARN    SocketModeClient    Failed to send a ping message (session id: 6b613cef-0b24-4024-a942-3c7e6e9d6e78, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4f3307ba rejected from java.util.concurrent.ScheduledThreadPoolExecutor@3174b7b8[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4])
2023-12-26  07:18:47,409    INFO    SocketModeClient    New session is open (session id: 703f88e8-4b92-4995-8fc4-db7d8a0d1b96)
2023-12-26  07:18:47,424    INFO    SocketModeClient    New session is open (session id: dbcfeb0b-1807-4415-93d5-cf6a8a4fc088)
2023-12-26  07:18:47,439    INFO    SocketModeClient    New session is open (session id: b4dee05f-d687-4be2-a60e-46d739ed72f5)
2023-12-26  07:18:47,463    INFO    SocketModeClient    New session is open (session id: df2bbcda-1352-48b9-8521-d633fbf97979)
2023-12-26  07:18:47,482    INFO    SocketModeClient    New session is open (session id: 4003087e-b115-414c-b066-8524f4e37aa4)
2023-12-26  07:18:47,489    INFO    SocketModeClient    onClose listener is called (session id: 703f88e8-4b92-4995-8fc4-db7d8a0d1b96, reason: Closed abnormally.)
2023-12-26  07:18:47,518    INFO    SocketModeClient    onClose listener is called (session id: fb8ca60d-a961-4462-9a1f-8b1205e2f4bc, reason: Closed abnormally.)
2023-12-26  07:18:47,518    INFO    SocketModeClient    onClose listener is called (session id: 78d582ad-72ad-451d-9c18-c096909c1ca9, reason: Closed abnormally.)
2023-12-26  07:18:47,578    INFO    SocketModeClient    New session is open (session id: 20fd4a0d-80cf-4d0e-a19b-70283f023dca)
2023-12-26  07:18:47,582    WARN    SocketModeClient    Failed to send a ping message (session id: 20fd4a0d-80cf-4d0e-a19b-70283f023dca, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4a4a2600 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@60e0a1ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21])
2023-12-26  07:18:47,583    INFO    SocketModeClient    onClose listener is called (session id: 4003087e-b115-414c-b066-8524f4e37aa4, reason: Closed abnormally.)
2023-12-26  07:18:47,588    INFO    SocketModeClient    onClose listener is called (session id: 18e96bac-6a6e-4014-ab6b-90e8f5f830dc, reason: Closed abnormally.)
2023-12-26  07:18:47,594    INFO    SocketModeClient    New session is open (session id: fc2f3b5a-2005-4616-82e2-ad381fa03f1d)
2023-12-26  07:18:47,596    INFO    SocketModeClient    New session is open (session id: e5fe85df-dab6-44f3-9bf0-1a01687f2d85)
2023-12-26  07:18:47,605    INFO    SocketModeClient    onClose listener is called (session id: fc2f3b5a-2005-4616-82e2-ad381fa03f1d, reason: Closed abnormally.)
2023-12-26  07:18:47,608    INFO    SocketModeClient    onClose listener is called (session id: 20fd4a0d-80cf-4d0e-a19b-70283f023dca, reason: Closed abnormally.)
2023-12-26  07:18:47,611    INFO    SocketModeClient    New session is open (session id: 7414b180-4207-46c2-9521-464f7fe69d47)
2023-12-26  07:18:47,623    INFO    SocketModeClient    onClose listener is called (session id: 7414b180-4207-46c2-9521-464f7fe69d47, reason: Closed abnormally.)
2023-12-26  07:18:47,625    WARN    SocketModeClient    Failed to send a ping message (session id: 7414b180-4207-46c2-9521-464f7fe69d47, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@4cc0cda2 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@60e0a1ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21])
2023-12-26  07:18:47,625    INFO    SocketModeClient    onClose listener is called (session id: dbcfeb0b-1807-4415-93d5-cf6a8a4fc088, reason: Closed abnormally.)
2023-12-26  07:18:47,625    INFO    SocketModeClient    onClose listener is called (session id: c6156f44-fb07-4740-9bab-cf8ce7aa94d2, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-26  07:18:47,640    WARN    SocketModeClient    Failed to send a ping message (session id: dbcfeb0b-1807-4415-93d5-cf6a8a4fc088, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@1be817d0 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@60e0a1ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21])
2023-12-26  07:18:47,640    INFO    SocketModeClient    onClose listener is called (session id: b4dee05f-d687-4be2-a60e-46d739ed72f5, reason: Closed abnormally.)
2023-12-26  07:18:47,658    INFO    SocketModeClient    New session is open (session id: 0cf0f59f-7af7-4c05-a131-5abf3de8f256)
2023-12-26  07:18:47,664    WARN    SocketModeClient    Failed to send a ping message (session id: 0cf0f59f-7af7-4c05-a131-5abf3de8f256, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@3700aac3 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@60e0a1ed[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 21])
2023-12-26  07:18:47,664    INFO    SocketModeClient    onClose listener is called (session id: a2d2570f-6f28-4736-a809-cddefd280b9a, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-26  07:18:47,665    INFO    SocketModeClient    onClose listener is called (session id: df2bbcda-1352-48b9-8521-d633fbf97979, reason: Closed abnormally.)
2023-12-26  07:18:47,671    INFO    SocketModeClient    onClose listener is called (session id: 92a94abb-ed63-4522-9d3e-641307ed1135, reason: Closed abnormally.)
2023-12-26  07:18:47,678    INFO    SocketModeClient    New session is open (session id: 9c1d691c-a559-4119-b81e-75cf1bc011cb)
2023-12-26  07:18:47,684    INFO    SocketModeClient    New session is open (session id: f3bf115d-1366-4847-a061-0bc1d18bba16)
2023-12-26  07:18:47,715    INFO    SocketModeClient    onClose listener is called (session id: f3bf115d-1366-4847-a061-0bc1d18bba16, reason: Closed abnormally.)
2023-12-26  07:18:47,726    INFO    SocketModeClient    onClose listener is called (session id: 18a3f154-03e4-4318-881c-f72bd27bfffe, reason: Closed abnormally.)
2023-12-26  07:18:47,756    INFO    SocketModeClient    New session is open (session id: 6f6bbce9-2410-49e3-a078-cbb93b9aa56d)
2023-12-26  07:18:47,757    INFO    SocketModeClient    New session is open (session id: c3cdc391-3a21-4064-99c3-0990494dbd81)
2023-12-26  07:18:47,757    INFO    SocketModeClient    onClose listener is called (session id: b4734556-636c-4274-ad91-0b986f10f9ef, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-26  07:18:47,760    INFO    SocketModeClient    onClose listener is called (session id: 6f6bbce9-2410-49e3-a078-cbb93b9aa56d, reason: Closed abnormally.)
2023-12-26  07:18:47,767    ERROR   TeamIdCache Failed to call auth.test API (error: status: 429, error: ratelimited, needed: null, provided: null, warning: null)
2023-12-26  07:18:47,777    INFO    SocketModeClient    onClose listener is called (session id: f92629b1-aa13-4505-aee2-091f57562a5d, reason: Closed abnormally.)
2023-12-26  07:18:47,779    INFO    SocketModeClient    New session is open (session id: db10afd3-6080-407b-b1b5-fa21de04f4a3)
2023-12-26  07:18:47,780    INFO    SocketModeClient    New session is open (session id: b34d8ab8-6bdc-42c0-b0df-2f9a90b19ba9)
2023-12-26  07:18:47,784    ERROR   TeamIdCache Failed to call auth.test API (error: status: 429, error: ratelimited, needed: null, provided: null, warning: null)
2023-12-26  07:18:47,785    INFO    SocketModeClient    New session is open (session id: 30f23ea5-0795-468d-830f-3e1124a4ca34)
2023-12-26  07:18:47,800    INFO    SocketModeClient    New session is open (session id: 21881d5c-aa3c-443a-aa60-e1684b1c629a)
2023-12-26  07:18:47,810    INFO    SocketModeClient    New session is open (session id: 164975af-2081-4913-8afb-1021d142b45a)
2023-12-26  07:18:47,815    INFO    SocketModeClient    New session is open (session id: 16d0ac5a-de21-47d8-96c2-3da155a33323)
2023-12-26  07:18:47,821    INFO    SocketModeClient    New session is open (session id: dd2cc9c0-6bda-4f1c-906f-0f666cdcd5ac)
2023-12-26  07:18:47,847    INFO    SocketModeClient    New session is open (session id: 2956f48c-202f-47a1-a168-28a4ad6b67ab)
2023-12-26  07:18:47,855    INFO    SocketModeClient    New session is open (session id: 2ef52f16-4ea3-4fe3-8181-c2dccb834901)
2023-12-26  07:18:48,054    ERROR   SocketModeClient    Failed to reconnect to the Socket Mode server: Failed to connect to the Socket Mode API endpoint. (status: 429, error: ratelimited)
2023-12-26  07:18:47,967    WARN    SocketModeClient    Failed to send a ping message (session id: 16d0ac5a-de21-47d8-96c2-3da155a33323, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@2e7f8443 rejected from java.util.concurrent.ScheduledThreadPoolExecutor@28e00124[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 6])
2023-12-26  07:18:47,916    INFO    SocketModeClient    onClose listener is called (session id: 16d0ac5a-de21-47d8-96c2-3da155a33323, reason: Closed abnormally.)
2023-12-26  07:18:47,901    INFO    SocketModeClient    onClose listener is called (session id: 21881d5c-aa3c-443a-aa60-e1684b1c629a, reason: Closed abnormally.)
2023-12-26  07:18:47,900    INFO    SocketModeClient    New session is open (session id: b71b8e17-d9cd-40b0-87e3-fee4838b250f)
2023-12-26  07:18:48,064    INFO    SocketModeClient    onClose listener is called (session id: 11730611-0a5c-4978-bee2-cd1f6fc19af7, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-26  07:18:48,083    INFO    SocketModeClient    New session is open (session id: 42ea5a5e-39f4-4a57-a18a-d2b8c5ebc84b)
2023-12-26  07:18:48,121    INFO    SocketModeClient    New session is open (session id: 2e8d5e7a-589b-47aa-b600-0a2fa5491af3)
2023-12-26  07:21:52,879    INFO    VIASchedulerLauncher    Start arguments: via
2023-12-26  07:21:52,880    INFO    VIASchedulerLauncher    Run Quartz Application standalone: false
2023-12-26  07:21:52,914    ERROR   STDOUT  SLF4J: Class path contains multiple SLF4J bindings.
2023-12-26  07:21:52,914    ERROR   STDOUT  SLF4J: Found binding in [jar:file:/home/vitria/oi/wildfly/modules/com/vitria/thirdparty/slf4j/main/slf4j-log4j12-1.5.11.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2023-12-26  07:21:52,914    ERROR   STDOUT  SLF4J: Found binding in [jar:file:/home/vitria/oi/VIAOps/llm/jar/slack/slf4j-simple-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
2023-12-26  07:21:52,914    ERROR   STDOUT  SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
2023-12-26  07:21:52,979    INFO    MLog    MLog clients using log4j logging.
2023-12-26  07:21:53,084    INFO    C3P0Registry    Initializing c3p0-0.9.1.2 [built 21-May-2007 15:04:56; debug? true; trace: 10]
2023-12-26  07:21:53,123    INFO    StdSchedulerFactory Using default implementation for ThreadExecutor
2023-12-26  07:21:53,142    INFO    SchedulerSignalerImpl   Initialized Scheduler Signaller of type: class org.quartz.core.SchedulerSignalerImpl
2023-12-26  07:21:53,142    INFO    QuartzScheduler Quartz Scheduler v.2.3.0-SNAPSHOT created.
2023-12-26  07:21:53,143    INFO    VIAXMLSchedulingDataProcessorPlugin Registering Quartz Job Initialization Plug-in.
2023-12-26  07:21:53,143    INFO    JobStoreTX  Using thread monitor-based data access locking (synchronization).
2023-12-26  07:21:53,144    INFO    JobStoreTX  JobStoreTX initialized.
2023-12-26  07:21:53,145    INFO    QuartzScheduler Scheduler meta-data: Quartz Scheduler (v2.3.0-SNAPSHOT) 'vt0e651f92-7297-4f1e-a819-a67ae9364a97__GenAI_Assistant$Deployment' with instanceId 'NON_CLUSTERED'
  Scheduler class: 'org.quartz.core.QuartzScheduler' - running locally.
  NOT STARTED.
  Currently in standby mode.
  Number of jobs executed: 0
  Using thread pool 'org.quartz.simpl.SimpleThreadPool' - with 40 threads.
  Using job-store 'org.quartz.impl.jdbcjobstore.JobStoreTX' - which supports persistence. and is not clustered.
2023-12-26  07:21:53,145    INFO    StdSchedulerFactory Quartz scheduler 'vt0e651f92-7297-4f1e-a819-a67ae9364a97__GenAI_Assistant$Deployment' initialized from specified file: '/home/vitria/oi/runtime/apps/_appdata_job__prj_vt0e651f92-7297-4f1e-a819-a67ae9364a97__GenAI_Assistant-Deployment/conf/quartz.properties'
2023-12-26  07:21:53,145    INFO    StdSchedulerFactory Quartz scheduler version: 2.3.0-SNAPSHOT
2023-12-26  07:21:53,291    INFO    JmxManager  Register JMX Service to : service:jmx:rmi://0.0.0.0:33692/jndi/rmi://via1:36211/jmxrmi mbean name = com.vitria:service=jobAppMonitor,name=_appdata_job__prj_vt0e651f92-7297-4f1e-a819-a67ae9364a97__GenAI_Assistant-Deployment with url: service:jmx:rmi://0.0.0.0:39133/jndi/rmi://via1:43316/jmxrmi  and mbean: com.vitria:service=JobDriverMonitor
2023-12-26  07:21:53,344    INFO    VIASchedulerLauncher    ------- Initialization Complete -----------
2023-12-26  07:21:53,344    INFO    VIASchedulerLauncher    ------- Not Scheduling any Jobs - relying on XML definitions --
2023-12-26  07:21:53,344    INFO    VIASchedulerLauncher    ------- Starting Scheduler ----------------
2023-12-26  07:21:53,777    INFO    AbstractPoolBackedDataSource    Initializing c3p0 pool... com.mchange.v2.c3p0.ComboPooledDataSource [ acquireIncrement -> 3, acquireRetryAttempts -> 30, acquireRetryDelay -> 1000, autoCommitOnClose -> false, automaticTestTable -> null, breakAfterAcquireFailure -> false, checkoutTimeout -> 0, connectionCustomizerClassName -> null, connectionTesterClassName -> com.mchange.v2.c3p0.impl.DefaultConnectionTester, dataSourceName -> 2tu6hvb019s5i0k12bxij3|256216b3, debugUnreturnedConnectionStackTraces -> false, description -> null, driverClass -> com.mysql.jdbc.Driver, factoryClassLocation -> null, forceIgnoreUnresolvedTransactions -> false, identityToken -> 2tu6hvb019s5i0k12bxij3|256216b3, idleConnectionTestPeriod -> 50, initialPoolSize -> 3, jdbcUrl -> jdbc:mysql://master:3306/via_dev?useSSL=false, maxAdministrativeTaskTime -> 0, maxConnectionAge -> 0, maxIdleTime -> 0, maxIdleTimeExcessConnections -> 0, maxPoolSize -> 5, maxStatements -> 0, maxStatementsPerConnection -> 120, minPoolSize -> 1, numHelperThreads -> 3, numThreadsAwaitingCheckoutDefaultUser -> 0, preferredTestQuery -> select 0, properties -> {user=******, password=******}, propertyCycle -> 0, testConnectionOnCheckin -> true, testConnectionOnCheckout -> false, unreturnedConnectionTimeout -> 0, usesTraditionalReflectiveProxies -> false ]
2023-12-26  07:21:54,140    INFO    VIASchedulerLauncher    Find existed action 'main.CheckSlack' (unchanged).
2023-12-26  07:21:54,162    INFO    VIASchedulerLauncher    Find existed trigger 'main.once_CheckSlack' (unchanged).
2023-12-26  07:21:54,165    INFO    VIASchedulerLauncher    Remove old trigger: RECOVERING_JOBS.recover_NON_CLUSTERED_1703388117455
2023-12-26  07:21:54,177    INFO    JobStoreTX  Freed 1 triggers from 'acquired' / 'blocked' state.
2023-12-26  07:21:54,179    INFO    JobStoreTX  Handling 2 trigger(s) that missed their scheduled fire-time.
2023-12-26  07:21:54,204    INFO    JobStoreTX  Recovering 1 jobs that were in-progress at the time of the last shut-down.
2023-12-26  07:21:54,213    INFO    JobStoreTX  Recovery complete.
2023-12-26  07:21:54,214    INFO    JobStoreTX  Removed 0 'complete' triggers.
2023-12-26  07:21:54,215    INFO    JobStoreTX  Removed 1 stale fired job entries.
2023-12-26  07:21:54,245    INFO    VIAXMLSchedulingDataProcessor   Parsing XML file: /home/vitria/oi/runtime/apps/_appdata_job__prj_vt0e651f92-7297-4f1e-a819-a67ae9364a97__GenAI_Assistant-Deployment/conf/quartz_data.xml with systemId: /home/vitria/oi/runtime/apps/_appdata_job__prj_vt0e651f92-7297-4f1e-a819-a67ae9364a97__GenAI_Assistant-Deployment/conf/quartz_data.xml
2023-12-26  07:21:54,526    INFO    VIAXMLSchedulingDataProcessor   Adding 1 jobs, 1 triggers.
2023-12-26  07:21:54,531    INFO    VIAXMLSchedulingDataProcessor   Not overwriting existing job: main.CheckSlack
2023-12-26  07:21:54,536    INFO    VIAXMLSchedulingDataProcessor   Not overwriting existing trigger: main.once_CheckSlack
2023-12-26  07:21:54,536    INFO    QuartzScheduler Scheduler vt0e651f92-7297-4f1e-a819-a67ae9364a97__GenAI_Assistant$Deployment_$_NON_CLUSTERED started.
2023-12-26  07:21:54,538    INFO    VIASchedulerLauncher    ------- Started Scheduler -----------------
2023-12-26  07:21:54,538    INFO    VIASchedulerLauncher    Executed 0 jobs.
2023-12-26  07:21:54,571    INFO    BaseJob main.CheckSlack is triggered.
2023-12-26  07:21:54,767    INFO    GenAIJobScheduler   Start --->
2023-12-26  07:21:54,768    INFO    GenAIJobScheduler   Loading email/rules for GenAI_Assistant from presentation server, wait ......
2023-12-26  07:21:54,785    INFO    RestClientUtil  Create trustAll certificate manager.
2023-12-26  07:21:55,264    INFO    GenAIJobScheduler   Finish the loading of email/rules, spend 497 ms.
2023-12-26  07:21:55,264    INFO    GenAIJobScheduler   meta: xxx
2023-12-26  07:21:56,595    INFO    SocketModeClient    New session is open (session id: 416022d2-1c2a-4d01-834a-2cc1a4832a8a)
2023-12-26  07:22:54,408    INFO    SlackReceiver   [satisfy] Receiving message 'show me list of incidents between 9am and 1pm' from U069MA8HE1W
2023-12-26  07:22:54,408    INFO    SlackReceiver   Thread pool, active threads: 0, core threads: 80, queue size: 0
2023-12-26  07:22:54,409    INFO    SlackReceiver   Put message into queue, queue size: 0
2023-12-26  07:22:54,410    INFO    Rule    Execute {"action":"ask","agent":"bedrock","prompt":"switcher","input":"$message.current","output":"switcher_output"}
2023-12-26  07:22:54,412    INFO    AskAgentAction  [execute][bedrock/switcher] --> show me list of incidents between 9am and 1pm
seratch commented 8 months ago

Thanks for your reply. If that's the case, it could possibly be a bug / lack of consideration for a specific scenario on the underlying WebSocket client library, but I am not sure about the details. Go ahead with a more robust healthcheck to keep your app running more stably!

oppokui commented 8 months ago

@seratch, it is difficult to debug the code as the disconnection happen every 5 hours. I want to reproduce this issue and verify my code change more quickly.

I saw "https://api.slack.com/apis/connections/socket#connections" mentioned one uri parameter "&debug_reconnects=true" to shorten the session to 6 minutes. I tried it, but didn't work.

In buildSocketModeClientFactory() method of SocketModeApp.java, I add one line to append the parameter after client is created. Recompile and test. The wss URL is changed at connecting, but sounds not working.

client.setWssUri(new URI(client.getWssUri().toString() + "&debug_reconnects=true"));

Do you have any better idea?

oppokui commented 8 months ago

"&debug_reconnects=true" actually works well, but I can't reproduce the problem after adding this parameter in bolt-socket-mode. The following diff is the patch to expose flag in SocketModeApp.java to control the debug_reconnects. The session will be recreated every 5 or 6 minutes.

diff --git a/bolt-socket-mode/src/main/java/com/slack/api/bolt/socket_mode/SocketModeApp.java b/bolt-socket-mode/src/main/java/com/slack/api/bolt/socket_mode/SocketModeApp.java
index 57eb6fa8..bdd66281 100644
--- a/bolt-socket-mode/src/main/java/com/slack/api/bolt/socket_mode/SocketModeApp.java
+++ b/bolt-socket-mode/src/main/java/com/slack/api/bolt/socket_mode/SocketModeApp.java
@@ -174,18 +174,23 @@ public class SocketModeApp {
     // -------------------------------------------

     public void start() throws Exception {
-        run(true);
+        run(true, false);
     }

     public void startAsync() throws Exception {
-        run(false);
+        run(false, false);
     }

     public void run(boolean blockCurrentThread) throws Exception {
+        run(blockCurrentThread, false);
+    }
+
+    public void run(boolean blockCurrentThread, boolean debug) throws Exception {
         this.app.start();
         if (this.client == null) {
             this.client = clientFactory.get();
         }
+        this.client.setDebugMode(debug);
         if (this.isClientStopped()) {
             this.client.connectToNewEndpoint();
         } else {
diff --git a/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java b/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
index 8891d3e6..a5c65f90 100644
--- a/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
+++ b/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
@@ -62,7 +62,11 @@ public interface SocketModeClient extends Closeable {
      */
     default void connectToNewEndpoint() throws IOException {
         try {
-            setWssUri(new URI(getSlack().issueSocketModeUrl(getAppToken())));
+            if(getDebugMode()) {
+                setWssUri(new URI(getSlack().issueSocketModeUrl(getAppToken()) + "&debug_reconnects=true"));
+            } else {
+                setWssUri(new URI(getSlack().issueSocketModeUrl(getAppToken())));
+            }
             connect();
         } catch (URISyntaxException e) {
             throw new IOException(e);
@@ -112,6 +116,9 @@ public interface SocketModeClient extends Closeable {

     void setWssUri(URI wssUri);

+    boolean getDebugMode();
+    void setDebugMode(boolean mode);
+
     /**
      * Tries to reconnect to the Socket Mode server if true.
      */
diff --git a/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientJavaWSImpl.java b/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientJavaWSImpl.java
index 618c3915..76525c9f 100644
--- a/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientJavaWSImpl.java
+++ b/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientJavaWSImpl.java
@@ -165,7 +165,11 @@ public class SocketModeClientJavaWSImpl implements SocketModeClient {
     @Override
     public void connectToNewEndpoint() throws IOException {
         try {
-            setWssUri(new URI(getSlack().issueSocketModeUrl(getAppToken())));
+            if(getDebugMode()) {
+                setWssUri(new URI(getSlack().issueSocketModeUrl(getAppToken()) + "&debug_reconnects=true"));
+            } else {
+                setWssUri(new URI(getSlack().issueSocketModeUrl(getAppToken())));
+            }
         } catch (URISyntaxException e) {
             throw new IOException(e);
         }
@@ -289,6 +293,15 @@ public class SocketModeClientJavaWSImpl implements SocketModeClient {
     }

     // ----------------------------------------------------
+    boolean debugMode = false;
+    @Override
+    public boolean getDebugMode() {
+        return debugMode;
+    }
+    @Override
+    public void setDebugMode(boolean mode) {
+        debugMode = mode;
+    }

     @Override
     public Slack getSlack() {
diff --git a/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientTyrusImpl.java b/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientTyrusImpl.java
index 1ff76ef0..984e62ee 100644
--- a/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientTyrusImpl.java
+++ b/slack-api-client/src/main/java/com/slack/api/socket_mode/impl/SocketModeClientTyrusImpl.java
@@ -297,6 +297,16 @@ public class SocketModeClientTyrusImpl implements SocketModeClient {
         }
     }

+    boolean debugMode = false;
+    @Override
+    public boolean getDebugMode() {
+        return debugMode;
+    }
+    @Override
+    public void setDebugMode(boolean mode) {
+        debugMode = mode;
+    }
+
     // ----------------------------------------------------

     @Override

After it, I can use the following simple code to enable the debug_reconnects, but can't reproduce the original problem. It sounds the issue only happen in long running connection expiration.

App app = new App(AppConfig.builder().singleTeamBotToken(botToken).build());
app.event(AppMentionEvent.class, new BoltEventHandler<AppMentionEvent> () {
   ...
});
new SocketModeApp(SLACK_APP_TOKEN, app).run(true, true);

After reviewing the code, I saw there are too many paths to handle session recreation:

a) When slack server sends "disconnnect" message before session timeout/expired (5 hours), it will trigger connectToNewEndpoint() in SocketModeClient.processMessage(). A new session will be created and old session will be closed. It is the main happy path in my long-running case.

b) From the top, SocketModeClient.initializeSessionMonitorExecutor() will invoke maintainCurrentSession() every 5 seconds. If "verifyConnection()" return false, it will create a new Session. It is the session monitor thread, to handle unhappy case like network issue. It makes sense.

c) on Session.onClose() path, SocketModeClient.runCloseListenersAndAutoReconnectAsNecessary() will be triggered. After invoking each CloseListener, it will try to connectToNewEndpoint additionally. I don't understand this logic. It sounds duplicated which will cause race condition in multi threads with #a/#b. An old session is closed but it try to create a new session. A new session was already created in #a or #b but maybe not ready yet, onClose() logic on old session will still check the currentSession in verifyConnection(), if return false, it will recreate a new session.

The issue only happened when #a is triggered. If I shorten the expiration time from 5 hours to 6 minutes, it won't happen easily. #c is buggy, but I don't have enough time to prove it. I attach the fix of #c as follows.

diff --git a/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java b/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
index 8891d3e6..a779e2fa 100644
--- a/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
+++ b/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java

@@ -256,13 +263,13 @@ public interface SocketModeClient extends Closeable {
         for (WebSocketCloseListener listener : getWebSocketCloseListeners()) {
             listener.handle(code, reason);
         }
-        if (isAutoReconnectEnabled() && !verifyConnection()) {
+        /*if (isAutoReconnectEnabled() && !verifyConnection()) {
             try {
                 connectToNewEndpoint();
             } catch (IOException e) {
                 getLogger().error("Failed to reconnect to the Socket Mode server: {}", e.getMessage(), e);
             }
-        }
+        }*/
     }

My solution for this issue is to explicitly recreate session every 2 hours, so slack server won't send "disconnect" message to trigger the problem. I will run it for 2 days as longevity test. If it works, I will use this way.

public class SlackMonitor {

    private static Log log = LogFactory.getLog(SlackMonitor.class);

    private SocketModeApp socketApp;
    public SlackMonitor(SocketModeApp socketApp) {
        this.socketApp = socketApp;
    }
    public void watch() {
        Timer timer = new Timer();
        long delay = 60 * 2 * 60 * 1000; // 2 hours
        timer.schedule(new TimerTask() {
            public void run() {
                try {
                    log.info("Reconnect slack server automatically every 2 hours in VIA side");
                    socketApp.getClient().connectToNewEndpoint();
                } catch (Exception e) {
                    log.warn("Fail to close slack socket app " + e.getMessage());
                }
            }
        }, delay, delay);
    }
}
    SocketModeApp socketApp = new SocketModeApp(appToken, app);
    new SlackMonitor(socketApp).watch();
    socketApp.start();
oppokui commented 8 months ago

Another possible solution could be to run multiple instances with health check monitoring. You can check if SocketModeApp#getClient()#verifyConnection() returns true. If it returns false, you monitoring system can terminate the process and spin up a new one. You can run up to 10 instances for a Socket Mode app.

SocketModeClient already had a monitor thread to check "verifyConnection()" and then recreate session. The external app should NOT check it again.

oppokui commented 8 months ago

Sorry, the solution to reconnect every 2 hours won't work. The issue will still happen after 16 hours. The root cause was still there and triggered intermittently.

See the following log. The session was initially created at 12/29 03:13, then recreated every 2 hours. It last 16 hours, then crashed at 19:13 at the same day.

@seratch, can you run longevity test in your side for SocketModeApp.java? It should be easy to reproduce if you run it for one day.

2023-12-29 03:13:40,041 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 69156389-a5dd-422c-8550-cc4ba93d4d1a)
...
2023-12-29 05:13:39,880 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 05:13:40,024 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 438db21b-14ee-4b22-9679-4110c00dfd79)
2023-12-29 05:13:40,126 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 69156389-a5dd-422c-8550-cc4ba93d4d1a, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 06:02:21,366 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 438db21b-14ee-4b22-9679-4110c00dfd79, reason: Closed abnormally.)
2023-12-29 06:02:21,522 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 28f8b032-28b7-4ceb-9bb9-15d2adb17695)
2023-12-29 07:13:39,880 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 07:13:40,023 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: e836cedd-fe4d-4279-bfd6-4a658042e169)
2023-12-29 07:13:40,126 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 28f8b032-28b7-4ceb-9bb9-15d2adb17695, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 09:13:39,880 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 09:13:40,010 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: f06d8e9b-41a5-49c9-9c31-3133d8818344)
2023-12-29 09:13:40,113 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: e836cedd-fe4d-4279-bfd6-4a658042e169, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 11:13:39,880 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 11:13:40,047 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: a98994ec-5722-4951-be97-cd07a5cf56cd)
2023-12-29 11:13:40,149 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: f06d8e9b-41a5-49c9-9c31-3133d8818344, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 13:13:39,880 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 13:13:40,016 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: b96b4bc5-3c53-410e-9971-cd2cd49eb0d3)
2023-12-29 13:13:40,118 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: a98994ec-5722-4951-be97-cd07a5cf56cd, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 15:13:39,880 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 15:13:40,023 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 06c11528-cbe3-49d9-80a0-722296968793)
2023-12-29 15:13:40,124 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: b96b4bc5-3c53-410e-9971-cd2cd49eb0d3, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 17:13:39,881 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 17:13:40,038 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 12da201f-1f8b-4da8-ab02-004f1ef3f0d5)
2023-12-29 17:13:40,142 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 06c11528-cbe3-49d9-80a0-722296968793, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 19:13:39,886 INFO  [com.vitria.app.genai.slack.impl.SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-29 19:13:40,029 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 92fdbd2a-5d57-48e3-89e4-2a50fb82dbdd)
2023-12-29 19:13:40,131 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 12da201f-1f8b-4da8-ab02-004f1ef3f0d5, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-29 19:13:40,138 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 92fdbd2a-5d57-48e3-89e4-2a50fb82dbdd, reason: Closed abnormally.)
2023-12-29 19:13:40,212 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 2fe6658a-24c4-4bb6-9ff7-5d4fe74201be)
2023-12-29 19:13:40,226 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 2fe6658a-24c4-4bb6-9ff7-5d4fe74201be, reason: Closed abnormally.)
2023-12-29 19:13:40,238 WARN  [com.slack.api.socket_mode.SocketModeClient] Failed to send a ping message (session id: 2fe6658a-24c4-4bb6-9ff7-5d4fe74201be, error: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@53e7b2c rejected from java.util.concurrent.ScheduledThreadPoolExecutor@4df208f9[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4])
2023-12-29 19:13:40,281 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 9e594966-e209-443c-8d86-e327ac3c35ab)
2023-12-29 19:13:40,297 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 9e594966-e209-443c-8d86-e327ac3c35ab, reason: Closed abnormally.)
2023-12-29 19:13:40,310 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 977ca833-6969-4c74-b3c5-a7747aab9695)
2023-12-29 19:13:40,324 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 977ca833-6969-4c74-b3c5-a7747aab9695, reason: Closed abnormally.)
2023-12-29 19:13:40,356 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: f762a28a-bf18-48da-9ac8-d05fb608ff0a)
2023-12-29 19:13:40,380 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 671bdcf6-c07e-4962-89ec-68ac3297a71b)
2023-12-29 19:13:40,392 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: f762a28a-bf18-48da-9ac8-d05fb608ff0a, reason: Closed abnormally.)
2023-12-29 19:13:40,410 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 67c1c34e-d2a7-46eb-a90a-c3603374c6a7)
2023-12-29 19:13:40,454 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 7d7b524d-3a80-46c8-a2b4-5fc259346e51)
2023-12-29 19:13:40,502 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: b128be88-46ec-4aae-95ce-ee8115f06298)
2023-12-29 19:13:40,533 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 4063f1cc-ad64-458c-85db-d18d7f5f77bf)
watertree commented 7 months ago

I am experiencing the same problem on a bot I am running in Google Cloud. It is running bolt-socket-mode 1.16.0 and was running flawlessly for about two years. Since the beginning of October 2023 it has been prone to this issue -- I had suspected that something in the infrastructure of gcloud or slack could have changed where the fault-tolerance of the underlying libraries can't handle disconnections well. Since @oppokui is running in AWS I'm wondering if there was some change in Slack infrastructure (like on an L7 proxy)

filmaj commented 7 months ago

Hey everyone, sorry for the lack of response from Slack over the holidays, we were all, well, on holidays.

Given the multiple reports from different people about this same issue (see also #1223), I will raise this internally with our real-time server team (responsible for operating the websocket endpoints). This will likely take some time, so your patience is appreciated 🙏

filmaj commented 7 months ago

The one concerning and consistent fact in this issue and in #1223 is that the connection is closed due to an 'abnormal closure'. I believe that this happens when no explicit Close frame is sent by a peer - that is, some manner of underlying network connection problem caused the connection to be severed. Given the variety of underlying infrastructure the app deployments are in where these error reports have come from (both AWS and Google hosted Slack apps), I am worried that some lower level network infrastructure change on Slack's side has introduced this intermittent problem. I will provide this information to our RTS team and begin an investigation.

oppokui commented 7 months ago

Thank you for the comment, @watertree and @filmaj. I have more update on this issue.

I mentioned 3 paths of session recreating in previous comments, and then I start to verify #c (Session.onClose event): Whether the issue will be gone if I comment out "connectToNewEndpoint()" at onClose() handler.

diff --git a/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java b/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
index 8891d3e6..1dd3b62b 100644
--- a/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
+++ b/slack-api-client/src/main/java/com/slack/api/socket_mode/SocketModeClient.java
@@ -256,13 +256,15 @@ public interface SocketModeClient extends Closeable {
         for (WebSocketCloseListener listener : getWebSocketCloseListeners()) {
             listener.handle(code, reason);
         }
-        if (isAutoReconnectEnabled() && !verifyConnection()) {
-            try {
-                connectToNewEndpoint();
-            } catch (IOException e) {
-                getLogger().error("Failed to reconnect to the Socket Mode server: {}", e.getMessage(), e);
-            }
-        }
+        // https://github.com/slackapi/java-slack-sdk/issues/1256
+        // https://github.com/slackapi/java-slack-sdk/issues/1223
+        //if (isAutoReconnectEnabled() && !verifyConnection()) {
+        //    try {
+        //        connectToNewEndpoint();
+        //    } catch (IOException e) {
+        //        getLogger().error("Failed to reconnect to the Socket Mode server: {}", e.getMessage(), e);
+        //    }
+        //}
     }

     // ---------------------------------

I comment out the code and re-compile to get a new jar: slack-api-client-1.36.1.jar. Then re-run socket bolt app. I still have the client logic to force the reconnection every 2 hours (it won't change anything for this bug).

A good news is that the app run for more than 4 days, no problem any more. See the following log file: app.2023-12-30.log app.2023-12-31.log app.2024-01-01.log app.2024-01-02.log app.log

When analyzing the logs, I can see session recreation will just have 2 lines most of time, one "New session is open xxx", another "onClose listener xxx".

2023-12-31 11:47:37,936 INFO  [SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-31 11:47:38,067 INFO  [SocketModeClient] New session is open (session id: 2ea8a543-86ba-4aa5-984c-7a2fae8da376)
2023-12-31 11:47:38,168 INFO  [SocketModeClient] onClose listener is called (session id: b680f3df-ca57-470a-93f7-e683809eabf2, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-31 13:47:37,936 INFO  [SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-31 13:47:38,089 INFO  [SocketModeClient] New session is open (session id: 04beaf4b-ab07-40e6-8f45-fc8c767cc5ee)
2023-12-31 13:47:38,191 INFO  [SocketModeClient] onClose listener is called (session id: 2ea8a543-86ba-4aa5-984c-7a2fae8da376, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2023-12-31 15:47:37,936 INFO  [SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2023-12-31 15:47:38,084 INFO  [SocketModeClient] New session is open (session id: 33bbcb58-1fcd-4d14-bc95-c70c094519c8)
2023-12-31 15:47:38,187 INFO  [SocketModeClient] onClose listener is called (session id: 04beaf4b-ab07-40e6-8f45-fc8c767cc5ee, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)

But sometime, the session creation is not complete and retry several times, I called it "shaking" (not recreating a session once and done, but fail then recreate, and fail again and then recreate, until the session is completely created). See log around 2024-1-1 11:47, my SlackMonitor trigger a session recreation (every 2 hours), there were 5 sessions created but closed abnormally in the same second.

I assume this "shaking" sessions will cause race-condition in multi thread before my change: onClose() code will also try to recreate a new session. Then there are lot of sessions re-created by different logic and jump to the dead loop. The rate-limit error in previous discussion will be raised as slack server will prevent so quick session recreation, then cause the client retry worse.

2024-01-01 11:47:37,939 INFO  [SlackMonitor] Reconnect slack server automatically every 2 hours in VIA side
2024-01-01 11:47:38,095 INFO  [SocketModeClient] New session is open (session id: b3383e62-2d05-4c7f-91f7-062a53f9f47d)
2024-01-01 11:47:38,197 INFO  [SocketModeClient] onClose listener is called (session id: af9d5979-a71f-46c9-b2c9-2b215f9c94d9, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2024-01-01 11:47:38,200 INFO  [SocketModeClient] onClose listener is called (session id: b3383e62-2d05-4c7f-91f7-062a53f9f47d, reason: Closed abnormally.)
2024-01-01 11:47:38,253 INFO  [SocketModeClient] New session is open (session id: 70829926-d9b2-4398-a833-c4d4d1c3132b)
2024-01-01 11:47:38,266 INFO  [SocketModeClient] onClose listener is called (session id: 70829926-d9b2-4398-a833-c4d4d1c3132b, reason: Closed abnormally.)
2024-01-01 11:47:38,361 INFO  [SocketModeClient] New session is open (session id: dca504f5-3b0d-48c1-8007-7aa840931415)
2024-01-01 11:47:38,373 INFO  [SocketModeClient] onClose listener is called (session id: dca504f5-3b0d-48c1-8007-7aa840931415, reason: Closed abnormally.)
2024-01-01 11:47:38,426 INFO  [SocketModeClient] New session is open (session id: 28a70c00-d8c0-496b-9234-c48338330f9d)
2024-01-01 11:47:38,437 INFO  [SocketModeClient] onClose listener is called (session id: 28a70c00-d8c0-496b-9234-c48338330f9d, reason: Closed abnormally.)
2024-01-01 11:47:38,498 INFO  [SocketModeClient] New session is open (session id: 7a52155e-39e0-48c2-82b5-6d435740ed32)
2024-01-01 11:47:38,512 INFO  [SocketModeClient] onClose listener is called (session id: 7a52155e-39e0-48c2-82b5-6d435740ed32, reason: Closed abnormally.)
2024-01-01 11:47:38,576 INFO  [SocketModeClient] New session is open (session id: f3ac468c-f524-4629-9add-6bbdd063022c)

In summary, the session recreation in SocketModeClient.runCloseListenersAndAutoReconnectAsNecessary() is duplicated and the root cause of both #1223 and #1256.

@watertree, can you try this workaround in your case? Hope it will resove your issue too. @filmaj , can you review the code change and also run longevity in your testing env? hope the fix will be availabe in v1.36.2 or v1.37.0.

filmaj commented 7 months ago

@oppokui I have started a longevity test internally on an AWS EC2 instance w/ the latest released bolt-socket-mode dependency and using the Tyrus websocket backend. I will report back tomorrow with the results to see if the abnormal termination can be reproduced. If I can reproduce the issue, then at that point I can more closely examine your suggested code changes to see if it can help address the issue.

Alternatively, if a test case can be crafted that exhibits the bad reconnection behaviour, that would suffice as well.

Related, the RTS team at Slack has made a change for some underlying infrastructure powering the websocket servers at Slack in the last hour. We are not 100% sure that it will address the abnormal connection terminations, but figured I would mention it.

watertree commented 7 months ago

I definitely won't have the bandwidth to patch for the workaround since my app hasn't been deployed for months and I just put monitoring that should effectively mitigate the problem. However I will note that around the same time frame you mentioned your infrastructure team made a change I saw a blip and it almost immediately recovered (unlike the usual pattern of permanently getting caught in a loop). In the visualization below I am counting the number of log entries have the text "New session is open" in a 1 minute period. Around 1:22PM Eastern US the threshold of 5 was crossed and it immediately returned. Just thought I should mention this since if your long running tests are successful the changes could have been a contributing factor.

image

filmaj commented 7 months ago

Thanks a lot for that data point! That is very helpful. 🤞 that change had a positive impact.

filmaj commented 7 months ago

Update on my end: after about 24 hours running a bolt-java app using socket mode in AWS us-east-1 on an EC2 instance, I was unable to reproduce the abnormal connection closure. My app did reconnect a few times over the course of the 24 hours, but was able to immediately reconnect and recover. Here are the INFO level logs of the past 24 hours:

Successfully started process 'command '/usr/lib/jvm/java-18-amazon-corretto/bin/java''
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: bb16d8f0-b32f-4ebc-ac25-a3b1f6af120c)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: 8c356581-5840-4faa-855a-fae1e7633501)
[socket-mode-session-cleaner-worker-57] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: bb16d8f0-b32f-4ebc-ac25-a3b1f6af120c, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: a2394c1c-29c8-4968-9a2e-19be87f590e5)
[socket-mode-session-cleaner-worker-76] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: 8c356581-5840-4faa-855a-fae1e7633501, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: 148fc07c-b45e-4d82-b45e-d75f8ac21ae1)
[socket-mode-session-cleaner-worker-96] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: a2394c1c-29c8-4968-9a2e-19be87f590e5, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: 1517d3bc-eaaa-4cc1-b16d-e6b945f38f48)
[socket-mode-session-cleaner-worker-57] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: 148fc07c-b45e-4d82-b45e-d75f8ac21ae1, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)

I am hesitant to adjust the reconnection logic without being able to reproduce the abnormal closure. Next week, the main developer of this project, @seratch, should be available to provide his opinion on @oppokui's suggestion on changing this logic. From my perspective, I am not sure it is necessary.

watertree commented 7 months ago

Last 24 hours, looks like it reconnects every 5 hours. There are two blips because I have a deployment with two pods but both recovered. I just put the monitoring just before the infrastructure change so I don't have a valid historical comparison. Also note, in my environment the issue is not super regular (I wouldn't expect it to occur in 1 day of operation, it would start failing anywhere between 5-15 days). So in my case I would need to see no issues for the entire month before I would feel confident that the Slack infrastructure was the root cause -- as I mentioned before it ran for almost two years hardly needing any restarts.

image

filmaj commented 7 months ago

@watertree did the blips you mentioned require you to restart the Bolt process or did it recover on its own?

Interrupted connections when using socket mode are expected. Long lived connections with Slack over socket mode should not be expected to be able to last a very long time. I would personally not view a disconnect even a few times a day as faulty behaviour. As long as Bolt is able to re-establish a connection in a short time, I view that as proper functioning.

If a resilient connection is desired, I would recommend using HTTP over socket mode. With HTTP, any connection between Slack and your app is short-lived, just for the duration of the event delivery. That makes it less susceptible to network connection inconsistencies: the most resilient connection is one that does not exist when it is unused.

If socket mode is desired regardless, another way to make your app more resilient is to operate multiple instances of your app. Slack allows up to 10 simultaneous socket mode connections per app, so Bolt on socket mode can be scaled up to 10 instances. The Slack backend will round-robin deliveries across all open socket mode connections for a single app.

oppokui commented 7 months ago

I will creat a clean EC2 environment and simple slack java code in us-east-1, to reproduce it. Ping you one day later.

filmaj commented 7 months ago

Update on my end: after 2 days of running the app, a few additional logs showed up. In fact, one abnormal closure was reported! However, the app quickly recovered. Here are the new logs compared to what I posted yesterday:

[Grizzly(2)] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: 1517d3bc-eaaa-4cc1-b16d-e6b945f38f48, reason: Closed abnormally.)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: acfe9908-ef7d-47fa-9a8b-ff44d153957b)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: 30f94da5-90fe-427d-b797-889922ec9c97)
[socket-mode-session-cleaner-worker-96] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: acfe9908-ef7d-47fa-9a8b-ff44d153957b, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: 67ed7770-2072-4eda-94a2-841277df02c3)
[socket-mode-session-cleaner-worker-57] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: 30f94da5-90fe-427d-b797-889922ec9c97, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
[Grizzly(2)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: a95c4b88-1844-497e-a244-44262c23cf46)
[socket-mode-session-cleaner-worker-76] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: 67ed7770-2072-4eda-94a2-841277df02c3, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: 661a054f-afb6-4f23-8943-273106492076)
[socket-mode-session-cleaner-worker-96] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: a95c4b88-1844-497e-a244-44262c23cf46, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
[Grizzly(1)] INFO com.slack.api.socket_mode.SocketModeClient - New session is open (session id: 7876241b-e366-434d-93e4-f7ab92912195)
[socket-mode-session-cleaner-worker-57] INFO com.slack.api.socket_mode.SocketModeClient - onClose listener is called (session id: 661a054f-afb6-4f23-8943-273106492076, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
watertree commented 7 months ago

@filmaj when I refer to blips there are no repercussions to the service -- things work normally but the logs do indicate there was a reconnection (based on my monitoring). As far as multiple instances, I am running two instances and the problem with having an instance in an infinite loop is that the unhealthy instance still takes roughly half the requests and is unable to reply, so adding more instances is still problematic. I'm hoping that I won't need any changes and things will be back to working prior to sometime in September of last year.

oppokui commented 7 months ago

slack-longevity.tar.gz

I create a very simple code and run it for 5 days. The attached .tar contains the source code, script to compile and run, log4j.properties and bunch of logs.

It was started at 2024-01-05 07:28:44. (app.log.2) It reported the first ratelimited error at 2024-01-07 04:35:33. (app.log.2) Then it continued to report this error until now (app.log.1, app.log).

2024-01-07 04:35:32,977 INFO  [com.slack.api.socket_mode.SocketModeClient] New session is open (session id: 084bb205-494b-45d7-8384-4f1444030ceb)
2024-01-07 04:35:33,078 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: e23483a9-b457-4ce0-b69b-a9a73eeb4f31, reason: com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl did it)
2024-01-07 04:35:33,082 INFO  [com.slack.api.socket_mode.SocketModeClient] onClose listener is called (session id: 084bb205-494b-45d7-8384-4f1444030ceb, reason: Closed abnormally.)
2024-01-07 04:35:33,109 ERROR [com.slack.api.methods.impl.TeamIdCache] Failed to call auth.test API (error: status: 429, error: ratelimited, needed: null, provided: null, warning: null)
SlackApiException(response=Response{protocol=http/1.1, code=429, message=Too Many Requests, url=https://slack.com/api/auth.test}, responseBody={"ok":false,"error":"ratelimited"}, error=SlackApiErrorResponse(ok=false, warning=null, error=ratelimited, needed=null, provided=null, httpResponseHeaders=null))
    at com.slack.api.methods.impl.MethodsClientImpl.parseJsonResponseAndRunListeners(MethodsClientImpl.java:3617)
    at com.slack.api.methods.impl.MethodsClientImpl.parseJsonResponseAndRunListeners(MethodsClientImpl.java:3590)
    at com.slack.api.methods.impl.TeamIdCache.lambda$lookupOrResolve$0(TeamIdCache.java:44)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at com.slack.api.methods.impl.TeamIdCache.lookupOrResolve(TeamIdCache.java:67)
    at com.slack.api.methods.impl.TeamIdCache.lookupOrResolve(TeamIdCache.java:40)
    at com.slack.api.methods.impl.MethodsClientImpl.postFormWithTokenAndParseResponse(MethodsClientImpl.java:3488)
    at com.slack.api.methods.impl.MethodsClientImpl.appsConnectionsOpen(MethodsClientImpl.java:1301)
    at com.slack.api.methods.impl.MethodsClientImpl.appsConnectionsOpen(MethodsClientImpl.java:1306)
    at com.slack.api.Slack.issueSocketModeUrl(Slack.java:184)
    at com.slack.api.socket_mode.SocketModeClient.connectToNewEndpoint(SocketModeClient.java:65)
    at com.slack.api.socket_mode.SocketModeClient.runCloseListenersAndAutoReconnectAsNecessary(SocketModeClient.java:261)
    at com.slack.api.socket_mode.impl.SocketModeClientTyrusImpl.onClose(SocketModeClientTyrusImpl.java:242)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

The code was from SocketModeClient.runCloseListenersAndAutoReconnectAsNecessary(). Same with my guess.

It still run there. For the log, it sounds restored (not sure), and the log entry stay at "New session xxx" line. The log file was huge comparing with another impl (below).

I had another slack-app running, with the hotfix of onClose() comment. It run for 10 days, very smoothly, no exception at all. The log file was very small. Screenshot from 2024-01-10 18-21-20

filmaj commented 7 months ago

In my longevity test, I was able to reproduce the ratelimiting error after a few days.

CC @seratch, I think we should take a look at @oppokui's proposed fix.

seratch commented 7 months ago

@oppokui @watertree Thank you very much for your thorough investigation on this issue and we appreciate your patience with the disruption for a few weeks during the winter break. I just submitted a PR based on @oppokui's changes (I've added a new flag option to opt-out just in case): https://github.com/slackapi/java-slack-sdk/pull/1265 The change will be released shortly.

seratch commented 7 months ago

Published a new version (1.37.0 - https://github.com/slackapi/java-slack-sdk/releases/tag/v1.37.0 ) to the Maven Central repo. Please try the latest version out and let us know if there is still anything wrong. Thank you so much again for your time and efforts on this issue.

oppokui commented 7 months ago

Thanks, @seratch. I will use 1.37.0 to verify.

watertree commented 5 months ago

@oppokui @filmaj @seratch just want to confirm that this has been definitively fixed. Two months of zero issues -- thank you all your work on this!