seqeralabs / tower-agent

Agent app that allows connecting remote batch schedulers to Tower Cloud
https://cloud.tower.nf
Apache License 2.0
8 stars 4 forks source link

Tower agent reconnection fails after 5 mins #42

Closed sandsebi closed 1 year ago

sandsebi commented 2 years ago

I'm running tower-agent on LSF with nextflow version 21.10.6. I start the tw-agent script using a tmux session The following logs are seen:

08:59:16.033 INFO - Established active environments: [cli]
08:59:16.081 INFO - TOWER AGENT v0.4.3
08:59:16.081 INFO - Compatible with TOWER API v1.8
08:59:16.081 INFO - Connecting as user 'analyser' with default work directory '/work2'
08:59:16.468 INFO - Connection to Tower established
08:59:32.540 INFO - Sending response ad1fd492-e506-4acc-a69e-a35db9f5ed99'
08:59:32.959 INFO - Sending response 529f1367-4158-42c7-a0aa-d115bbb15a50'
08:59:33.784 INFO - Sending response f02facb4-519c-4345-9737-b8a58c555be3'
08:59:34.511 INFO - Sending response f3c6988a-be04-4a59-89ad-d9c4c8dabdcf'
08:59:40.702 INFO - Sending response 3d14646c-a757-49d3-bacf-e8ac370320c2'
08:59:52.850 INFO - Sending response 8d4704e6-3a42-4038-8e50-b4427cf0e218'
09:00:04.322 INFO - Sending response d0be0e45-ed80-4858-bde0-08c7f63bebc3'
09:00:16.476 INFO - Sending heartbeat
09:00:16.484 INFO - Received heartbeat
09:00:28.987 INFO - Sending response 8d2d1a07-0e46-4014-9575-bbebbac92c0d'
09:00:29.395 INFO - Sending response 0097ec6e-5340-4c77-bfc0-dec1d9cad017'
09:01:16.476 INFO - Sending heartbeat
09:01:16.485 INFO - Received heartbeat
09:02:16.475 INFO - Sending heartbeat
09:02:16.481 INFO - Received heartbeat
09:03:16.475 INFO - Sending heartbeat
09:03:16.482 INFO - Received heartbeat
09:04:16.475 INFO - Sending heartbeat
09:04:16.481 INFO - Received heartbeat
09:05:16.475 INFO - Sending heartbeat
09:05:16.481 INFO - Received heartbeat
09:06:16.475 INFO - Sending heartbeat
09:06:16.481 INFO - Received heartbeat
09:07:16.475 INFO - Sending heartbeat
09:07:16.481 INFO - Received heartbeat
09:08:16.475 INFO - Sending heartbeat
09:08:16.481 INFO - Received heartbeat
09:09:16.475 INFO - Sending heartbeat
09:09:16.481 INFO - Received heartbeat
09:10:16.475 INFO - Sending heartbeat
09:10:16.481 INFO - Received heartbeat
09:11:16.475 INFO - Sending heartbeat
09:11:16.481 INFO - Received heartbeat
09:12:16.475 INFO - Sending heartbeat
09:12:16.481 INFO - Received heartbeat
09:13:16.475 INFO - Sending heartbeat
09:13:16.481 INFO - Received heartbeat
09:14:16.475 INFO - Sending heartbeat
09:14:16.481 INFO - Received heartbeat
09:15:16.475 INFO - Sending heartbeat
09:15:16.481 INFO - Received heartbeat
09:16:16.475 INFO - Sending heartbeat
09:16:16.481 INFO - Received heartbeat
09:17:16.475 INFO - Sending heartbeat
09:17:16.481 INFO - Received heartbeat
09:18:16.475 INFO - Sending heartbeat
09:18:16.485 INFO - Received heartbeat
09:19:16.475 INFO - Sending heartbeat
09:19:16.481 INFO - Received heartbeat
09:20:16.475 INFO - Sending heartbeat
09:20:16.481 INFO - Received heartbeat
09:21:16.475 INFO - Sending heartbeat
09:21:16.481 INFO - Received heartbeat
09:22:16.475 INFO - Sending heartbeat
09:22:16.481 INFO - Received heartbeat
09:23:16.475 INFO - Sending heartbeat
09:23:16.481 INFO - Received heartbeat
09:24:16.475 INFO - Sending heartbeat
09:24:16.481 INFO - Received heartbeat
09:25:16.475 INFO - Sending heartbeat
09:25:16.481 INFO - Received heartbeat
09:26:16.475 INFO - Sending heartbeat
09:26:16.481 INFO - Received heartbeat
09:27:16.475 INFO - Sending heartbeat
09:27:16.481 INFO - Received heartbeat
09:28:16.475 INFO - Sending heartbeat
09:28:16.481 INFO - Received heartbeat
09:29:16.468 INFO - Closing to reauthenticate the session
09:29:16.475 INFO - Sending heartbeat
09:29:16.561 INFO - Connection to Tower established
09:34:16.562 INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]
jordeu commented 2 years ago

It is normal that Tower forces Agent to reconnect from time to time (to revalidate the token). The attached logs seems normal. Did you get any error log after the "INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]"?

Run the agent like this to enable more detailed logs:

LOGGER_LEVELS_IO_SEQERA_TOWER_AGENT=TRACE tw-agent ...

And post here if you get more logs when it is trying to reconnect.

sandsebi commented 2 years ago

Tried that. Still the same , no messages after the INFO log you mentioned. Please see below:

11:09:58.427 INFO - Connection to Tower established
11:09:58.427 DEBUG- Websocket session URL: https://api.tower.nf/agent/***/connect
11:14:58.429 INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]

There's nothing after the last line above.

jordeu commented 2 years ago

I've been doing some connectivity tests and I'm not able to reproduce this behavior. Is it still happening to you?

sandsebi commented 2 years ago

I haven't tested it recently and if you weren't able to reproduce, I'll close this for now and reopen if I see it happening again. Thank you!

sandsebi commented 2 years ago

I was able to reproduce this issue today. Debug logs for your reference:

$ LOGGER_LEVELS_IO_SEQERA_TOWER_AGENT=TRACE ./tw-agent 52dcaf52-4d9a-4356-8140-100ac738ec57  --work-dir=./work
08:34:57.453 INFO - Established active environments: [cli]
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower-agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower-agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower-agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower-agent'
08:34:57.463 INFO - TOWER AGENT v0.4.3
08:34:57.463 INFO - Compatible with TOWER API v1.8
08:34:57.463 INFO - Connecting as user 'analyser' with default work directory '***'
08:34:57.706 INFO - Connection to Tower established
08:34:57.706 DEBUG- Websocket session URL: https://api.tower.nf/agent/52dcaf52-4d9a-4356-8140-100ac738ec57/connect
08:35:29.597 INFO - Sending response 6b503547-d5d0-48ff-98cc-f14c3c3ec387'
08:35:29.597 TRACE- RESPONSE: CommandResponse[id=6b503547-d5d0-48ff-98cc-f14c3c3ec387; result=true
; exitStatus=0]
08:35:30.064 INFO - Sending response a11f97f9-b42b-4413-8080-410f3d80c21e'
08:35:30.065 TRACE- RESPONSE: CommandResponse[id=a11f97f9-b42b-4413-8080-410f3d80c21e; result=IBM Spectrum LSF Standard 10.1.0.10, Apr 10 2020
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

My cluster name is codon
My master name is codon-master-001
; exitStatus=0]
08:35:30.841 INFO - Sending response e398b2e8-5cf9-42de-ad26-932acce517a1'
08:35:30.841 TRACE- RESPONSE: CommandResponse[id=e398b2e8-5cf9-42de-ad26-932acce517a1; result=; exitStatus=0]
08:35:31.518 INFO - Sending response f675b55f-c9f4-4818-906a-aaf595469ea7'
08:35:40.814 INFO - Sending response 5413a240-aad1-4ae2-a3a7-ed24a5cec4e9'                                                                                                                                                                                            [16/1839]
08:35:40.814 TRACE- RESPONSE: CommandResponse[id=5413a240-aad1-4ae2-a3a7-ed24a5cec4e9; result=Job <1048040> is submitted to queue <standard>.
; exitStatus=0]
08:35:54.820 INFO - Sending response 83109aed-27a8-465f-86c7-1e1d4f114550'
08:35:54.820 TRACE- RESPONSE: CommandResponse[id=83109aed-27a8-465f-86c7-1e1d4f114550; result=JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
114079  analyser RUN   standard   codon-login-04 hl-codon-28-02 /bin/bash  May 12 13:59
1048040 analyser RUN   standard   hl-codon-28-02 hl-codon-01-04 nf-workflow-4Buco8pE9cclKL May 16 09:35
; exitStatus=0]
08:35:57.708 INFO - Sending heartbeat
08:35:57.721 INFO - Received heartbeat
08:36:03.893 INFO - Sending response c9efc382-345d-4252-9954-55f5d74788d4'
08:36:03.893 TRACE- RESPONSE: CommandResponse[id=c9efc382-345d-4252-9954-55f5d74788d4; result=JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
114079  analyser RUN   standard   codon-login-04 hl-codon-28-02 /bin/bash  May 12 13:59
; exitStatus=0]
08:36:04.553 INFO - Sending response f4fd04d7-2df6-4309-9eee-a95e111be97c'
08:36:04.553 TRACE- RESPONSE: CommandResponse[id=f4fd04d7-2df6-4309-9eee-a95e111be97c; result=1
; exitStatus=0]
08:36:05.163 INFO - Sending response dbe1e8ee-534d-426d-943e-880b5ceec562'
08:36:05.163 TRACE- RESPONSE: CommandResponse[id=dbe1e8ee-534d-426d-943e-880b5ceec562; result=TiBFIFggVCBGIEwgTyBXICB+ICB2ZXJzaW9uIDIxLjEwLjYKUHVsbGluZyBzYW5kc2ViaS9jb3Zp
ZC1zZXF1ZW5jZS1hbmFseXNpcy13b3JrZmxvdyAuLi4KIGRvd25sb2FkZWQgZnJvbSBodHRwczov
L2dpdGh1Yi5jb20vc2FuZHNlYmkvY292aWQtc2VxdWVuY2UtYW5hbHlzaXMtd29ya2Zsb3cuZ2l0
CkxhdW5jaGluZyBgc2FuZHNlYmkvY292aWQtc2VxdWVuY2UtYW5hbHlzaXMtd29ya2Zsb3dgIFtp
cnJldmVyZW50X3JpZGVdIC0gcmV2aXNpb246IDIwNWMxYTc2M2UgW3YyLWFwaS1taWdyYXRpb25d
ClNpZ25hbCBhbHJlYWR5IHVzZWQgYnkgVk06IElOVAoKCg==
; exitStatus=0]
08:36:57.708 INFO - Sending heartbeat
08:36:57.716 INFO - Received heartbeat
08:37:57.708 INFO - Sending heartbeat
08:37:57.716 INFO - Received heartbeat
08:38:57.708 INFO - Sending heartbeat
08:38:57.716 INFO - Received heartbeat
08:39:57.708 INFO - Sending heartbeat
08:39:57.716 INFO - Received heartbeat
08:40:57.708 INFO - Sending heartbeat
08:40:57.717 INFO - Received heartbeat
08:41:57.708 INFO - Sending heartbeat
08:41:57.716 INFO - Received heartbeat
08:42:57.708 INFO - Sending heartbeat
08:42:57.716 INFO - Received heartbeat
08:43:57.708 INFO - Sending heartbeat
08:43:57.716 INFO - Received heartbeat
08:44:57.708 INFO - Sending heartbeat
08:44:57.718 INFO - Received heartbeat
08:45:57.708 INFO - Sending heartbeat
08:45:57.716 INFO - Received heartbeat
08:46:57.708 INFO - Sending heartbeat
08:46:57.717 INFO - Received heartbeat
08:47:57.708 INFO - Sending heartbeat
08:47:57.716 INFO - Received heartbeat
08:48:57.708 INFO - Sending heartbeat
08:48:57.719 INFO - Received heartbeat
08:49:57.708 INFO - Sending heartbeat
08:49:57.718 INFO - Received heartbeat
08:50:57.708 INFO - Sending heartbeat
08:50:57.716 INFO - Received heartbeat
08:51:57.708 INFO - Sending heartbeat
08:51:57.716 INFO - Received heartbeat
08:52:57.708 INFO - Sending heartbeat
08:52:57.716 INFO - Received heartbeat
08:53:57.708 INFO - Sending heartbeat
08:53:57.724 INFO - Received heartbeat
08:54:57.708 INFO - Sending heartbeat
08:54:57.716 INFO - Received heartbeat
08:55:57.708 INFO - Sending heartbeat
08:55:57.717 INFO - Received heartbeat
08:56:57.708 INFO - Sending heartbeat
08:56:57.716 INFO - Received heartbeat
08:57:57.708 INFO - Sending heartbeat
08:57:57.716 INFO - Received heartbeat
08:58:57.708 INFO - Sending heartbeat
08:58:57.723 INFO - Received heartbeat
08:59:57.708 INFO - Sending heartbeat
08:59:57.716 INFO - Received heartbeat
09:00:57.708 INFO - Sending heartbeat
09:00:57.717 INFO - Received heartbeat
09:01:57.708 INFO - Sending heartbeat
09:01:57.716 INFO - Received heartbeat
09:02:57.708 INFO - Sending heartbeat
09:02:57.716 INFO - Received heartbeat
09:03:57.708 INFO - Sending heartbeat
09:03:57.716 INFO - Received heartbeat
09:04:57.708 INFO - Sending heartbeat
09:04:57.708 INFO - Closing to reauthenticate the session
09:04:57.792 INFO - Connection to Tower established
09:04:57.792 DEBUG- Websocket session URL: https://api.tower.nf/agent/52dcaf52-4d9a-4356-8140-100ac738ec57/connect
09:09:57.792 INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]
jordeu commented 2 years ago

Some questions:

sandsebi commented 2 years ago

Yes it stays like this. I waited for about 2 minutes after this and started another run, but I got an error saying "No Agent Running on our Cluster. Please try again." So I killed this process and started another.

sandsebi commented 2 years ago

I see that it happened again today:

13:11:27.840 INFO - Sending heartbeat
13:11:27.848 INFO - Received heartbeat
13:12:27.834 INFO - Closing to reauthenticate the session
13:12:27.840 INFO - Sending heartbeat
13:12:27.901 INFO - Connection to Tower established
13:12:27.901 DEBUG- Websocket session URL: https://api.tower.nf/agent/52dcaf52-4d9a-4356-8140-100ac738ec57/connect
13:17:27.902 INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]
jordeu commented 2 years ago

We've done a new release that tries to minimize these re-connection issues. Check if it works for you. https://github.com/seqeralabs/tower-agent/releases/tag/v0.4.5