Closed sandsebi closed 1 year ago
It is normal that Tower forces Agent to reconnect from time to time (to revalidate the token). The attached logs seems normal. Did you get any error log after the "INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]"?
Run the agent like this to enable more detailed logs:
LOGGER_LEVELS_IO_SEQERA_TOWER_AGENT=TRACE tw-agent ...
And post here if you get more logs when it is trying to reconnect.
Tried that. Still the same , no messages after the INFO log you mentioned. Please see below:
11:09:58.427 INFO - Connection to Tower established
11:09:58.427 DEBUG- Websocket session URL: https://api.tower.nf/agent/***/connect
11:14:58.429 INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]
There's nothing after the last line above.
I've been doing some connectivity tests and I'm not able to reproduce this behavior. Is it still happening to you?
I haven't tested it recently and if you weren't able to reproduce, I'll close this for now and reopen if I see it happening again. Thank you!
I was able to reproduce this issue today. Debug logs for your reference:
$ LOGGER_LEVELS_IO_SEQERA_TOWER_AGENT=TRACE ./tw-agent 52dcaf52-4d9a-4356-8140-100ac738ec57 --work-dir=./work
08:34:57.453 INFO - Established active environments: [cli]
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower-agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower-agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower.agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower-agent'
08:34:57.456 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera-tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io.seqera.tower.agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera.tower-agent'
08:34:57.461 INFO - Setting log level 'TRACE' for logger: 'io-seqera-tower-agent'
08:34:57.463 INFO - TOWER AGENT v0.4.3
08:34:57.463 INFO - Compatible with TOWER API v1.8
08:34:57.463 INFO - Connecting as user 'analyser' with default work directory '***'
08:34:57.706 INFO - Connection to Tower established
08:34:57.706 DEBUG- Websocket session URL: https://api.tower.nf/agent/52dcaf52-4d9a-4356-8140-100ac738ec57/connect
08:35:29.597 INFO - Sending response 6b503547-d5d0-48ff-98cc-f14c3c3ec387'
08:35:29.597 TRACE- RESPONSE: CommandResponse[id=6b503547-d5d0-48ff-98cc-f14c3c3ec387; result=true
; exitStatus=0]
08:35:30.064 INFO - Sending response a11f97f9-b42b-4413-8080-410f3d80c21e'
08:35:30.065 TRACE- RESPONSE: CommandResponse[id=a11f97f9-b42b-4413-8080-410f3d80c21e; result=IBM Spectrum LSF Standard 10.1.0.10, Apr 10 2020
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
My cluster name is codon
My master name is codon-master-001
; exitStatus=0]
08:35:30.841 INFO - Sending response e398b2e8-5cf9-42de-ad26-932acce517a1'
08:35:30.841 TRACE- RESPONSE: CommandResponse[id=e398b2e8-5cf9-42de-ad26-932acce517a1; result=; exitStatus=0]
08:35:31.518 INFO - Sending response f675b55f-c9f4-4818-906a-aaf595469ea7'
08:35:40.814 INFO - Sending response 5413a240-aad1-4ae2-a3a7-ed24a5cec4e9' [16/1839]
08:35:40.814 TRACE- RESPONSE: CommandResponse[id=5413a240-aad1-4ae2-a3a7-ed24a5cec4e9; result=Job <1048040> is submitted to queue <standard>.
; exitStatus=0]
08:35:54.820 INFO - Sending response 83109aed-27a8-465f-86c7-1e1d4f114550'
08:35:54.820 TRACE- RESPONSE: CommandResponse[id=83109aed-27a8-465f-86c7-1e1d4f114550; result=JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
114079 analyser RUN standard codon-login-04 hl-codon-28-02 /bin/bash May 12 13:59
1048040 analyser RUN standard hl-codon-28-02 hl-codon-01-04 nf-workflow-4Buco8pE9cclKL May 16 09:35
; exitStatus=0]
08:35:57.708 INFO - Sending heartbeat
08:35:57.721 INFO - Received heartbeat
08:36:03.893 INFO - Sending response c9efc382-345d-4252-9954-55f5d74788d4'
08:36:03.893 TRACE- RESPONSE: CommandResponse[id=c9efc382-345d-4252-9954-55f5d74788d4; result=JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
114079 analyser RUN standard codon-login-04 hl-codon-28-02 /bin/bash May 12 13:59
; exitStatus=0]
08:36:04.553 INFO - Sending response f4fd04d7-2df6-4309-9eee-a95e111be97c'
08:36:04.553 TRACE- RESPONSE: CommandResponse[id=f4fd04d7-2df6-4309-9eee-a95e111be97c; result=1
; exitStatus=0]
08:36:05.163 INFO - Sending response dbe1e8ee-534d-426d-943e-880b5ceec562'
08:36:05.163 TRACE- RESPONSE: CommandResponse[id=dbe1e8ee-534d-426d-943e-880b5ceec562; result=TiBFIFggVCBGIEwgTyBXICB+ICB2ZXJzaW9uIDIxLjEwLjYKUHVsbGluZyBzYW5kc2ViaS9jb3Zp
ZC1zZXF1ZW5jZS1hbmFseXNpcy13b3JrZmxvdyAuLi4KIGRvd25sb2FkZWQgZnJvbSBodHRwczov
L2dpdGh1Yi5jb20vc2FuZHNlYmkvY292aWQtc2VxdWVuY2UtYW5hbHlzaXMtd29ya2Zsb3cuZ2l0
CkxhdW5jaGluZyBgc2FuZHNlYmkvY292aWQtc2VxdWVuY2UtYW5hbHlzaXMtd29ya2Zsb3dgIFtp
cnJldmVyZW50X3JpZGVdIC0gcmV2aXNpb246IDIwNWMxYTc2M2UgW3YyLWFwaS1taWdyYXRpb25d
ClNpZ25hbCBhbHJlYWR5IHVzZWQgYnkgVk06IElOVAoKCg==
; exitStatus=0]
08:36:57.708 INFO - Sending heartbeat
08:36:57.716 INFO - Received heartbeat
08:37:57.708 INFO - Sending heartbeat
08:37:57.716 INFO - Received heartbeat
08:38:57.708 INFO - Sending heartbeat
08:38:57.716 INFO - Received heartbeat
08:39:57.708 INFO - Sending heartbeat
08:39:57.716 INFO - Received heartbeat
08:40:57.708 INFO - Sending heartbeat
08:40:57.717 INFO - Received heartbeat
08:41:57.708 INFO - Sending heartbeat
08:41:57.716 INFO - Received heartbeat
08:42:57.708 INFO - Sending heartbeat
08:42:57.716 INFO - Received heartbeat
08:43:57.708 INFO - Sending heartbeat
08:43:57.716 INFO - Received heartbeat
08:44:57.708 INFO - Sending heartbeat
08:44:57.718 INFO - Received heartbeat
08:45:57.708 INFO - Sending heartbeat
08:45:57.716 INFO - Received heartbeat
08:46:57.708 INFO - Sending heartbeat
08:46:57.717 INFO - Received heartbeat
08:47:57.708 INFO - Sending heartbeat
08:47:57.716 INFO - Received heartbeat
08:48:57.708 INFO - Sending heartbeat
08:48:57.719 INFO - Received heartbeat
08:49:57.708 INFO - Sending heartbeat
08:49:57.718 INFO - Received heartbeat
08:50:57.708 INFO - Sending heartbeat
08:50:57.716 INFO - Received heartbeat
08:51:57.708 INFO - Sending heartbeat
08:51:57.716 INFO - Received heartbeat
08:52:57.708 INFO - Sending heartbeat
08:52:57.716 INFO - Received heartbeat
08:53:57.708 INFO - Sending heartbeat
08:53:57.724 INFO - Received heartbeat
08:54:57.708 INFO - Sending heartbeat
08:54:57.716 INFO - Received heartbeat
08:55:57.708 INFO - Sending heartbeat
08:55:57.717 INFO - Received heartbeat
08:56:57.708 INFO - Sending heartbeat
08:56:57.716 INFO - Received heartbeat
08:57:57.708 INFO - Sending heartbeat
08:57:57.716 INFO - Received heartbeat
08:58:57.708 INFO - Sending heartbeat
08:58:57.723 INFO - Received heartbeat
08:59:57.708 INFO - Sending heartbeat
08:59:57.716 INFO - Received heartbeat
09:00:57.708 INFO - Sending heartbeat
09:00:57.717 INFO - Received heartbeat
09:01:57.708 INFO - Sending heartbeat
09:01:57.716 INFO - Received heartbeat
09:02:57.708 INFO - Sending heartbeat
09:02:57.716 INFO - Received heartbeat
09:03:57.708 INFO - Sending heartbeat
09:03:57.716 INFO - Received heartbeat
09:04:57.708 INFO - Sending heartbeat
09:04:57.708 INFO - Closing to reauthenticate the session
09:04:57.792 INFO - Connection to Tower established
09:04:57.792 DEBUG- Websocket session URL: https://api.tower.nf/agent/52dcaf52-4d9a-4356-8140-100ac738ec57/connect
09:09:57.792 INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]
Some questions:
Yes it stays like this. I waited for about 2 minutes after this and started another run, but I got an error saying "No Agent Running on our Cluster. Please try again." So I killed this process and started another.
I see that it happened again today:
13:11:27.840 INFO - Sending heartbeat
13:11:27.848 INFO - Received heartbeat
13:12:27.834 INFO - Closing to reauthenticate the session
13:12:27.840 INFO - Sending heartbeat
13:12:27.901 INFO - Connection to Tower established
13:12:27.901 DEBUG- Websocket session URL: https://api.tower.nf/agent/52dcaf52-4d9a-4356-8140-100ac738ec57/connect
13:17:27.902 INFO - Closed after 0d 0h 5m 0s. [trying to reconnect in 1 minutes]
We've done a new release that tries to minimize these re-connection issues. Check if it works for you. https://github.com/seqeralabs/tower-agent/releases/tag/v0.4.5
I'm running tower-agent on LSF with nextflow version 21.10.6. I start the tw-agent script using a
tmux
session The following logs are seen: