simonsobs / socs

Simons Observatory specific OCS agents.
BSD 2-Clause "Simplified" License
12 stars 12 forks source link

Weird timeout behavior when agent already exists #465

Open jlashner opened 1 year ago

jlashner commented 1 year ago

I was having an issue where I was trying to start an agent that already existed, but I just wanted to post the weird timeout / reconnection behavior. Below are all the logs from the ocs agent docker before the agent shut down. Looks like something is going wrong with the reconnect logic:

Attaching to hwp-pmx
hwp-pmx                   | Args: ['--instance-id', 'PMX', '--site-hub', 'ws://localhost:8001/ws', '--site-http', 'http://localhost:8001/call']
hwp-pmx                   | Installed OCS Plugins: ['socs', 'ocs']
hwp-pmx                   | 2023-06-20T17:16:48+0000 Using OCS version 0.10.1
hwp-pmx                   | 2023-06-20T17:16:48+0000 ocs: starting <class 'ocs.ocs_agent.OCSAgent'> @ observatory.PMX
hwp-pmx                   | 2023-06-20T17:16:48+0000 log_file is apparently None
hwp-pmx                   | 2023-06-20T17:16:48+0000 transport connected
hwp-pmx                   | 2023-06-20T17:16:48+0000 session joined: {'authextra': {'x_cb_node': 'f04276140250-1',
hwp-pmx                   |                'x_cb_peer': 'tcp4:172.19.0.1:56562',
hwp-pmx                   |                'x_cb_pid': 16,
hwp-pmx                   |                'x_cb_worker': 'worker001'},
hwp-pmx                   |  'authid': 'KH95-64WQ-YA3N-QWA9-F93A-TNVS',
hwp-pmx                   |  'authmethod': 'anonymous',
hwp-pmx                   |  'authprovider': 'static',
hwp-pmx                   |  'authrole': 'iocs_agent',
hwp-pmx                   |  'realm': 'test_realm',
hwp-pmx                   |  'resumable': False,
hwp-pmx                   |  'resume_token': None,
hwp-pmx                   |  'resumed': False,
hwp-pmx                   |  'serializer': 'cbor.batched',
hwp-pmx                   |  'session': 3133522230821238,
hwp-pmx                   |  'transport': {'channel_framing': 'websocket',
hwp-pmx                   |                'channel_id': {},
hwp-pmx                   |                'channel_serializer': None,
hwp-pmx                   |                'channel_type': 'tcp',
hwp-pmx                   |                'http_cbtid': None,
hwp-pmx                   |                'http_headers_received': None,
hwp-pmx                   |                'http_headers_sent': None,
hwp-pmx                   |                'is_secure': False,
hwp-pmx                   |                'is_server': False,
hwp-pmx                   |                'own': None,
hwp-pmx                   |                'own_fd': -1,
hwp-pmx                   |                'own_pid': 6,
hwp-pmx                   |                'own_tid': 6,
hwp-pmx                   |                'peer': 'tcp4:127.0.0.1:8001',
hwp-pmx                   |                'peer_cert': None,
hwp-pmx                   |                'websocket_extensions_in_use': None,
hwp-pmx                   |                'websocket_protocol': None}}
hwp-pmx                   | 2023-06-20T17:16:48+0000 Failed to register basic handlers @ observatory.PMX; agent probably running already.
hwp-pmx                   | 2023-06-20T17:16:48+0000 session left: CloseDetails(reason=<wamp.close.normal>, message='None')
hwp-pmx                   | 2023-06-20T17:16:48+0000 transport disconnected
hwp-pmx                   | 2023-06-20T17:16:48+0000 waiting for reconnect for 9.999999046325684 more seconds
hwp-pmx                   | 2023-06-20T17:16:48+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4314014df0> in 2.422696237903435 seconds.
hwp-pmx                   | 2023-06-20T17:16:49+0000 waiting for reconnect for 8.998652935028076 more seconds
hwp-pmx                   | 2023-06-20T17:16:50+0000 waiting for reconnect for 7.997452259063721 more seconds
hwp-pmx                   | 2023-06-20T17:16:51+0000 transport connected
hwp-pmx                   | 2023-06-20T17:16:51+0000 session joined: {'authextra': {'x_cb_node': 'f04276140250-1',
hwp-pmx                   |                'x_cb_peer': 'tcp4:172.19.0.1:56564',
hwp-pmx                   |                'x_cb_pid': 16,
hwp-pmx                   |                'x_cb_worker': 'worker001'},
hwp-pmx                   |  'authid': 'FPP9-SNVR-9UGX-49KG-7FFE-QGRU',
hwp-pmx                   |  'authmethod': 'anonymous',
hwp-pmx                   |  'authprovider': 'static',
hwp-pmx                   |  'authrole': 'iocs_agent',
hwp-pmx                   |  'realm': 'test_realm',
hwp-pmx                   |  'resumable': False,
hwp-pmx                   |  'resume_token': None,
hwp-pmx                   |  'resumed': False,
hwp-pmx                   |  'serializer': 'cbor.batched',
hwp-pmx                   |  'session': 2018085136488752,
hwp-pmx                   |  'transport': {'channel_framing': 'websocket',
hwp-pmx                   |                'channel_id': {},
hwp-pmx                   |                'channel_serializer': None,
hwp-pmx                   |                'channel_type': 'tcp',
hwp-pmx                   |                'http_cbtid': None,
hwp-pmx                   |                'http_headers_received': None,
hwp-pmx                   |                'http_headers_sent': None,
hwp-pmx                   |                'is_secure': False,
hwp-pmx                   |                'is_server': False,
hwp-pmx                   |                'own': None,
hwp-pmx                   |                'own_fd': -1,
hwp-pmx                   |                'own_pid': 6,
hwp-pmx                   |                'own_tid': 6,
hwp-pmx                   |                'peer': 'tcp4:127.0.0.1:8001',
hwp-pmx                   |                'peer_cert': None,
hwp-pmx                   |                'websocket_extensions_in_use': None,
hwp-pmx                   |                'websocket_protocol': None}}
hwp-pmx                   | 2023-06-20T17:16:51+0000 Failed to register basic handlers @ observatory.PMX; agent probably running already.
hwp-pmx                   | 2023-06-20T17:16:51+0000 session left: CloseDetails(reason=<wamp.close.normal>, message='None')
hwp-pmx                   | 2023-06-20T17:16:51+0000 transport disconnected
hwp-pmx                   | 2023-06-20T17:16:51+0000 waiting for reconnect for 9.999999284744263 more seconds
hwp-pmx                   | 2023-06-20T17:16:51+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4314014df0> in 1.6778378242488525 seconds.
hwp-pmx                   | 2023-06-20T17:16:51+0000 waiting for reconnect for 6.99695897102356 more seconds
hwp-pmx                   | 2023-06-20T17:16:52+0000 waiting for reconnect for 8.999045372009277 more seconds
hwp-pmx                   | 2023-06-20T17:16:52+0000 waiting for reconnect for 5.995577335357666 more seconds
hwp-pmx                   | 2023-06-20T17:16:53+0000 transport connected
hwp-pmx                   | 2023-06-20T17:16:53+0000 session joined: {'authextra': {'x_cb_node': 'f04276140250-1',
hwp-pmx                   |                'x_cb_peer': 'tcp4:172.19.0.1:56566',
hwp-pmx                   |                'x_cb_pid': 16,
hwp-pmx                   |                'x_cb_worker': 'worker001'},
hwp-pmx                   |  'authid': 'KGH5-WJJ6-4Y3F-MPAH-RCJG-G9UV',
hwp-pmx                   |  'authmethod': 'anonymous',
hwp-pmx                   |  'authprovider': 'static',
hwp-pmx                   |  'authrole': 'iocs_agent',
hwp-pmx                   |  'realm': 'test_realm',
hwp-pmx                   |  'resumable': False,
hwp-pmx                   |  'resume_token': None,
hwp-pmx                   |  'resumed': False,
hwp-pmx                   |  'serializer': 'cbor.batched',
hwp-pmx                   |  'session': 4015378747466401,
hwp-pmx                   |  'transport': {'channel_framing': 'websocket',
hwp-pmx                   |                'channel_id': {},
hwp-pmx                   |                'channel_serializer': None,
hwp-pmx                   |                'channel_type': 'tcp',
hwp-pmx                   |                'http_cbtid': None,
hwp-pmx                   |                'http_headers_received': None,
hwp-pmx                   |                'http_headers_sent': None,
hwp-pmx                   |                'is_secure': False,
hwp-pmx                   |                'is_server': False,
hwp-pmx                   |                'own': None,
hwp-pmx                   |                'own_fd': -1,
hwp-pmx                   |                'own_pid': 6,
hwp-pmx                   |                'own_tid': 6,
hwp-pmx                   |                'peer': 'tcp4:127.0.0.1:8001',
hwp-pmx                   |                'peer_cert': None,
hwp-pmx                   |                'websocket_extensions_in_use': None,
hwp-pmx                   |                'websocket_protocol': None}}
hwp-pmx                   | 2023-06-20T17:16:53+0000 Failed to register basic handlers @ observatory.PMX; agent probably running already.
hwp-pmx                   | 2023-06-20T17:16:53+0000 session left: CloseDetails(reason=<wamp.close.normal>, message='None')
hwp-pmx                   | 2023-06-20T17:16:53+0000 transport disconnected
hwp-pmx                   | 2023-06-20T17:16:53+0000 waiting for reconnect for 9.999999284744263 more seconds
hwp-pmx                   | 2023-06-20T17:16:53+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4314014df0> in 2.435338587378095 seconds.
hwp-pmx                   | 2023-06-20T17:16:53+0000 waiting for reconnect for 7.998492479324341 more seconds
hwp-pmx                   | 2023-06-20T17:16:53+0000 waiting for reconnect for 4.994020700454712 more seconds
hwp-pmx                   | 2023-06-20T17:16:54+0000 waiting for reconnect for 8.999288558959961 more seconds
hwp-pmx                   | 2023-06-20T17:16:54+0000 waiting for reconnect for 6.997734308242798 more seconds
hwp-pmx                   | 2023-06-20T17:16:54+0000 waiting for reconnect for 3.992337226867676 more seconds
hwp-pmx                   | 2023-06-20T17:16:55+0000 waiting for reconnect for 7.998593091964722 more seconds
hwp-pmx                   | 2023-06-20T17:16:55+0000 waiting for reconnect for 5.997457504272461 more seconds
hwp-pmx                   | 2023-06-20T17:16:55+0000 transport connected
hwp-pmx                   | 2023-06-20T17:16:55+0000 session joined: {'authextra': {'x_cb_node': 'f04276140250-1',
hwp-pmx                   |                'x_cb_peer': 'tcp4:172.19.0.1:56568',
hwp-pmx                   |                'x_cb_pid': 16,
hwp-pmx                   |                'x_cb_worker': 'worker001'},
hwp-pmx                   |  'authid': 'MLPV-3M7W-FSVP-5699-YFGM-JVEE',
hwp-pmx                   |  'authmethod': 'anonymous',
hwp-pmx                   |  'authprovider': 'static',
hwp-pmx                   |  'authrole': 'iocs_agent',
hwp-pmx                   |  'realm': 'test_realm',
hwp-pmx                   |  'resumable': False,
hwp-pmx                   |  'resume_token': None,
hwp-pmx                   |  'resumed': False,
hwp-pmx                   |  'serializer': 'cbor.batched',
hwp-pmx                   |  'session': 6589410483556890,
hwp-pmx                   |  'transport': {'channel_framing': 'websocket',
hwp-pmx                   |                'channel_id': {},
hwp-pmx                   |                'channel_serializer': None,
hwp-pmx                   |                'channel_type': 'tcp',
hwp-pmx                   |                'http_cbtid': None,
hwp-pmx                   |                'http_headers_received': None,
hwp-pmx                   |                'http_headers_sent': None,
hwp-pmx                   |                'is_secure': False,
hwp-pmx                   |                'is_server': False,
hwp-pmx                   |                'own': None,
hwp-pmx                   |                'own_fd': -1,
hwp-pmx                   |                'own_pid': 6,
hwp-pmx                   |                'own_tid': 6,
hwp-pmx                   |                'peer': 'tcp4:127.0.0.1:8001',
hwp-pmx                   |                'peer_cert': None,
hwp-pmx                   |                'websocket_extensions_in_use': None,
hwp-pmx                   |                'websocket_protocol': None}}
hwp-pmx                   | 2023-06-20T17:16:55+0000 Failed to register basic handlers @ observatory.PMX; agent probably running already.
hwp-pmx                   | 2023-06-20T17:16:55+0000 session left: CloseDetails(reason=<wamp.close.normal>, message='None')
hwp-pmx                   | 2023-06-20T17:16:55+0000 transport disconnected
hwp-pmx                   | 2023-06-20T17:16:55+0000 waiting for reconnect for 9.999999284744263 more seconds
hwp-pmx                   | 2023-06-20T17:16:55+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4314014df0> in 1.9530167246321493 seconds.
hwp-pmx                   | 2023-06-20T17:16:55+0000 waiting for reconnect for 2.9915502071380615 more seconds
hwp-pmx                   | 2023-06-20T17:16:56+0000 waiting for reconnect for 6.997744798660278 more seconds
hwp-pmx                   | 2023-06-20T17:16:56+0000 waiting for reconnect for 4.99638819694519 more seconds
hwp-pmx                   | 2023-06-20T17:16:56+0000 waiting for reconnect for 8.999145030975342 more seconds
hwp-pmx                   | 2023-06-20T17:16:56+0000 waiting for reconnect for 1.9912347793579102 more seconds
hwp-pmx                   | 2023-06-20T17:16:57+0000 waiting for reconnect for 5.996545076370239 more seconds
hwp-pmx                   | 2023-06-20T17:16:57+0000 waiting for reconnect for 3.995455741882324 more seconds
hwp-pmx                   | 2023-06-20T17:16:57+0000 transport connected
hwp-pmx                   | 2023-06-20T17:16:57+0000 session joined: {'authextra': {'x_cb_node': 'f04276140250-1',
hwp-pmx                   |                'x_cb_peer': 'tcp4:172.19.0.1:56570',
hwp-pmx                   |                'x_cb_pid': 16,
hwp-pmx                   |                'x_cb_worker': 'worker001'},
hwp-pmx                   |  'authid': 'X56S-VETF-AMM9-7XM7-M66C-7CT4',
hwp-pmx                   |  'authmethod': 'anonymous',
hwp-pmx                   |  'authprovider': 'static',
hwp-pmx                   |  'authrole': 'iocs_agent',
hwp-pmx                   |  'realm': 'test_realm',
hwp-pmx                   |  'resumable': False,
hwp-pmx                   |  'resume_token': None,
hwp-pmx                   |  'resumed': False,
hwp-pmx                   |  'serializer': 'cbor.batched',
hwp-pmx                   |  'session': 3974295502604361,
hwp-pmx                   |  'transport': {'channel_framing': 'websocket',
hwp-pmx                   |                'channel_id': {},
hwp-pmx                   |                'channel_serializer': None,
hwp-pmx                   |                'channel_type': 'tcp',
hwp-pmx                   |                'http_cbtid': None,
hwp-pmx                   |                'http_headers_received': None,
hwp-pmx                   |                'http_headers_sent': None,
hwp-pmx                   |                'is_secure': False,
hwp-pmx                   |                'is_server': False,
hwp-pmx                   |                'own': None,
hwp-pmx                   |                'own_fd': -1,
hwp-pmx                   |                'own_pid': 6,
hwp-pmx                   |                'own_tid': 6,
hwp-pmx                   |                'peer': 'tcp4:127.0.0.1:8001',
hwp-pmx                   |                'peer_cert': None,
hwp-pmx                   |                'websocket_extensions_in_use': None,
hwp-pmx                   |                'websocket_protocol': None}}
hwp-pmx                   | 2023-06-20T17:16:57+0000 Failed to register basic handlers @ observatory.PMX; agent probably running already.
hwp-pmx                   | 2023-06-20T17:16:57+0000 session left: CloseDetails(reason=<wamp.close.normal>, message='None')
hwp-pmx                   | 2023-06-20T17:16:57+0000 transport disconnected
hwp-pmx                   | 2023-06-20T17:16:57+0000 waiting for reconnect for 9.999999284744263 more seconds
hwp-pmx                   | 2023-06-20T17:16:57+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f4314014df0> in 1.6011101603058588 seconds.
hwp-pmx                   | 2023-06-20T17:16:57+0000 waiting for reconnect for 7.997954845428467 more seconds
hwp-pmx                   | 2023-06-20T17:16:57+0000 waiting for reconnect for 0.9907383918762207 more seconds
hwp-pmx                   | 2023-06-20T17:16:58+0000 waiting for reconnect for 4.996060371398926 more seconds
hwp-pmx                   | 2023-06-20T17:16:58+0000 waiting for reconnect for 2.994821071624756 more seconds
hwp-pmx                   | 2023-06-20T17:16:58+0000 waiting for reconnect for 8.999286651611328 more seconds
hwp-pmx                   | 2023-06-20T17:16:58+0000 waiting for reconnect for 6.997202634811401 more seconds
hwp-pmx                   | 2023-06-20T17:16:58+0000 stopping reactor
hwp-pmx                   | 2023-06-20T17:16:58+0000 Main loop terminated.
hwp-pmx exited with code 0
jlashner commented 1 year ago

This is running on the hwp_supervisor branch after rebasing onto the most recent main commit

BrianJKoopman commented 1 year ago

Sorry, can you provide steps to reproduce? Or at least more details about what you were doing? Is this occurring when you try to run an agent that already exists?

jlashner commented 1 year ago

Yeah, so this was happening because I changed the name of the docker service, and tried to bring up a new instance without removing the old container. So it was trying to re-create the PMX agent, but the old docker container was still up.