icecube / skymap_scanner

A distributed system that performs a likelihood scan of event directions for IceCube real-time alerts using CPU cluster(s) and queue-based message passing.

AMQPConnector - reporting failure: AMQPConnectorSocketConnectError: TimeoutError #261

Closed by tianluyuan 8 months ago

tianluyuan commented 9 months ago

On a fairly large fraction of nodes, client jobs are getting killed prematurely when the TCP connection to the RabbitMQ broker times out. The full error log is below.

/usr/local/lib/python3.10/dist-packages/htcondor/__init__.py:49: UserWarning: Neither the environment variable CONDOR_CONFIG, /etc/condor/, /usr/local/etc/, nor ~condor/ contain a condor_config source. Therefore, we are using a null condor_config.
  _warnings.warn(message)
(env) EWMS_PILOT_BROKER_CLIENT: rabbitmq
(env) EWMS_PILOT_BROKER_ADDRESS: localhost
(env) EWMS_PILOT_BROKER_AUTH_TOKEN: ***
(env) EWMS_PILOT_CL_LOG: INFO
(env) EWMS_PILOT_CL_LOG_THIRD_PARTY: WARNING
(env) EWMS_PILOT_DUMP_TASK_OUTPUT: False
(env) EWMS_PILOT_HTCHIRP: False
(env) EWMS_PILOT_HTCHIRP_RATELIMIT_INTERVAL: 60.0
(env) EWMS_PILOT_INIT_TIMEOUT: None
(env) EWMS_PILOT_TASK_TIMEOUT: 3600
(env) EWMS_PILOT_STOP_LISTENING_ON_TASK_ERROR: True
(env) EWMS_PILOT_QUARANTINE_TIME: 0
(env) EWMS_PILOT_CONCURRENT_TASKS: 1
(env) EWMS_PILOT_PREFETCH: 1
(env) SKYSCAN_PROGRESS_INTERVAL_SEC: 60
(env) SKYSCAN_RESULT_INTERVAL_SEC: 120
(env) SKYSCAN_KILL_SWITCH_CHECK_INTERVAL: 300
(env) SKYSCAN_BROKER_CLIENT: rabbitmq
(env) SKYSCAN_BROKER_ADDRESS: 128.105.83.1
(env) SKYSCAN_BROKER_AUTH: ***
(env) SKYSCAN_MQ_TIMEOUT_TO_CLIENTS: 6000
(env) SKYSCAN_MQ_TIMEOUT_FROM_CLIENTS: 259200
(env) SKYSCAN_MQ_CLIENT_TIMEOUT_WAIT_FOR_FIRST_MESSAGE: 3600
(env) EWMS_PILOT_TASK_TIMEOUT: 3600
(env) SKYSCAN_SKYDRIVER_ADDRESS: 
(env) SKYSCAN_SKYDRIVER_AUTH: ***
(env) SKYSCAN_SKYDRIVER_SCAN_ID: 
(env) SKYSCAN_LOG: DEBUG
(env) SKYSCAN_LOG_THIRD_PARTY: WARNING
(env) SKYSCAN_EWMS_PILOT_LOG: WARNING
(env) SKYSCAN_MQ_CLIENT_LOG: INFO
(env) SKYSCAN_MINI_TEST: False
(env) SKYSCAN_CRASH_DUMMY_PROBABILITY: 0.5
2024-01-10 01:06:13.363 [    INFO] root[518916] Root Logger: '' (DEBUG)
2024-01-10 01:06:13.363 [    INFO] root[518916] Third-Party Logger: 'asyncio' (WARNING)
2024-01-10 01:06:13.363 [    INFO] root[518916] Third-Party Logger: 'classad' (WARNING)
2024-01-10 01:06:13.363 [    INFO] root[518916] Third-Party Logger: 'concurrent' (WARNING)
2024-01-10 01:06:13.363 [    INFO] root[518916] Third-Party Logger: 'google' (WARNING)
2024-01-10 01:06:13.363 [    INFO] root[518916] Third-Party Logger: 'htcondor' (WARNING)
2024-01-10 01:06:13.363 [    INFO] root[518916] Third-Party Logger: 'pika' (WARNING)
2024-01-10 01:06:13.363 [    INFO] root[518916] First-Party Logger: 'skymap_scanner' (DEBUG)
2024-01-10 01:06:13.363 [    INFO] root[518916] Specialty Logger: 'ewms-pilot' (WARNING)
2024-01-10 01:06:13.363 [    INFO] root[518916] Specialty Logger: 'mqclient' (INFO)
2024-01-10 01:06:13.363 [ WARNING] skymap_scanner.client.client[518916] client_startup_json: run00000020.evt000000000001.sub000.json
2024-01-10 01:06:13.363 [ WARNING] skymap_scanner.client.client[518916] debug_directory: None
2024-01-10 01:06:13.383 [    INFO] skymap_scanner.client.client[518916] Starting up a Skymap Scanner client for event: startup_json_dict['mq_basename']='1-8:0-64:12-512:24-1704847910'
2024-01-10 01:06:13.383 [    INFO] mqclient.rabbitmq[518916] Requested MQClient for queue 'from-clients-1-8:0-64:12-512:24-1704847910' @ 128.105.83.1
2024-01-10 01:06:13.384 [    INFO] mqclient.rabbitmq[518916] Connecting with parameters=<ConnectionParameters host=128.105.83.1 port=5672 virtual_host=/ ssl=False>
2024-01-10 01:06:23.392 [   ERROR] pika.adapters.utils.connection_workflow[518916] AMQPConnector - reporting failure: AMQPConnectorSocketConnectError: TimeoutError("TCP connection attempt timed out: '128.105.83.1'/(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('128.105.83.1', 5672))")
2024-01-10 01:06:23.392 [   ERROR] pika.adapters.utils.connection_workflow[518916] AMQP connection workflow failed: AMQPConnectionWorkflowFailed: 1 exceptions in all; last exception - AMQPConnectorSocketConnectError: TimeoutError("TCP connection attempt timed out: '128.105.83.1'/(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('128.105.83.1', 5672))"); first exception - None.
2024-01-10 01:06:23.392 [   ERROR] pika.adapters.utils.connection_workflow[518916] AMQPConnectionWorkflow - reporting failure: AMQPConnectionWorkflowFailed: 1 exceptions in all; last exception - AMQPConnectorSocketConnectError: TimeoutError("TCP connection attempt timed out: '128.105.83.1'/(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('128.105.83.1', 5672))"); first exception - None
2024-01-10 01:06:23.392 [   ERROR] pika.adapters.blocking_connection[518916] Connection workflow failed: AMQPConnectionWorkflowFailed: 1 exceptions in all; last exception - AMQPConnectorSocketConnectError: TimeoutError("TCP connection attempt timed out: '128.105.83.1'/(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('128.105.83.1', 5672))"); first exception - None
2024-01-10 01:06:23.393 [   ERROR] pika.adapters.blocking_connection[518916] Error in _create_connection().
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pika/adapters/blocking_connection.py", line 451, in _create_connection
    raise self._reap_last_connection_workflow_error(error)
pika.exceptions.AMQPConnectionError
2024-01-10 01:06:23.394 [   ERROR] ewms-pilot[518916] 
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ewms_pilot/pilot.py", line 127, in consume_and_reply
    await _consume_and_reply(
  File "/usr/local/lib/python3.10/dist-packages/ewms_pilot/htchirp_tools.py", line 228, in wrapper
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ewms_pilot/pilot.py", line 292, in _consume_and_reply
    async with out_queue.open_pub() as pub, in_queue.open_sub_manual_acking() as sub:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/mqclient/queue.py", line 186, in open_pub
    pub = await self._create_pub_queue()
  File "/usr/local/lib/python3.10/dist-packages/mqclient/queue.py", line 142, in _create_pub_queue
    return await self._broker_client.create_pub_queue(
  File "/usr/local/lib/python3.10/dist-packages/mqclient/broker_clients/rabbitmq.py", line 626, in create_pub_queue
    await q.connect()
  File "/usr/local/lib/python3.10/dist-packages/mqclient/broker_clients/rabbitmq.py", line 198, in connect
    self.channel = await super().connect()
  File "/usr/local/lib/python3.10/dist-packages/mqclient/broker_clients/rabbitmq.py", line 143, in connect
    self.connection = pika.BlockingConnection(self.parameters)
  File "/usr/local/lib/python3.10/dist-packages/pika/adapters/blocking_connection.py", line 360, in __init__
    self._impl = self._create_connection(parameters, _impl_class)
  File "/usr/local/lib/python3.10/dist-packages/pika/adapters/blocking_connection.py", line 451, in _create_connection
    raise self._reap_last_connection_workflow_error(error)
pika.exceptions.AMQPConnectionError
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/skymap_scanner/client/__main__.py", line 6, in <module>
    client.main()
  File "/usr/local/lib/python3.10/dist-packages/skymap_scanner/client/client.py", line 78, in main
    asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/ewms_pilot/htchirp_tools.py", line 228, in wrapper
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ewms_pilot/pilot.py", line 127, in consume_and_reply
    await _consume_and_reply(
  File "/usr/local/lib/python3.10/dist-packages/ewms_pilot/htchirp_tools.py", line 228, in wrapper
    ret = await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ewms_pilot/pilot.py", line 292, in _consume_and_reply
    async with out_queue.open_pub() as pub, in_queue.open_sub_manual_acking() as sub:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/mqclient/queue.py", line 186, in open_pub
    pub = await self._create_pub_queue()
  File "/usr/local/lib/python3.10/dist-packages/mqclient/queue.py", line 142, in _create_pub_queue
    return await self._broker_client.create_pub_queue(
  File "/usr/local/lib/python3.10/dist-packages/mqclient/broker_clients/rabbitmq.py", line 626, in create_pub_queue
    await q.connect()
  File "/usr/local/lib/python3.10/dist-packages/mqclient/broker_clients/rabbitmq.py", line 198, in connect
    self.channel = await super().connect()
  File "/usr/local/lib/python3.10/dist-packages/mqclient/broker_clients/rabbitmq.py", line 143, in connect
    self.connection = pika.BlockingConnection(self.parameters)
  File "/usr/local/lib/python3.10/dist-packages/pika/adapters/blocking_connection.py", line 360, in __init__
    self._impl = self._create_connection(parameters, _impl_class)
  File "/usr/local/lib/python3.10/dist-packages/pika/adapters/blocking_connection.py", line 451, in _create_connection
    raise self._reap_last_connection_workflow_error(error)
pika.exceptions.AMQPConnectionError

tianluyuan commented 8 months ago

This seems to occur intermittently; as of this writing I'm not seeing as many node failures. Could it be something on the broker side?
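
One way to separate a broker-side problem from a node-side network problem is to probe the broker's TCP port from an affected node. The log above shows a single connection attempt ("1 exceptions in all") timing out after about 10 seconds. Below is a minimal sketch using only the Python standard library; `probe_broker` and its parameters are hypothetical names for illustration and are not part of skymap_scanner, mqclient, or pika.

```python
import socket
import time


def probe_broker(host: str, port: int = 5672, attempts: int = 3,
                 timeout: float = 10.0, delay: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `attempts` tries."""
    for attempt in range(1, attempts + 1):
        try:
            # Plain TCP handshake only; no AMQP traffic is exchanged.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:  # covers timeouts and refused connections
            print(f"attempt {attempt}/{attempts} failed: {exc!r}")
            if attempt < attempts:
                time.sleep(delay)  # fixed pause before the next try
    return False
```

If the probe fails only on nodes at particular sites and succeeds elsewhere, that would point at outbound connectivity from those nodes and not at the broker. Separately, pika's `ConnectionParameters` accepts `connection_attempts`, `retry_delay`, and `socket_timeout`; whether mqclient exposes them is not shown in this thread.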

tianluyuan commented 8 months ago

This is possibly an issue at the sites listed below. Is there a way to blacklist these nodes? Setting !regexp("GP-ARGO.*", GLIDEIN_Site) does not seem to have an effect.

    124 PrivNet=GP-ARGO-dsu-backfill.23259464147e
    222 PrivNet=GP-ARGO-wichita-backfill.5287b250b431
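
The counts above have the shape of a per-value tally. As a sketch of how such a tally could be reproduced in Python, assume each failed job leaves a line containing `PrivNet=<value>` somewhere in its log or job ad. That input format and the name `tally_privnet` are assumptions made for illustration, not something the thread confirms.

```python
import re
from collections import Counter
from typing import Iterable

# Captures the value that follows "PrivNet=" up to the next whitespace.
PRIVNET_RE = re.compile(r"PrivNet=(\S+)")


def tally_privnet(lines: Iterable[str]) -> Counter:
    """Count how often each PrivNet value appears in the given lines."""
    counts: Counter = Counter()
    for line in lines:
        match = PRIVNET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling `tally_privnet(lines).most_common()` would list the values with the most failures first.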

tianluyuan commented 8 months ago

This seems to work: !regexp("Wichita", GLIDEIN_Site) && !regexp("Dakota", GLIDEIN_Site)
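
A guess at why the earlier pattern had no effect: the "GP-ARGO" string appears in the PrivNet attribute, and GLIDEIN_Site on those slots may hold a differently formatted site name. If more sites need excluding later, the expression can be generated from a list. Below is a minimal sketch; `exclude_sites` is a hypothetical helper, and how the resulting string reaches the job's requirements is not shown in this thread.

```python
from typing import Iterable


def exclude_sites(patterns: Iterable[str], attr: str = "GLIDEIN_Site") -> str:
    """Build a ClassAd expression that rejects slots whose `attr` matches any pattern."""
    clauses = []
    for pattern in patterns:
        # Escape backslashes and double quotes so the pattern stays a valid string literal.
        escaped = pattern.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'!regexp("{escaped}", {attr})')
    # With no patterns, fall back to an expression that excludes nothing.
    return " && ".join(clauses) if clauses else "true"


print(exclude_sites(["Wichita", "Dakota"]))
# prints: !regexp("Wichita", GLIDEIN_Site) && !regexp("Dakota", GLIDEIN_Site)
```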