radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Connection reset during workflow after stage is complete. pika setting is probably still needed #137

Closed (Weiming-Hu closed this issue 3 years ago)

Weiming-Hu commented 3 years ago

I tried running the workflow without the pika heartbeat settings, that is, without the following two lines in my user code:

pika.connection.Parameters.DEFAULT_HEARTBEAT_INTERVAL = 0
pika.connection.Parameters.DEFAULT_HEARTBEAT_TIMEOUT = 0

But then I got the following error in my client-side sandbox:

1612651593.608 : radical.entk.task_manager.0000 : 15773 : 140166928791296 : INFO     : Transition task.0000 to EXECUTED
1612651593.608 : radical.entk.task_manager.0000 : 15773 : 140166928791296 : DEBUG    : task.0000 (EXECUTED) to sync with amgr
1612651593.609 : radical.entk.task_manager.0000 : 15773 : 140166928791296 : ERROR    : Transition task.0000 to state EXECUTED failed, error: (-1, "ConnectionResetError(104, 'Connection reset by peer')")
Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/base/task_manager.py", line 206, in _advance
    self._sync_with_master(obj, obj_type, channel, queue)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/base/task_manager.py", line 153, in _sync_with_master
    properties=pika.BasicProperties(correlation_id=corr_id))
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2120, in basic_publish
    mandatory, immediate)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2207, in publish
    self._flush_output()
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 477, in _flush_output
    result.reason_text)
pika.exceptions.ConnectionClosed: (-1, "ConnectionResetError(104, 'Connection reset by peer')")
1612651593.611 : radical.entk.task_manager.0000 : 15773 : 140166928791296 : DEBUG    : task.0000 (DESCRIBED) to sync with amgr
1612651593.611 : radical.entk.task_manager.0000 : 15773 : 140166928791296 : ERROR    : Error in RP callback thread:
Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/base/task_manager.py", line 206, in _advance
    self._sync_with_master(obj, obj_type, channel, queue)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/base/task_manager.py", line 153, in _sync_with_master
    properties=pika.BasicProperties(correlation_id=corr_id))
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2120, in basic_publish
    mandatory, immediate)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2207, in publish
    self._flush_output()
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1292, in _flush_output
    *waiters)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 477, in _flush_output
    result.reason_text)
pika.exceptions.ConnectionClosed: (-1, "ConnectionResetError(104, 'Connection reset by peer')")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 251, in unit_state_cb
    mq_channel, '%s-cb-to-sync' % self._sid)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/base/task_manager.py", line 213, in _advance
    self._sync_with_master(obj, obj_type, channel, queue)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/base/task_manager.py", line 153, in _sync_with_master
    properties=pika.BasicProperties(correlation_id=corr_id))
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2120, in basic_publish
    mandatory, immediate)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2206, in publish
    immediate=immediate)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/pika/channel.py", line 415, in basic_publish
    raise exceptions.ChannelClosed()
pika.exceptions.ChannelClosed

When I added those two lines back to my user code, the workflow ran correctly again. I suppose there are still some leftover issues with pika?
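
For reference, here is a minimal sketch of applying the two settings above defensively, before any connection is created. The helper name disable_heartbeat is hypothetical and is not part of pika or EnTK; it only sets the attributes that the installed Parameters class already defines, on the assumption that not every pika release defines both names.

```python
def disable_heartbeat(parameters_cls):
    """Set the heartbeat defaults to 0 on a pika-style Parameters class.

    Hypothetical helper: only attributes that already exist on the class
    are changed, so a release that lacks one of the names is left alone.
    Returns the list of attribute names that were set.
    """
    changed = []
    for name in ("DEFAULT_HEARTBEAT_INTERVAL", "DEFAULT_HEARTBEAT_TIMEOUT"):
        if hasattr(parameters_cls, name):
            setattr(parameters_cls, name, 0)
            changed.append(name)
    return changed


if __name__ == "__main__":
    try:
        import pika
    except ImportError:
        print("pika is not installed; nothing to patch")
    else:
        # Same class that the two lines in the user code modify.
        print(disable_heartbeat(pika.connection.Parameters))
```

As with the original two lines, this would need to run before the workflow opens its connections, since it only changes class-level defaults.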

Thank you

lee212 commented 3 years ago

@Weiming-Hu, can you confirm your radical-stack?

We might have to reopen this issue: https://github.com/radical-cybertools/radical.entk/issues/509

Weiming-Hu commented 3 years ago

Sure thing. Please see below:

(venv_Predictability) wuh20@cheyenne2:~> radical-stack

  python               : /glade/u/home/wuh20/venv_Predictability/bin/python3
  pythonpath           :
  version              : 3.7.9
  virtualenv           : /glade/u/home/wuh20/venv_Predictability

  radical.analytics    : 1.5.0
  radical.entk         : 1.5.8
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.12
  radical.saga         : 1.5.9
  radical.utils        : 1.5.9

(venv_Predictability) wuh20@cheyenne2:~>

lee212 commented 3 years ago

Thank you, @Weiming-Hu. Can you try again after updating entk to v1.5.12? The reported version string can look complicated, for example radical.entk : 1.5.12-v1.5.12@HEAD-detached-at-v1.5.12, but if radical-stack shows 1.5.12, you have the updated pika handling.
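
A rough sketch of that check, assuming the version string always starts with an X.Y.Z release number as in the example above. The function names are hypothetical, and the 1.5.12 threshold is taken from this comment, not from any EnTK API.

```python
import re


def release_tuple(version):
    """Return the leading X.Y.Z of a version string as a tuple of ints."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("no X.Y.Z prefix in %r" % version)
    return tuple(int(part) for part in match.groups())


def has_updated_pika_handling(entk_version, minimum=(1, 5, 12)):
    """True if the reported radical.entk release is at least `minimum`."""
    return release_tuple(entk_version) >= minimum


if __name__ == "__main__":
    for reported in ("1.5.8", "1.5.12-v1.5.12@HEAD-detached-at-v1.5.12"):
        print(reported, has_updated_pika_handling(reported))
```

The comparison uses integer tuples, not strings, so that 1.5.12 is correctly ordered after 1.5.8.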

Weiming-Hu commented 3 years ago

Got it. I will try again with the updated version and report back here. Thanks.