radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Radical-Pilot hangs on Amarel #3074

Closed AymenFJA closed 12 months ago

AymenFJA commented 1 year ago

I saw this behavior via EnTK. Tested cases:

RCT Stack:


  python               : /cache/home/afa64/ve/facts/bin/python3
  pythonpath           :
  version              : 3.6.8
  virtualenv           : /cache/home/afa64/ve/facts

  radical.entk         : 1.41.0
  radical.gtod         : 1.41.0
  radical.pilot        : 1.41.0
  radical.saga         : 1.41.0
  radical.utils        : 1.41.0

Access mode: batch mode

res_dict = {
    'resource': 'rutgers.amarel',
    'walltime': 90,
    'cpus': 72,
    'access_schema': 'interactive',
}

from radical.entk.wfprocessor.0000.log:

1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000388 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000388 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0098 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0098 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000392 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0099 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition pipeline.0099 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0396 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition stage.0396 to state SCHEDULING
1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000396 to state SCHEDULING
1698359774.077 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000396 to state SCHEDULING
1698359774.107 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : DEBUG    : Workload submitted to Task Manager
1698359774.274 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000000 to state SCHEDULED
1698359774.380 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000000 to state SCHEDULED
1698359774.393 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000004 to state SCHEDULED
1698359774.826 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000004 to state SCHEDULED
1698359775.161 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000008 to state SCHEDULED
1698359775.658 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000008 to state SCHEDULED
1698359775.996 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000012 to state SCHEDULED
1698359776.334 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000012 to state SCHEDULED
1698359776.562 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000016 to state SCHEDULED
1698359776.775 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000016 to state SCHEDULED
1698359776.826 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000020 to state SCHEDULED
1698359777.251 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000020 to state SCHEDULED
1698359777.377 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000024 to state SCHEDULED

Then everything hangs until it times out. No task folders were created by RP in the agent sandbox.
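To see where the workflow stalls, it can help to reduce a log like the one above to the last state recorded for each entity. The following is a minimal sketch that assumes the line layout shown in the excerpt (`timestamp : logger : pid : tid : LEVEL : Transition <uid> to state <STATE>`); the helper names `last_states` and `stuck_in` are hypothetical and not part of EnTK or RP.

```python
import re

# Matches transition lines in the layout shown in the log excerpt, e.g.
# "1698359774.274 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000000 to state SCHEDULED"
LINE_RE = re.compile(
    r'^(?P<ts>\d+\.\d+) : \S+ : \d+ : \d+ : \w+\s*: '
    r'Transition (?P<uid>\S+) to state (?P<state>\w+)$'
)

def last_states(lines):
    """Return {uid: (timestamp, state)} with the last transition seen per entity."""
    result = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip non-transition lines such as DEBUG messages
        result[match.group('uid')] = (float(match.group('ts')),
                                      match.group('state'))
    return result

def stuck_in(lines, state='SCHEDULED', prefix='task.'):
    """List entities whose last recorded state equals `state` (hypothetical helper)."""
    return sorted(uid for uid, (_, st) in last_states(lines).items()
                  if st == state and uid.startswith(prefix))

SAMPLE = [
    '1698359774.076 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000396 to state SCHEDULING',
    '1698359774.107 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : DEBUG    : Workload submitted to Task Manager',
    '1698359774.274 : radical.entk.wfprocessor.0000 : 39068 : 139945427576576 : INFO     : Transition task.000000 to state SCHEDULED',
]

if __name__ == '__main__':
    print(stuck_in(SAMPLE))
```

Run against the full `radical.entk.wfprocessor.0000.log`, a result where every task's last state is SCHEDULED would be consistent with the hang described here, though it does not by itself identify the cause.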

andre-merzky commented 1 year ago

@aymen - can you please rerun with the current RU devel? This PR should have solved the problem. Thanks!

AymenFJA commented 1 year ago

> @aymen - can you please rerun with the current RU devel? This PR should have solved the problem. Thanks!

@andre-merzky same behavior with utils devel:

env at /cache/home/afa64/ve/facts exists

---------------------------------------------------------------------

PWD       : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000
ENV       : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000//env/rp_named_env.rp.env
SCRIPT    : /cache/home/afa64/ve/facts/bin/radical-pilot-create-static-ve
PREFIX    : /cache/home/afa64/ve/facts
VERSION   : 3.6
MODULES   :  apache-libcloud chardet colorama idna msgpack msgpack-python netifaces ntplib parse dill pyzmq regex requests setproctitle urllib3
DEFAULTS  : True
PYTHON    : /cache/home/afa64/ve/facts/bin/python3 (Python 3.6.8)
PYTHONPATH: /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000/rp_install/lib/python3.6/site-packages::
RCT_STACK :
  python               : /cache/home/afa64/ve/facts/bin/python3
  pythonpath           : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000/rp_install/lib/python3.6/site-packages::
  version              : 3.6.8
  virtualenv           : /cache/home/afa64/ve/facts

  radical.entk         : 1.41.0
  radical.gtod         : 1.41.0
  radical.pilot        : 1.41.0
  radical.saga         : 1.41.0
  radical.utils        : 1.42.0-v1.41.0-16-g357e032@devel

---------------------------------------------------------------------
1698419422.891 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000360 to state SCHEDULED
1698419422.894 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000360 to state SCHEDULED
1698419422.895 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000364 to state SCHEDULED
1698419422.900 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000364 to state SCHEDULED
1698419422.901 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000368 to state SCHEDULED
1698419422.904 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000368 to state SCHEDULED
1698419422.905 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000372 to state SCHEDULED
1698419422.913 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000372 to state SCHEDULED
1698419422.913 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000376 to state SCHEDULED
1698419422.920 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000376 to state SCHEDULED
1698419422.922 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000380 to state SCHEDULED
1698419422.924 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000380 to state SCHEDULED
1698419422.930 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000384 to state SCHEDULED
1698419422.932 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000384 to state SCHEDULED
1698419422.935 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000388 to state SCHEDULED
1698419422.939 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000388 to state SCHEDULED
1698419422.941 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000392 to state SCHEDULED
1698419422.944 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000392 to state SCHEDULED
1698419422.945 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000396 to state SCHEDULED
1698419422.947 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition task.000396 to state SCHEDULED
1698419422.949 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0000 to state SCHEDULED
1698419422.951 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0000 to state SCHEDULED
1698419422.953 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0004 to state SCHEDULED
1698419422.955 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0004 to state SCHEDULED
1698419422.956 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0008 to state SCHEDULED
1698419422.960 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO     : Transition stage.0008 to state SCHEDULED

andre-merzky commented 1 year ago

Thanks @AymenFJA . Alas I can't access the sandbox at /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000 - could you please tar it up and attach it, or make it otherwise available? Thanks!
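One way to package a pilot sandbox for sharing is Python's standard `tarfile` module. This is only a sketch: the function name `archive_sandbox` is hypothetical, and the output file should live outside the sandbox directory so the archive does not try to include itself.

```python
import tarfile
from pathlib import Path

def archive_sandbox(sandbox_dir, out_path):
    """Pack `sandbox_dir` into a gzip-compressed tarball at `out_path`."""
    sandbox = Path(sandbox_dir)
    if not sandbox.is_dir():
        raise FileNotFoundError('not a directory: %s' % sandbox)
    with tarfile.open(str(out_path), 'w:gz') as tar:
        # Store entries relative to the sandbox name, not the absolute path.
        tar.add(str(sandbox), arcname=sandbox.name)
    return Path(out_path)
```

For example, `archive_sandbox('/scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000', 'pilot.0000.tar.gz')` would write the tarball into the current working directory.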

AymenFJA commented 1 year ago

@andre-merzky Two sessions are attached, one for each mode: interactive_batch.zip and login_node.zip

mtitov commented 1 year ago

As a side note: should we try a different Python version, following the env setup we have in the corresponding config? https://github.com/radical-cybertools/radical.pilot/blob/7d6864e4d5a180062e41fcd8ba0135161fbce58d/src/radical/pilot/configs/resource_rutgers.json#L29-L33

AymenFJA commented 1 year ago

The EnTK task manager process seems to get stuck and is never spawned here: https://github.com/radical-cybertools/radical.entk/blob/5d7700718fb2840b512cc5cc83e0ff84bf32ea36/src/radical/entk/execman/rp/task_manager.py#L348-L361
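A generic way to tell "the child process never started" apart from "the child started and then blocked" is a startup handshake with a timeout. The sketch below is not EnTK code: `_worker` and `start_with_handshake` are hypothetical names, and it assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp

def _worker(started, stop):
    # Tell the parent that the child really began executing.
    started.set()
    # Stand-in for a manager's main loop: idle until asked to stop.
    stop.wait()

def start_with_handshake(timeout=5.0):
    """Start the child and fail loudly if it does not report in within `timeout`."""
    ctx = mp.get_context('fork')  # assumption: POSIX, fork start method
    started = ctx.Event()
    stop = ctx.Event()
    proc = ctx.Process(target=_worker, args=(started, stop))
    proc.start()
    if not started.wait(timeout):
        # The child never signalled startup: clean up and report it.
        proc.terminate()
        proc.join()
        raise RuntimeError('child did not report startup within %.1fs' % timeout)
    return proc, stop

if __name__ == '__main__':
    proc, stop = start_with_handshake()
    stop.set()
    proc.join()
```

With a check like this, a silent hang would turn into an explicit error at the point where the process is started.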

AymenFJA commented 1 year ago

Update: @mtitov and I tested this interactively on Amarel and confirmed that it only happens on 3 nodes, with different numbers of tasks. It does not happen with 4 nodes. Our conclusion is that this is a SLURM issue with 3 nodes. @andre-merzky, what do you think: should we investigate this further, and if so, how? If not, should we conclude that it is not an RP issue (which it clearly is not) and close this ticket?

AymenFJA commented 1 year ago

Update: This behavior is back with Amarel on one node. I can also confirm that the same behavior is happening on UVA Rivanna.

andre-merzky commented 12 months ago

Proposal: PR to use threaded manager
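To illustrate the general idea of a threaded manager, here is a minimal sketch in which the manager loop runs in a thread of the calling process instead of in a separately spawned process. It is not the code from the proposed PR: `ThreadedManager` and its methods are hypothetical and use only the standard library.

```python
import queue
import threading

class ThreadedManager:
    """Hypothetical stand-in for a task manager that runs in a thread."""

    def __init__(self):
        self._inbox = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.handled = []

    def start(self):
        self._thread.start()

    def submit(self, task_uid):
        self._inbox.put(task_uid)

    def _run(self):
        # Poll with a short timeout so a stop request is noticed promptly.
        while not self._stop.is_set():
            try:
                uid = self._inbox.get(timeout=0.1)
            except queue.Empty:
                continue
            self.handled.append(uid)  # placeholder for real task handling
            self._inbox.task_done()

    def stop(self, timeout=5.0):
        """Drain the inbox, stop the loop, and report whether the thread exited."""
        self._inbox.join()  # requires start() to have been called
        self._stop.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()

if __name__ == '__main__':
    manager = ThreadedManager()
    manager.start()
    for uid in ('task.000000', 'task.000004', 'task.000008'):
        manager.submit(uid)
    print(manager.stop(), manager.handled)
```

A thread avoids process creation altogether, which sidesteps a spawn that never completes. The trade-off is that the manager then shares the interpreter, and the GIL, with the rest of the application.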

AymenFJA commented 12 months ago

Closing this in favor of https://github.com/radical-cybertools/radical.entk/issues/656