Closed: AymenFJA closed this 12 months ago
@aymen - can you please rerun with the current RU devel? This PR should have solved the problem. Thanks!
@andre-merzky same behavior with radical.utils devel:
env at /cache/home/afa64/ve/facts exists
---------------------------------------------------------------------
PWD : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000
ENV : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000//env/rp_named_env.rp.env
SCRIPT : /cache/home/afa64/ve/facts/bin/radical-pilot-create-static-ve
PREFIX : /cache/home/afa64/ve/facts
VERSION : 3.6
MODULES : apache-libcloud chardet colorama idna msgpack msgpack-python netifaces ntplib parse dill pyzmq regex requests setproctitle urllib3
DEFAULTS : True
PYTHON : /cache/home/afa64/ve/facts/bin/python3 (Python 3.6.8)
PYTHONPATH: /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000/rp_install/lib/python3.6/site-packages::
RCT_STACK :
python : /cache/home/afa64/ve/facts/bin/python3
pythonpath : /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000/rp_install/lib/python3.6/site-packages::
version : 3.6.8
virtualenv : /cache/home/afa64/ve/facts
radical.entk : 1.41.0
radical.gtod : 1.41.0
radical.pilot : 1.41.0
radical.saga : 1.41.0
radical.utils : 1.42.0-v1.41.0-16-g357e032@devel
---------------------------------------------------------------------
1698419422.891 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000360 to state SCHEDULED
1698419422.894 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000360 to state SCHEDULED
1698419422.895 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000364 to state SCHEDULED
1698419422.900 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000364 to state SCHEDULED
1698419422.901 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000368 to state SCHEDULED
1698419422.904 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000368 to state SCHEDULED
1698419422.905 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000372 to state SCHEDULED
1698419422.913 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000372 to state SCHEDULED
1698419422.913 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000376 to state SCHEDULED
1698419422.920 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000376 to state SCHEDULED
1698419422.922 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000380 to state SCHEDULED
1698419422.924 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000380 to state SCHEDULED
1698419422.930 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000384 to state SCHEDULED
1698419422.932 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000384 to state SCHEDULED
1698419422.935 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000388 to state SCHEDULED
1698419422.939 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000388 to state SCHEDULED
1698419422.941 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000392 to state SCHEDULED
1698419422.944 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000392 to state SCHEDULED
1698419422.945 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000396 to state SCHEDULED
1698419422.947 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000396 to state SCHEDULED
1698419422.949 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition stage.0000 to state SCHEDULED
1698419422.951 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition stage.0000 to state SCHEDULED
1698419422.953 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition stage.0004 to state SCHEDULED
1698419422.955 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition stage.0004 to state SCHEDULED
1698419422.956 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition stage.0008 to state SCHEDULED
1698419422.960 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition stage.0008 to state SCHEDULED
Thanks @AymenFJA. Alas, I can't access the sandbox at /scratch/afa64/radical.pilot.sandbox/re.session.hal0155.amarel.rutgers.edu.afa64.019657.0000/pilot.0000. Could you please tar it up and attach it, or make it otherwise available? Thanks!
@andre-merzky 2 sessions are attached with 2 different modes: interactive_batch.zip login_node.zip
As a side note: should we try a different Python version? This is the env setup we have in the corresponding config: https://github.com/radical-cybertools/radical.pilot/blob/7d6864e4d5a180062e41fcd8ba0135161fbce58d/src/radical/pilot/configs/resource_rutgers.json#L29-L33
The EnTK task manager process seems to be stuck and is not getting spawned here:
https://github.com/radical-cybertools/radical.entk/blob/5d7700718fb2840b512cc5cc83e0ff84bf32ea36/src/radical/entk/execman/rp/task_manager.py#L348-L361
Update: @mtitov and I tested this issue interactively on Amarel and confirmed that it only happens on 3 nodes, and with different numbers of tasks. It doesn't happen with 4 nodes. Our conclusion is that it is a SLURM issue with 3 nodes. @andre-merzky what do you think: should we investigate it further, and if so, how? If not, should we conclude that it is a non-RP issue (which it clearly is) and close this ticket?
Update: This behavior is back with Amarel on one node. I can also confirm that the same behavior is happening on UVA Rivanna.
Proposal: PR to use threaded manager
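To make the proposal concrete, here is a rough sketch of what switching a manager loop from a child process to a thread could look like. This is not the EnTK code: `task_manager_loop`, `start_manager` and the `'SUBMITTED'` marker are hypothetical names used only for illustration. It assumes the loop only needs a pair of queues and a stop event.

```python
import multiprocessing as mp
import queue
import threading


def task_manager_loop(inbox, outbox, stop):
    """Hypothetical manager body: acknowledge each task uid until stopped."""
    while not stop.is_set():
        try:
            uid = inbox.get(timeout=0.1)
        except queue.Empty:
            # Nothing to do yet; check the stop flag again.
            continue
        outbox.put((uid, 'SUBMITTED'))


def start_manager(threaded=True):
    """Run the same loop either in a thread or in a child process."""
    if threaded:
        inbox, outbox = queue.Queue(), queue.Queue()
        stop = threading.Event()
        worker = threading.Thread(target=task_manager_loop,
                                  args=(inbox, outbox, stop), daemon=True)
    else:
        inbox, outbox = mp.Queue(), mp.Queue()
        stop = mp.Event()
        worker = mp.Process(target=task_manager_loop,
                            args=(inbox, outbox, stop), daemon=True)
    worker.start()
    return worker, inbox, outbox, stop
```

With `threaded=True` nothing is forked or spawned, so the step that appears to hang in this ticket would not be exercised at all. Whether that actually avoids the hang on Amarel or Rivanna would still need to be tested.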
Closing this in correspondence to https://github.com/radical-cybertools/radical.entk/issues/656
I saw this behavior via EnTK, tested cases:
passes (25 pipelines).
passes (50 pipelines).
fails (100 pipelines).
RCT Stack:
Access mode: batch mode
From radical.entk.wfprocessor.0000.log:
Then everything hangs until it times out. No task folders were created by RP in the agent sandbox.
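For anyone triaging a similar hang, a small script along these lines could summarize where each task or stage was last seen in the wfprocessor log. It is only a sketch: it assumes the `timestamp : logger : pid : tid : level : message` layout and the `Transition <uid> to state <STATE>` wording seen in the excerpt in this thread, and `last_states` and `count_by_state` are made-up helper names, not part of EnTK.

```python
import re
from collections import Counter

# Matches lines such as:
# 1698419422.891 : radical.entk.wfprocessor.0000 : 221460 : 139679500326656 : INFO : Transition task.000360 to state SCHEDULED
LINE_RE = re.compile(
    r'^(?P<ts>\d+\.\d+) : (?P<logger>\S+) : (?P<pid>\d+) : (?P<tid>\d+) : '
    r'(?P<level>\w+) : Transition (?P<uid>\S+) to state (?P<state>\w+)$')


def last_states(lines):
    """Return {uid: (timestamp, state)} for the latest transition of each uid."""
    latest = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            # Skip anything that is not a transition record.
            continue
        ts = float(match.group('ts'))
        uid = match.group('uid')
        if uid not in latest or ts >= latest[uid][0]:
            latest[uid] = (ts, match.group('state'))
    return latest


def count_by_state(latest):
    """Count how many uids are currently in each state."""
    return Counter(state for _, state in latest.values())
```

Calling `count_by_state(last_states(open('radical.entk.wfprocessor.0000.log')))` on a hung session would show how many entities never moved past `SCHEDULED`.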