radical-collaboration / extasy-bpti


"reset final state to FAILED" #17

Open kevloui opened 5 years ago

kevloui commented 5 years ago

We managed to get a job started on Blue Waters, but now we encounter a different error. See the log below:

bootstrap_0.out.log

# push final pilot state: re.session.ip-172-31-21-178.ubuntu.017928.0004 pilot.0000 FAILED
which: no radical-pilot-agent-statepush in (/scratch/sciteam/louison/radical.pilot.sandbox/re.session.ip-172-31-21-178.ubuntu.017928.0004/pilot.0000/rp_install/bin:/scratch/sciteam/louison/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.50.21/bin:/mnt/bwpy/single/bin:/mnt/bwpy/single/usr/bin:/sw/bw/bwpy/mnt/bin:/opt/bwpy/bin:/opt/cray/pmi/5.0.10-1.0000.11050.179.3.gem/bin:/opt/gcc/4.9.3/bin:/sw/xe/darshan/3.1.3/darshan-3.1.3/bin:/sw/EasyBuild/software/gnuplot/5.0.5/bin:/sw/EasyBuild/software/wget/1.19.4/bin:/sw/EasyBuild/software/git/2.17.0/bin:/sw/EasyBuild/software/cURL/7.59.0/bin:/sw/EasyBuild/software/OpenSSL/1.0.2m/bin:/sw/admin/scripts:/sw/user/scripts:/opt/xalt/0.7.6/sles11.3/libexec:/opt/xalt/0.7.6/sles11.3/bin:/opt/moab/9.1.2/sbin:/opt/torque/6.0.4/sbin:/opt/torque/6.0.4/bin:/opt/cray/mpt/7.5.0/gni/bin:/opt/cray/craype/2.5.8/bin:/opt/cray/llm/default/bin:/opt/cray/llm/default/etc:/opt/cray/xpmem/0.1-2.0502.64982.7.24.gem/bin:/opt/cray/ugni/6.0-1.0502.10863.8.28.gem/bin:/opt/cray/udreg/2.3.2-1.0502.10518.2.17.gem/bin:/opt/cray/lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.43.1-1.0502.21728.74.6/sbin:/opt/cray/lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.43.1-1.0502.21728.74.6/bin:/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/sbin:/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/bin:/opt/cray/sdb/1.1-1.0502.63652.4.27.gem/bin:/opt/cray/nodestat/2.2-1.0502.60539.1.31.gem/bin:/opt/modules/3.2.10.5/bin:/opt/moab/9.1.2/bin:/u/sciteam/louison/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:.:/usr/lib/qt3/bin:/opt/cray/bin)
Traceback (most recent call last):
  File "agent_0.cfg", line 71, in <module>
    "cu_post_exec": null, 
NameError: name 'null' is not defined

# -------------------------------------------------------------------
#
# Done, exiting (2)
#
# -------------------------------------------------------------------

vivek-bala commented 5 years ago

Hey @andre-merzky, just bringing this to your attention. It seems to be an error in radical-pilot-agent-statepush.

andre-merzky commented 5 years ago

Thanks @vivek-bala, I'll track this down! @Keverne, could you please attach agent_0.cfg from the pilot sandbox? Thank you!

kevloui commented 5 years ago

Hi all, here is the agent_0.cfg file from the pilot sandbox!

agent.cfg.log

andre-merzky commented 5 years ago

Sorry, it took me a while to understand this error - thanks for the details provided! It seems like radical-pilot-agent-statepush is not correctly installed in the pilot sandbox. Can you please list the content of 0.50.21 /scratch/sciteam/louison/radical.pilot.sandbox/re.session.ip-172-31-21-178.ubuntu.017928.0004/pilot.0000/rp_install/bin/?
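A minimal sketch of that check, for anyone who prefers to run it from Python rather than with `ls`. The helper name `find_agent_script` is hypothetical and not part of RP; it uses Python 3's `shutil.which`, and the directory to pass in is the `rp_install/bin/` path quoted above.

```python
import os
import shutil


def find_agent_script(bin_dir, name="radical-pilot-agent-statepush"):
    """Hypothetical helper: list bin_dir and look for an executable `name` in it.

    Returns (listing, path). `listing` is the sorted directory content (empty
    if the directory does not exist); `path` is the full path to the script,
    or None if it is missing or not executable.
    """
    listing = sorted(os.listdir(bin_dir)) if os.path.isdir(bin_dir) else []
    # Restrict the lookup to bin_dir instead of the full $PATH.
    path = shutil.which(name, path=bin_dir)
    return listing, path
```

If `path` comes back as `None` while other RP scripts appear in `listing`, that would point at an incomplete install rather than a `PATH` problem.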

That error happens after all is said and done, so it should not be fatal. I'll provide a fix to shield against it, but I would still like to understand how it happened. I don't see that issue on BW with the same RP release...
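On the `NameError` in the log: that is what Python raises when JSON text is interpreted as Python source, because `null` is a JSON literal and not a Python name. One plausible reading (an assumption, not confirmed in this thread) is that the failed `which` lookup left the interpreter running agent_0.cfg directly. A small sketch of the difference, using an illustrative one-key config rather than the real file:

```python
import json

# Illustrative fragment only; the key is taken from the traceback above.
cfg_text = '{"cu_post_exec": null}'

# Evaluating the JSON text as Python source fails on the `null` literal.
try:
    eval(cfg_text)
    error = None
except NameError as exc:
    error = str(exc)

# Parsing the same text as JSON succeeds and maps `null` to None.
cfg = json.loads(cfg_text)
```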

Thanks, Andre.

andre-merzky commented 5 years ago

Please give the RP branch fix/bpti_17 a try. I'll release this as soon as it is confirmed to work. Thanks!

ChrisSuess commented 5 years ago

Hi @andre-merzky, does this need to be installed locally or on Blue Waters?

andre-merzky commented 5 years ago

The RP installation from your local submission host is transferred to BW on the fly and also installed there, in the pilot sandbox. So the local install is the one you need to update.

Best, Andre.

ChrisSuess commented 5 years ago

When I install the fix/bpti_17 branch, I get the following error:

Traceback (most recent call last):
  File "runme.py", line 621, in <module>
    appman.resource_desc = res_dict
  File "/home/ubuntu/shared/ve/local/lib/python2.7/site-packages/radical/entk/appman/appmanager.py", line 195, in resource_desc
    from radical.entk.execman.rp import ResourceManager
  File "/home/ubuntu/shared/ve/local/lib/python2.7/site-packages/radical/entk/execman/rp/__init__.py", line 1, in <module>
    from resource_manager import ResourceManager
  File "/home/ubuntu/shared/ve/local/lib/python2.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 6, in <module>
    import radical.pilot as rp
ImportError: No module named pilot

andre-merzky commented 5 years ago

This looks like the installation did not actually succeed? Can you send the commands you used to install, and their output? Was the virtualenv active during installation?
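A quick way to answer both questions from inside the interpreter, as a hedged sketch: the helper names below are hypothetical, and the code assumes Python 3 (the traceback above comes from a Python 2.7 environment, where `importlib.util.find_spec` is not available).

```python
import importlib.util
import sys


def in_virtualenv():
    """Best-effort check for an active virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def module_available(name):
    """Return True if `name` (e.g. "radical.pilot") can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "radical") is itself missing.
        return False
```

Calling `module_available("radical.pilot")` in the environment used to run `runme.py` should return `True` after a successful install; `False` would match the `ImportError` shown above.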

andre-merzky commented 5 years ago

ping - does the deployment problem persist?

ChrisSuess commented 5 years ago

I created a new virtual environment so that no legacy code could cause issues. I then went through my install procedure, installing the latest RP version from pip. This worked (until #18), albeit with slightly different output. I don't know whether something changed recently or whether it was an issue on my end.

I'm cautious about saying that this deployment problem is no longer an issue, but I think it has been resolved!

andre-merzky commented 5 years ago

Thanks, Chris. Glad this is not stalling you anymore, but the recent surge in deployment issues (not only this ticket) worries me... Anyway, we probably should close this ticket unless you or Vivek see a way to reproduce this?