radical-cybertools / ExTASY

MDEnsemble

Tool hangs at shutdown (GROMACS/LSDMap) #159

Closed oleweidner closed 9 years ago

oleweidner commented 9 years ago
[Callback]: ComputeUnit '551a6eff6bf88ba784d3048f' state changed to Done.
[Callback]: ComputeUnit '551a6eff6bf88ba784d30493' state changed to Done.
[Callback]: ComputeUnit '551a6eff6bf88ba784d3048a' state changed to StagingOutput.
[Callback]: ComputeUnit '551a6eff6bf88ba784d30488' state changed to StagingOutput.
[Callback]: ComputeUnit '551a6eff6bf88ba784d3048a' state changed to Done.
[Callback]: ComputeUnit '551a6eff6bf88ba784d30488' state changed to Done.
[Callback]: ComputePilot '551a6e436bf88ba784d30470' state changed to Canceled.

... the tool never terminates after the pilot reaches "Canceled". This might be an RP problem. Tested with 0.24 from PyPI.

oleweidner commented 9 years ago

A subsequent attempt went through:

Simulation Execution Time :  101.867
Starting Analysis
[Callback]: ComputeUnit '551a73cf6bf88ba8339908cb' state changed to StagingInput.
[Callback]: ComputeUnit '551a73cf6bf88ba8339908cb' state changed to Executing.
[Callback]: ComputeUnit '551a73cf6bf88ba8339908cb' state changed to Done.
Select + Reweighting step
[Callback]: ComputeUnit '551a73e66bf88ba8339908cc' state changed to Executing.
[Callback]: ComputeUnit '551a73e66bf88ba8339908cc' state changed to Done.
Analysis Execution time :  13.91
Closing session, exiting now ...
[Callback]: ComputePilot '551a71f26bf88ba833990894' state changed to Canceled.

vivek-bala commented 9 years ago

I guess the walltime wasn't enough in the first attempt, so some CUs had to be cancelled? That could have taken some time. How long did it stay in the hung state?

oleweidner commented 9 years ago

OK, here are the results of my tests:

TEST 1

Removed 'ALLOCATION' from the config file, which should result in an error. It does:

ExTASY version :  0.1.4-beta-8-gbcbad40
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/amber.json
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/coco.json
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/gromacs.json
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/lsdmap.json
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/mmpbsa.json
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/namd.json
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/sleep.json
Loading kernel configurations from /private/tmp/test_extasy/lib/python2.7/site-packages/radical/ensemblemd/mdkernels/configs/test.json
Session UID: rp.session.teahupoo.local.oweidner.016553.0002 
An error occurred: 'module' object has no attribute 'ALLOCATION'
Exception triggered, no session created, exiting now...
^C^C^C^C

The process hangs indefinitely, and Ctrl+C doesn't work. The hang might be an error-handling issue in ExTASY (are all exceptions caught, and is session.close() called?). The unresponsive Ctrl+C is a well-known problem with radical-pilot: https://github.com/radical-cybertools/radical.pilot/issues/448
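To make the error-handling question concrete, here is a minimal sketch of the pattern I have in mind: create the session inside a try block and close it in a finally clause, so that a config error such as the missing ALLOCATION still leads to a close() call. FakeSession, load_allocation and run_workflow are hypothetical names for illustration only; this is not ExTASY's actual code or the radical.pilot API, and it does not claim to fix the hang.

```python
import types


class FakeSession(object):
    """Hypothetical stand-in for a pilot session; NOT the radical.pilot API."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def load_allocation(config):
    # Raises AttributeError when ALLOCATION is missing, as in TEST 1.
    return config.ALLOCATION


def run_workflow(config, session_factory=FakeSession):
    """Return (exit_code, session); the session is closed on every path."""
    session = None
    try:
        session = session_factory()
        load_allocation(config)
        return 0, session
    except Exception as exc:
        print("An error occurred: %s" % exc)
        return 1, session
    finally:
        # Runs on success and on failure, before the function returns.
        if session is not None:
            session.close()


# Usage: a config without ALLOCATION fails, but the session is still closed.
config = types.SimpleNamespace(REMOTE_HOST="localhost")
exit_code, session = run_workflow(config)
```

If the real session is closed this way and the process still hangs, that would point further towards the RP layer rather than ExTASY's error handling.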

TEST 2

Forced the pilot to fail by using 'localhost' with an allocation set (allocations are not supported on localhost). This results in an error and a subsequent clean shutdown:

Pilot pilot.0000 has FAILED. Can't recover.
Pilot log:- 
{'timestamp': datetime.datetime(2015, 4, 28, 6, 57, 33, 638000), 'message': u'Using bootstrapper /private/tmp/test_extasy/lib/python2.7/site-packages/radical.pilot-0.28-py2.7.egg/radical/pilot/bootstrapper/default_bootstrapper.sh'}
{'timestamp': datetime.datetime(2015, 4, 28, 6, 57, 33, 639000), 'message': u"Copying bootstrapper 'file://localhost//private/tmp/test_extasy/lib/python2.7/site-packages/radical.pilot-0.28-py2.7.egg/radical/pilot/bootstrapper/default_bootstrapper.sh' to agent sandbox (file://localhost/Users/oweidner/radical.pilot.sandbox/rp.session.teahupoo.local.oweidner.016553.0003-pilot.0000//pilot_bootstrapper.sh)."}
{'timestamp': datetime.datetime(2015, 4, 28, 6, 57, 34, 423000), 'message': u"Copying sdist 'file://localhost//private/tmp/test_extasy/lib/python2.7/site-packages/radical.utils-0.28-py2.7.egg/radical/utils/radical.utils-0.28.tar.gz' to sdist sandbox (file://localhost/Users/oweidner/radical.pilot.sandbox/rp.session.teahupoo.local.oweidner.016553.0003-pilot.0000/)."}
{'timestamp': datetime.datetime(2015, 4, 28, 6, 57, 34, 647000), 'message': u"Copying sdist 'file://localhost//tmp/test_extasy/lib/python2.7/site-packages/saga_python-0.28-py2.7.egg/saga/saga-python-0.28.tar.gz' to sdist sandbox (file://localhost/Users/oweidner/radical.pilot.sandbox/rp.session.teahupoo.local.oweidner.016553.0003-pilot.0000/)."}
{'timestamp': datetime.datetime(2015, 4, 28, 6, 57, 34, 870000), 'message': u"Copying sdist 'file://localhost//private/tmp/test_extasy/lib/python2.7/site-packages/radical.pilot-0.28-py2.7.egg/radical/pilot/controller/..//radical.pilot-0.28.tar.gz' to sdist sandbox (file://localhost/Users/oweidner/radical.pilot.sandbox/rp.session.teahupoo.local.oweidner.016553.0003-pilot.0000/)."}
{'timestamp': datetime.datetime(2015, 4, 28, 6, 57, 37, 596000), 'message': u'Submitting SAGA job with description: {\'Queue\': \'normal\', \'Executable\': \'/bin/bash\', \'TotalPhysicalMemory\': None, \'WorkingDirectory\': \'/Users/oweidner/radical.pilot.sandbox/rp.session.teahupoo.local.oweidner.016553.0003-pilot.0000/\', \'Project\': \'none\', \'WallTimeLimit\': 20, \'Arguments\': [\'-l pilot_bootstrapper.sh\', " -b \'radical.utils-0.28.tar.gz:saga-python-0.28.tar.gz:radical.pilot-0.28.tar.gz\' -c \'2\' -d \'10\' -g \'/Users/oweidner/radical.pilot.sandbox/ve_localhost\' -j \'FORK\' -k \'MPIEXEC\' -l \'FORK\' -m \'extasy-db.epcc.ed.ac.uk:27017\' -n \'radicalpilot\' -o \'SHELL\' -p \'pilot.0000\' -q \'CONTINUOUS\' -r \'20\' -s \'rp.session.teahupoo.local.oweidner.016553.0003\' -t \'multicore\' -u \'create\' -v \'debug\' -a \'extasy:extasyproject\'"], \'Error\': \'agent.err\', \'Output\': \'agent.out\', \'TotalCPUCount\': 2}'}
{'timestamp': datetime.datetime(2015, 4, 28, 6, 57, 37, 601000), 'message': u"Pilot launching failed! ('JobDescription.Project' (none) is not supported by adaptor saga.adaptor.shell_job (/tmp/test_extasy/lib/python2.7/site-packages/saga_python-0.28-py2.7.egg/saga/job/service.py +300 (create_job)  :  raise se.BadParameter._log (self._logger, msg)))"}
Pilot STDOUT : 
Pilot STDERR : 
Execution was interrupted
Closing session, exiting now ...

CONCLUSION

I think this ticket can be closed as things seem to terminate properly in the common error cases. Hijacking terminals is a known problem with radical-pilot and will be addressed as part of the "Refactor-2" milestone (https://github.com/radical-cybertools/radical.pilot/milestones/MS-Refactor-2) in radical-pilot (Andre might have details w.r.t. the timeline).

On a related note

During the tests above I realized that it is not possible to run ExTASY locally (REMOTE_HOST=localhost). This might be a completely irrelevant use case, but it could come in handy for development purposes. Currently the ExTASY resource config file requires me to set ALLOCATION (otherwise I get the error "An error occurred: 'module' object has no attribute 'ALLOCATION'"). However, if I set it to a bogus value, I get a different error ("Pilot launching failed! ('JobDescription.Project' (bogus) is not supported by adaptor saga.adaptor.shell_job)"). I think this could be easily fixed by making ALLOCATION an optional parameter in the resource.cfg.
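A rough sketch of what "optional" could look like, assuming the resource config is loaded as a Python module (which the "'module' object has no attribute" message suggests). REMOTE_HOST and ALLOCATION are the names from the config file; QUEUE, build_pilot_description and the dictionary keys are assumptions made for this example and are not taken from the ExTASY source.

```python
import types


def build_pilot_description(config):
    """Collect pilot settings, treating ALLOCATION and QUEUE as optional."""
    description = {"resource": config.REMOTE_HOST}

    # getattr with a default avoids the AttributeError seen in TEST 1.
    allocation = getattr(config, "ALLOCATION", None)
    if allocation:
        # Only pass a project on when one is configured, so adaptors that
        # reject a project (as the shell adaptor did in TEST 2) never see it.
        description["project"] = allocation

    queue = getattr(config, "QUEUE", None)
    if queue:
        description["queue"] = queue

    return description


# Usage: a local config needs neither ALLOCATION nor QUEUE.
local_cfg = types.SimpleNamespace(REMOTE_HOST="localhost")
remote_cfg = types.SimpleNamespace(
    REMOTE_HOST="example.cluster", ALLOCATION="my-project", QUEUE="normal"
)
local_desc = build_pilot_description(local_cfg)
remote_desc = build_pilot_description(remote_cfg)
```

With this, a localhost run simply omits the project instead of failing on a missing or bogus value.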

ibethune commented 9 years ago

We don't support (in 0.1) running locally. I will add this to the list of things to test for 0.2.

Let's leave the ticket open so Vivek can investigate the cause of the hang when there is a config file error.

andre-merzky commented 9 years ago

This is indeed most likely an RP problem, as Ole said -- we have a couple of tickets in the context of shutdowns.

vivek-bala commented 9 years ago

I have made the allocation and queue parameters optional and the error should now be more informative.

vivek-bala commented 9 years ago

I have tried a couple of methods to exit properly from ExTASY. A sys.exit() does not work, though; as Ole and Andre said, this might come from the RP layer.
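For reference, one general Python behaviour that can make sys.exit() appear not to work: it only raises SystemExit in the calling thread, and the interpreter then waits for any remaining non-daemon threads before it exits. The sketch below demonstrates that with a plain sleeping thread in a child process. It is not a diagnosis of this hang; whether RP keeps such threads or processes alive is an assumption I have not verified. CHILD and time_child_exit are made-up names.

```python
import subprocess
import sys
import time

# Child script: start one background thread, then call sys.exit(0) in main.
CHILD = """
import sys, threading, time
t = threading.Thread(target=time.sleep, args=(1.5,))
t.daemon = {daemon}
t.start()
sys.exit(0)
"""


def time_child_exit(daemon):
    """Run the child script and return how long it took to terminate."""
    start = time.time()
    subprocess.check_call([sys.executable, "-c", CHILD.format(daemon=daemon)])
    return time.time() - start


# Usage: compare a non-daemon thread (the default) with a daemon thread.
slow = time_child_exit(daemon=False)  # waits for the sleeping thread
fast = time_child_exit(daemon=True)   # exits without waiting
print("non-daemon: %.2fs, daemon: %.2fs" % (slow, fast))
```

If something similar is happening here, the process would stay alive until the background workers stop, no matter where sys.exit() is called in the main thread.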

vivek-bala commented 9 years ago

I have an example in another branch (local_example) for running ExTASY locally. It could use some more comments/documentation, though. I will add those for further discussion in the next meeting.