radical-cybertools / ExTASY


enmd from master fails with grlsd on ARCHER #223

ibethune closed 8 years ago

ibethune commented 8 years ago

I started commenting on #219, but I found enough separate problems that it is worth opening a new issue rather than mixing them up with Elena's problems there (I don't see the same symptoms).

I tested this using the latest ensemblemd from the master branch and the RP release:

(extasy-test)mbp-ib:grlsd-on-archer ibethune$ radicalpilot-version 
0.37.10
(extasy-test)mbp-ib:grlsd-on-archer ibethune$ ensemblemd-version 
0.3.6-80-gc82ad5f
(extasy-test2)mbp-ib:grlsd-on-archer ibethune$ python extasy_gromacs_lsdmap.py --RPconfig archer.rcfg --Kconfig gromacslsdmap.wcfg

================================================================================
 EnsembleMD (0.3.6-80-gc82ad5f)                                                 
================================================================================

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete.                                      done
Iteration 1: Waiting for simulation tasks: md.gromacs to complete           done
Iteration 1: Waiting for analysis tasks: md.pre_lsdmap to complete2015-12-17 11:56:32,399: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : ComputeUnit error: STDERR: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found
, STDOUT: gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'
sh: trjconv: command not found

2015-12-17 11:56:32,400: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : Pattern execution FAILED.
2015-12-17 11:56:32,400: radical.pilot       : MainProcess                     : Thread-3       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/Users/ibethune/Desktop/extasy-test2/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/Users/ibethune/Desktop/extasy-test2/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/Users/ibethune/Desktop/extasy-test2/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
    sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
  File "/Users/ibethune/Desktop/extasy-test2/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 743, in execute_pattern
    resource._umgr.wait_units(uids)
  File "/Users/ibethune/Desktop/extasy-test2/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

        Starting DeallocationResource allocation cancelled. You probably ran out of walltime
        done 

I have placed the contents of the pilot directory in /work/e290/e290/shared/iain/rp.session.mbp-ib.epcc.ed.ac.uk.ibethune.016786.0003-pilot.0000
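
The STDERR captured in the log above contains two distinct environment failures: the `gromacs/5.0.0` modulefile could not be located, and `trjconv` was consequently not on the PATH. As a rough sketch of how a client-side check could turn that text into a clearer message, the following scans a unit's stderr for both patterns. The function name `diagnose_stderr` and the hint wording are hypothetical, not part of ensemblemd or RP; the patterns are taken only from the messages shown in this log.

```python
import re

# Patterns based on the STDERR text in the log above (assumed formats).
MISSING_MODULE = re.compile(r"Unable to locate a modulefile for '([^']+)'")
MISSING_COMMAND = re.compile(
    r"^(?:sh|bash): (?:line \d+: )?(\S+): command not found", re.MULTILINE
)


def diagnose_stderr(stderr):
    """Return human-readable hints for known environment failures.

    Hypothetical helper: not an ensemblemd or RADICAL-Pilot API.
    """
    hints = []
    for name in sorted(set(MISSING_MODULE.findall(stderr))):
        hints.append("module not available on the resource: " + name)
    for name in sorted(set(MISSING_COMMAND.findall(stderr))):
        hints.append("executable not on PATH: " + name)
    return hints


# The two stderr lines reported for the md.pre_lsdmap unit.
sample = (
    "gromacs(3):ERROR:105: Unable to locate a modulefile for 'gromacs/5.0.0'\n"
    "sh: trjconv: command not found\n"
)
for hint in diagnose_stderr(sample):
    print(hint)
```

Deduplicating with `set` matters here because the same text appears in both STDERR and STDOUT in the error message.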

A couple of things that I observed:

There are several fixes needed here:

vivek-bala commented 8 years ago

The updates were made directly to master, so devel was out of sync with master. I have merged the two and pushed the two fixes as well.

vivek-bala commented 8 years ago

devel and master should both be fixed now. Moving the error-handling issues to another issue.
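
For reference, the traceback in the original report shows `unit_state_cb` calling `sys.exit(1)` from inside a callback thread, after which RP forces an application shutdown and the main thread is interrupted in `wait_units`. One possible alternative, sketched here under assumptions, is for the callback to record the failure and let the main thread raise it. The class `FailureFlag`, the callback signature, the `"Failed"` state string and the unit id are all hypothetical and do not reflect the actual RP or ensemblemd interfaces.

```python
import threading


class FailureFlag:
    """Thread-safe record of the first failure reported by a callback."""

    def __init__(self):
        self._event = threading.Event()
        self._lock = threading.Lock()
        self.message = None

    def report(self, message):
        # Keep only the first message; later failures do not overwrite it.
        with self._lock:
            if self.message is None:
                self.message = message
        self._event.set()

    def wait(self, timeout):
        # Returns True as soon as a failure has been reported.
        return self._event.wait(timeout)


def unit_state_cb(unit_id, state, stderr, failures):
    # Hypothetical callback signature: record the error instead of exiting.
    if state == "Failed":
        failures.report("unit %s failed: %s" % (unit_id, stderr.strip()))


def wait_for_units(failures, all_done, poll=0.05):
    """Block in the main thread until all units finish or one fails."""
    while True:
        if failures.wait(poll):
            raise RuntimeError(failures.message)
        if all_done.is_set():
            return


failures = FailureFlag()
all_done = threading.Event()
# Simulate the callback firing in a separate thread, as in the traceback.
worker = threading.Thread(
    target=unit_state_cb,
    args=("unit.000001", "Failed", "sh: trjconv: command not found\n", failures),
)
worker.start()
try:
    wait_for_units(failures, all_done)
    outcome = "ok"
except RuntimeError as exc:
    outcome = str(exc)
worker.join()
print(outcome)
```

With this shape the exception surfaces in the main thread, so deallocation can run in an ordinary `finally` block rather than after a forced shutdown.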

ibethune commented 8 years ago

Thanks, that's now working. I have a problem with the lsdmap CU and will raise a new issue for that later.