radical-cybertools / ExTASY


Amber/CoCo tutorial job failing on Archer #228

Closed CharlieLaughton closed 8 years ago

CharlieLaughton commented 8 years ago

I am trying to run the Amber/CoCo job described in the tutorial on Archer. Except for setting my username and account I have touched nothing. The job fails with the following messages:

[ExTASY-tools] charlie@homer 115% python extasy_amber_coco.py --RPconfig archer.rcfg --Kconfig cocoamber.wcfg

================================================================================
 EnsembleMD (0.3.14)                                                            
================================================================================

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 2 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
Waiting for pre_loop step to complete.                                      done
Iteration 1: Waiting for simulation tasks: md.amber to complete             done
Iteration 1: Waiting for simulation tasks: md.amber to complete             done
Iteration 1: Waiting for analysis tasks: md.coco to complete2015-12-18 17:21:06,543: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : ComputeUnit error: STDERR: unit stderr contains binary data -- use file staging directives, STDOUT: unit stderr contains binary data -- use file staging directives
2015-12-18 17:21:06,543: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : Pattern execution FAILED.
2015-12-18 17:21:06,543: radical.pilot       : MainProcess                     : Thread-3       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/users/charlie/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 261, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/users/charlie/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 198, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/users/charlie/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
    sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
  File "/users/charlie/ExTASY-tools/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 741, in execute_pattern
    resource._umgr.wait_units()
  File "/users/charlie/ExTASY-tools/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

        Starting Deallocation                                               done 
[ExTASY-tools] charlie@homer 116% 

I have made the pilot job folder on Archer publicly accessible:

/work/e290/e290/e290/e290cl//work/radical.pilot.sandbox/rp.session.homer.pharm.nottingham.ac.uk.charlie.016787.0001-pilot.0000

andre-merzky commented 8 years ago

Thanks for opening the sandbox! Alas, access to /work/e290/e290/e290cl/ is denied. Could you please open that for read/exec too (and, I assume, the other intermediate directories)? Thanks!

vivek-bala commented 8 years ago

This seems to have been caused by some older modules being loaded. I have made the change now.

I was looking at the Jenkins runs for the status. Surprisingly, Jenkins reports success even though the runs actually failed! http://ci.radical-project.org/job/ExTASY-0.2/ I think it might have to do with the exit codes returned when the pilot fails.
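For illustration, here is a minimal sketch of how a CI wrapper could avoid reporting success for a failed run. It is not part of ExTASY or the Jenkins job; `run_and_report` and `main` are hypothetical names, and the only thing taken from this issue is the "Pattern execution FAILED" marker from the log above. The idea is to return a non-zero status if either the process exit code is non-zero or the failure marker appears in the output.

```python
import subprocess
import sys

# Assumption: this marker is copied from the log in this issue; other
# failure modes may print something different.
FAILURE_MARKER = "Pattern execution FAILED"


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return a status for the CI job.

    Returns 0 only if the command exited with 0 AND its combined
    stdout/stderr does not contain the failure marker.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = proc.stdout.decode("utf-8", errors="replace")
    if proc.returncode != 0:
        # Propagate the real exit code when there is one.
        return proc.returncode
    if FAILURE_MARKER in output:
        # The process claimed success but logged a failure.
        return 1
    return 0


def main(argv=None):
    """Hypothetical entry point: wrap the command given on the command line."""
    if argv is None:
        argv = sys.argv[1:]
    return run_and_report(argv)
```

A CI step could then call `sys.exit(main())` with the tutorial command as arguments, so that the job is marked as failed whenever the wrapped run fails.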

vivek-bala commented 8 years ago

The older modules were numpy/1.8.0 and scipy/0.13.* on Archer, which have since been removed.

vivek-bala commented 8 years ago

But I am not sure why the compute unit error reads "ERROR : ComputeUnit error: STDERR: unit stderr contains binary data -- use file staging directives, STDOUT: unit stderr contains binary data -- use file staging directives"

This is the STDERR of the failing CU:

pc-numpy(3):ERROR:105: Unable to locate a modulefile for 'pc-numpy/1.8.0-libsci'
pc-scipy/0.13.3-libsci(11):ERROR:151: Module 'pc-scipy/0.13.3-libsci' depends on one of the module(s) 'ðÌ'
pc-scipy/0.13.3-libsci(11):ERROR:102: Tcl command execution failed: prereq pc-numpy/1.8.0-libsci

pc-coco/0.25(8):ERROR:151: Module 'pc-coco/0.25' depends on one of the module(s) 'pc-numpy/1.9.2-mkl-python3 pc-numpy/1.9.2-mkl pc-numpy/1.9.2-libsci'
pc-coco/0.25(8):ERROR:102: Tcl command execution failed: prereq pc-numpy

aprun: file pyCoCo not found

andre-merzky commented 8 years ago

Vivek, have a look at the end of line 2: depends on one of the module(s) 'ðÌ' That looks.... unexpected...

vivek-bala commented 8 years ago

Yeah, that's an error from Archer. I think it's trying to print a module (name) which doesn't exist.

andre-merzky commented 8 years ago

Yes, that is my assumption as well. Is it possible to reproduce the error message on the command line, using the same 'module load' commands? Would it be useful to get feedback from the Archer admins?

vivek-bala commented 8 years ago

Yes, it can be reproduced on the command line, but only when we try to load the old modules. Do you think that is the cause of the reported error (ERROR : ComputeUnit error: STDERR: unit stderr contains binary data -- use file staging directives, STDOUT: unit stderr contains binary data -- use file staging directives)?

andre-merzky commented 8 years ago

Do you think that is the cause of the reported error

Yes - that output looks very binary to me. I would advise opening a ticket...
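To make the "looks binary" point concrete, here is a rough sketch of one way captured output could be classified as binary. This is not RADICAL-Pilot's actual check, which is not shown in this thread; `looks_binary` is a hypothetical helper. It also assumes the two stray characters in the module error were the bytes 0xF0 0xCC, which is how they render in Latin-1. The real bytes emitted on Archer are unknown.

```python
def looks_binary(data):
    """Return True if data (a bytes object) is not clean UTF-8 text.

    Heuristic only: a NUL byte or any invalid UTF-8 sequence counts
    as binary.
    """
    if b"\x00" in data:
        return True
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Assumption: the garbage module name was the two bytes 0xF0 0xCC.
garbled = b"depends on one of the module(s) '\xf0\xcc'\n"
clean = b"aprun: file pyCoCo not found\n"

print(looks_binary(garbled))
print(looks_binary(clean))
```

Under this heuristic a single malformed line from the module system is enough for the whole STDERR to be treated as binary. That would be consistent with the unit reporting "contains binary data" instead of the readable module errors.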

vivek-bala commented 8 years ago

This should be fixed now. I don't see the binary output on Archer.