radical-cybertools / ExTASY

ExTASY 0.2 new user defined kernels coam workflow fails on resources deallocation #241

Closed: ashkurti closed this issue 8 years ago

ashkurti commented 8 years ago

Error at the end:

[ExTASY_0.2-tools] ardita@moriarty 131% python extasy_amber_coco.py --RPconfig stampede.rcfg --Kconfig cocoamber.wcfg |& tee extasy.log

================================================================================
 EnsembleMD (0.3.14-27-g65bc062)
================================================================================

Starting Allocation                                                           ok
Verifying pattern                                                             ok
Starting pattern execution                                                    ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 2 iterations on 16 allocated core(s) on 'xsede.stampede'

Job waiting on queue...
Job is now running !
Iteration 1: Waiting for 16 simulation tasks: custom.amber to complete      done
Iteration 1: Waiting for 16 simulation tasks: custom.amber to complete      done
Iteration 1: Waiting for analysis tasks: custom.coco to complete            done
Iteration 1: Waiting for analysis tasks: custom.tleap to complete           done
Iteration 2: Waiting for 16 simulation tasks: custom.amber to complete      done
Iteration 2: Waiting for 16 simulation tasks: custom.amber to complete      done
Iteration 2: Waiting for analysis tasks: custom.coco to complete            done
Iteration 2: Waiting for analysis tasks: custom.tleap to complete           done
--------------------------------------------------------------------------------
Pattern execution successfully finished

Starting Deallocation..
2016-02-11 11:42:26,592: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Resource error:
2016-02-11 11:42:26,593: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Pattern execution FAILED.
2016-02-11 11:42:26,593: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : sys.exit from callback
Traceback (most recent call last):
  File "/users/ardita/extasy_tests/ExTASY_0.2-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/users/ardita/extasy_tests/ExTASY_0.2-tools/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 168, in pilot_state_cb
    sys.exit(1)
SystemExit: 1
Traceback (most recent call last):
  File "extasy_amber_coco.py", line 209, in <module>
    cluster.deallocate()
  File "/users/ardita/extasy_tests/ExTASY_0.2-tools/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 117, in deallocate
    self._session.close(cleanup=self._cleanup)
  File "/users/ardita/extasy_tests/ExTASY_0.2-tools/lib/python2.7/site-packages/radical/pilot/session.py", line 304, in close
    pmgr.close (terminate=terminate)
  File "/users/ardita/extasy_tests/ExTASY_0.2-tools/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 175, in close
    self.cancel_pilots()
  File "/users/ardita/extasy_tests/ExTASY_0.2-tools/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 579, in cancel_pilots
    self._worker.register_cancel_pilots_request(pilot_ids=pilot_ids)
  File "/users/ardita/extasy_tests/ExTASY_0.2-tools/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 608, in register_cancel_pilots_request
    time.sleep(0.3)
KeyboardInterrupt

ashkurti commented 8 years ago

This happens while running the CoCo/Amber workflow on both Stampede and ARCHER.
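
For anyone reading the traceback: the first part shows `sys.exit(1)` being called inside a pilot state callback that runs on `Thread-1`, not on the main thread. The snippet below is a generic Python illustration of that situation, not RADICAL-Pilot's actual code; the function names and the "FAILED" state string are hypothetical placeholders. It shows that `sys.exit` in a worker thread only raises `SystemExit` in that thread, so the main thread keeps running unless something else stops it.

```python
import sys
import threading


def state_callback(state):
    """Hypothetical callback that bails out when it sees a failed state."""
    if state == "FAILED":  # placeholder state name, an assumption
        sys.exit(1)  # raises SystemExit in the calling thread only


def run_callback_in_thread(state):
    """Run the callback on a worker thread and report the requested exit code.

    Returns the exit code passed to sys.exit, or None if the callback
    returned normally.
    """
    outcome = {"exit_code": None}

    def worker():
        try:
            state_callback(state)
        except SystemExit as exc:
            # The exception stays inside this thread; the process goes on.
            outcome["exit_code"] = exc.code

    thread = threading.Thread(target=worker, name="Thread-1")
    thread.start()
    thread.join()
    return outcome["exit_code"]


if __name__ == "__main__":
    code = run_callback_in_thread("FAILED")
    print("callback thread requested exit code:", code)
    print("main thread is still running")
```

This is only meant to explain why "sys.exit from callback" shows up as a logged error instead of ending the script cleanly; how the library reacts to it afterwards is not covered by this sketch.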

vivek-bala commented 8 years ago

Duplicate of #239

ibethune commented 8 years ago

This is fixed in the RP devel branch:

pip install --upgrade git+https://github.com/radical-cybertools/radical.pilot.git@devel#egg=radical.pilot

and you should be good to go...

ashkurti commented 8 years ago

OK, I will try this out by installing everything from the devel branch then. I thought I should use the master branch, which I did. I will keep people posted.

ibethune commented 8 years ago

Use the master branch from ensemblemd, but to get the relevant RP fix you need the devel branch of RP (possibly there will be a hotfix release soon).
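
To check which builds actually ended up in the environment after mixing branches this way, something like the following might help. It is a minimal sketch under a few assumptions: it needs Python 3.8+ for `importlib.metadata` (the environment in this issue is Python 2.7), the helper names are made up for illustration, and it assumes unreleased builds carry a git-describe style suffix like the `0.3.14-27-g65bc062` shown in the EnsembleMD banner above. The version recorded in the package metadata may be formatted differently from the banner.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Matches a git-describe style suffix such as "-27-g65bc062"
# (commits since the last tag, then "g" plus an abbreviated hash).
_GIT_DESCRIBE_SUFFIX = re.compile(r"-(\d+)-g([0-9a-f]{4,40})$")


def is_unreleased_build(version_string):
    """Return True if the version string ends in a git-describe suffix."""
    return _GIT_DESCRIBE_SUFFIX.search(version_string) is not None


def report(package_names=("radical.pilot", "radical.ensemblemd")):
    """Return one human-readable line per package describing what is installed."""
    lines = []
    for name in package_names:
        try:
            installed = version(name)
        except PackageNotFoundError:
            lines.append("%s: not installed" % name)
            continue
        if is_unreleased_build(installed):
            kind = "unreleased build"
        else:
            kind = "release or unrecognised format"
        lines.append("%s: %s (%s)" % (name, installed, kind))
    return lines


if __name__ == "__main__":
    for line in report():
        print(line)
```

If radical.pilot is still reported as a plain release after the upgrade, the devel install probably did not take effect in the active virtualenv.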

ashkurti commented 8 years ago

This is what I do:

export ENMD_INSTALL_VERSION="master"
pip install --upgrade git+https://github.com/radical-cybertools/radical.ensemblemd.git@$ENMD_INSTALL_VERSION#egg=radical.ensemblemd

And what would you do with radical pilot? I thought we install radical pilot automatically now with ensemblemd.

vivek-bala commented 8 years ago

Yes, we do, but it only installs the released version. The fix for the shutdown issue is in the devel version, so you need to update RP with the command Iain mentioned.

On Feb 11, 2016 11:53 AM, "ashkurti" wrote:

> This is what I do:
>
> export ENMD_INSTALL_VERSION="master"
> pip install --upgrade git+https://github.com/radical-cybertools/radical.ensemblemd.git@$ENMD_INSTALL_VERSION#egg=radical.ensemblemd
>
> And what would you do with radical pilot? I thought we install radical pilot automatically now with ensemblemd.

ashkurti commented 8 years ago

OK, it works on Stampede now, while on ARCHER the job is still waiting in the queue. I will close this as soon as I see it working fine on ARCHER.