radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Invalid State Transition with pilots on OSG #1495

Closed mingtaiha closed 6 years ago

mingtaiha commented 6 years ago

Pilots running on OSG fail, which is common. However, I also get the following error (an invalid state transition from FAILED to DONE):

Traceback (most recent call last):
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/utils/threads.py", line 375, in _run
    if not self.work_cb():
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 1122, in work_cb
    ret = self._cb(topic=topic, msg=m)
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 288, in _state_sub_cb
    if not self._update_pilot(thing, publish=False):
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 315, in _update_pilot
    target, passed = rps._pilot_state_progress(pid, current, target)
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/states.py", line 68, in _pilot_state_progress
    raise ValueError('invalid transition for %s: %s -> %s' % (pid, current, target))
ValueError: invalid transition for pilot.0016: FAILED -> DONE
2017-11-04 01:57:53,531: pmgr.0000           : MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : put message: [pmgr.0000.subscriber._state_sub_cb.thread] ValueError('invalid transition for pilot.0016: FAILED -> DONE',)
2017-11-04 01:57:53,531: pmgr.0000           : MainProcess                     : pmgr.0000.subscriber._state_sub_cb: DEBUG   : ru_finalize_child (NOOP)
2017-11-04 01:57:53,533: pmgr.0000           : MainProcess                     : pmgr.0000.subscriber._state_sub_cb: INFO    : put message: [pmgr.0000.subscriber._state_sub_cb.thread] terminating                                                                    
2017-11-04 01:57:54,774: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: DEBUG   : pilot.0063 calls cb <bound method UnitManager._pilot_state_cb of <UnitManager(umgr.0000, initial)>>
2017-11-04 01:57:54,774: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: DEBUG   : pilot.0063 calls cb <bound method ComputePilot._default_state_cb of ['pilot.0063', 'osg.xsede-virt-clust', u'PMGR_ACTIVE']>
2017-11-04 01:57:54,775: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: INFO    : [Callback]: pilot pilot.0063 state: PMGR_ACTIVE.
2017-11-04 01:57:54,775: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: DEBUG   : pmgr calls cb pilot.0063 for <function pilot_state_cb at 0x7f0364b89500>
2017-11-04 01:57:54,775: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: DEBUG   : advance bulk size: 1 [False, True]
2017-11-04 01:57:54,775: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: INFO    : pilot pilot.0063 is PMGR_ACTIVE: None [None]
2017-11-04 01:57:56,325: pmgr.0000           : MainProcess                     : pmgr.0000.idler._state_pull_cb: ERROR   : abort: ValueError(u'invalid transition for pilot.0016: FAILED -> DONE',)
Traceback (most recent call last):
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/utils/threads.py", line 375, in _run 
    if not self.work_cb():
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 952, in work_cb
    ret = self._cb()
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 256, in _state_pull_cb
    if not self._update_pilot(pilot_dict, publish=True):
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/pilot_manager.py", line 315, in _update_pilot
    target, passed = rps._pilot_state_progress(pid, current, target)
  File "/home/mingtha/ve/ve.exec_model_exp_testing/local/lib/python2.7/site-packages/radical/pilot/states.py", line 68, in _pilot_state_progress
    raise ValueError('invalid transition for %s: %s -> %s' % (pid, current, target))
ValueError: invalid transition for pilot.0016: FAILED -> DONE
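
To make the error easier to discuss, here is a minimal sketch of the kind of check that seems to be rejecting the update. It is not the actual code in `states.py`: the function name `state_progress`, the shortened state list, and the rule that a final state can never be left are all assumptions for illustration. Under those assumptions, a late DONE update for a pilot that is already FAILED raises the same kind of `ValueError` as in the traceback above.

```python
# Hypothetical, simplified pilot state model (NOT RADICAL-Pilot's real code).
# Assumption: non-final states are ordered, final states are terminal.
STATE_ORDER = ['NEW', 'PMGR_LAUNCHING', 'PMGR_ACTIVE']
FINAL = {'DONE', 'FAILED', 'CANCELED'}


def state_progress(pid, current, target):
    """Return (new_state, passed_states), or raise on an invalid transition."""
    if current in FINAL:
        # Assumed rule: once a pilot is final, no further transition is allowed.
        raise ValueError('invalid transition for %s: %s -> %s'
                         % (pid, current, target))

    if target in FINAL:
        # A final state can be reached from any non-final state.
        return target, [target]

    cur_idx = STATE_ORDER.index(current)
    tgt_idx = STATE_ORDER.index(target)

    if tgt_idx <= cur_idx:
        # Stale or duplicate update: keep the current state, nothing passed.
        return current, []

    # Report every state passed on the way, including the target itself.
    return target, STATE_ORDER[cur_idx + 1:tgt_idx + 1]


if __name__ == '__main__':
    print(state_progress('pilot.0000', 'NEW', 'PMGR_ACTIVE'))
    try:
        state_progress('pilot.0016', 'FAILED', 'DONE')
    except ValueError as e:
        print('rejected: %s' % e)
```

If the real check works along these lines, the open question is why a DONE update is still delivered for a pilot that has already been recorded as FAILED.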

My radical-stack:

  python               : 2.7.12
  pythonpath           : 
  virtualenv           : /home/mingtha/ve/ve.exec_model_exp_testing

  radical.analytics    : v0.45.2-86-g99480a1@rc-v0.46.3
  radical.pilot        : 0.47-v0.46.2-182-gc994bc73@rc-v0.46.3
  radical.utils        : 0.47-v0.46-73-gd580ab1@rc-v0.46.3
  saga                 : 0.47-v0.46-32-ga2f9dedc@rc-v0.46.3

mingtaiha commented 6 years ago

Addition:

For the record, the session continues to run. However, I think this error interferes with my ability to use RA to parse the logs. That is a matter for another ticket, which I will open in the near future after additional testing.

EDIT: RA gave me some problems initially, but I was able to overcome them, so please ignore this.