radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

repex t-remd fails in workflow machine #67

Closed antonst closed 8 years ago

antonst commented 8 years ago

The run fails with:

2015-10-20 15:44:13,558: radical.repex       : MainProcess                     : Thread-5       : INFO    : ComputeUnit 'unit.000000' state changed to StagingOutput.
2015-10-20 15:44:14,662: radical.repex       : MainProcess                     : Thread-5       : INFO    : ComputeUnit 'unit.000002' state changed to PendingOutputStaging.
2015-10-20 15:44:14,662: radical.repex       : MainProcess                     : Thread-5       : INFO    : ComputeUnit 'unit.000001' state changed to Executing.
2015-10-20 15:44:14,662: radical.repex       : MainProcess                     : Thread-5       : INFO    : ComputeUnit 'unit.000000' state changed to Done.
2015-10-20 15:44:15,818: radical.repex       : MainProcess                     : Thread-5       : INFO    : ComputeUnit 'unit.000002' state changed to Done.
2015-10-20 15:45:02,568: radical.repex       : MainProcess                     : Thread-1       : INFO    : ComputePilot 'pilot.0000' state changed to Failed.
2015-10-20 15:45:02,570: radical.repex       : MainProcess                     : Thread-1       : ERROR   : Pilot error: [<radical.pilot.logentry.Logentry object at 0x7f49bf4f1150>, <radical.pilot.logentry.Logentry object at 0x7f49bf4f1f50>, <radical.pilot.logentry.Logentry object at 0x7f49c12c0790>, <radical.pilot.logentry.Logentry object at 0x7f49bc1fd890>, <radical.pilot.logentry.Logentry object at 0x7f49bc1fdf90>, <radical.pilot.logentry.Logentry object at 0x7f49bc1fd8d0>, <radical.pilot.logentry.Logentry object at 0x7f49bc1fd850>, <radical.pilot.logentry.Logentry object at 0x7f49bc1fdc10>, <radical.pilot.logentry.Logentry object at 0x7f49bc1fdf10>, <radical.pilot.logentry.Logentry object at 0x7f49bc1fd910>]
2015-10-20 15:45:02,570: radical.repex       : MainProcess                     : Thread-1       : ERROR   : RepEx execution FAILED.
2015-10-20 15:45:02,570: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/home/antontre/ram/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 335, in run
    self.call_callbacks(pilot_id, new_state)
  File "/home/antontre/ram/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/home/antontre/ram/lib/python2.7/site-packages/pilot_kernels/pilot_kernel.py", line 73, in pilot_state_cb
    sys.exit(1)
SystemExit: 1
Traceback (most recent call last):
  File "/home/antontre/ram/bin/repex-amber", line 58, in <module>
    pilot_kernel.run_simulation( replicas, pilot_object, session, md_kernel )
  File "/home/antontre/ram/lib/python2.7/site-packages/pilot_kernels/pilot_kernel_pattern_s.py", line 209, in run_simulation
    unit_manager.wait_units()
  File "/home/antontre/ram/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

amber_path = "/tmp/amber14/bin/sander"
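The first traceback above shows `sys.exit(1)` being called inside `pilot_state_cb`, which runs on the pilot manager's controller thread, after which the main thread surfaces a `KeyboardInterrupt` from `wait_units()`. Below is a minimal sketch of one possible alternative pattern, in which the callback only records the failure and the main loop decides how to stop. It does not use or assume any radical.pilot API: `PilotFailure`, `make_state_callback`, `wait_for_units` and `demo` are hypothetical names introduced purely for illustration, and the callback signature is simplified.

```python
import threading
import time


class PilotFailure(RuntimeError):
    """Raised in the main thread when the pilot reports a failed state."""


def make_state_callback(failed_event, failure_info):
    """Build a callback that records a failure instead of calling sys.exit()."""
    def pilot_state_cb(pilot_id, state):
        # Hypothetical, simplified signature: only the id and the new state.
        if state == "Failed":
            failure_info["pilot_id"] = pilot_id
            failed_event.set()
    return pilot_state_cb


def wait_for_units(units_done, failed_event, failure_info, poll=0.05, timeout=5.0):
    """Poll until units finish, the pilot fails, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if failed_event.is_set():
            raise PilotFailure("pilot %s failed" % failure_info.get("pilot_id"))
        if units_done.is_set():
            return True
        time.sleep(poll)
    return False


def demo():
    """Simulate a controller thread delivering a 'Failed' state."""
    failed = threading.Event()
    done = threading.Event()
    info = {}
    cb = make_state_callback(failed, info)
    worker = threading.Thread(target=cb, args=("pilot.0000", "Failed"))
    worker.start()
    worker.join()
    try:
        wait_for_units(done, failed, info)
    except PilotFailure as exc:
        return str(exc)
    return "no failure"
```

With this shape the failure reaches the main thread as an ordinary exception rather than as a `SystemExit` raised on a worker thread, which may make the resulting traceback easier to read.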

antonst commented 8 years ago

When I reran it, it got further but produced this:

2015-10-20 16:01:27,930: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000019' state changed to AgentStagingInputPending.
2015-10-20 16:01:27,931: radical.pilot       : MainProcess                     : Thread-4       : INFO    : [Callback]: unit unit.000021 state on pilot pilot.0000: AgentStagingInputPending.
2015-10-20 16:01:27,931: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000021' state changed to AgentStagingInputPending.
2015-10-20 16:01:27,931: radical.pilot       : MainProcess                     : Thread-4       : INFO    : [Callback]: unit unit.000020 state on pilot pilot.0000: AgentStagingInputPending.
2015-10-20 16:01:27,931: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000020' state changed to AgentStagingInputPending.
2015-10-20 16:01:28,463: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: INFO    : Performing periodical health check for pilot.0000 (SAGA job id [fork://localhost/]-[753856.0])
2015-10-20 16:01:28,571: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: WARNING : could not reconnect to pilot for state check (failed to list jobs: (127)(LIST
sh: LIST: command not found
) (/home/antontre/ram/lib/python2.7/site-packages/saga/adaptors/shell/shell_job.py +911 (list)  :  % (ret, out))))
2015-10-20 16:01:28,571: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: DEBUG   : giving up after 10 attempts
2015-10-20 16:01:28,598: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: DEBUG   : Could not reconnect to pilot pilot.0000 multiple times - giving up
2015-10-20 16:01:28,598: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: WARNING : pilot pilot.0000 declared dead
2015-10-20 16:01:29,118: radical.pilot       : MainProcess                     : Thread-1       : INFO    : ComputePilot 'pilot.0000' state changed from 'Active' to 'Failed'.
2015-10-20 16:01:29,119: radical.pilot       : MainProcess                     : Thread-1       : INFO    : [Callback]: ComputePilot 'pilot.0000' state: Failed.
2015-10-20 16:01:29,123: radical.pilot       : MainProcess                     : Thread-4       : INFO    : [Callback]: unit unit.000018 state on pilot pilot.0000: Executing.
2015-10-20 16:01:29,120: radical.repex       : MainProcess                     : Thread-1       : INFO    : ComputePilot 'pilot.0000' state changed to Failed.
2015-10-20 16:01:29,124: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000018' state changed to Executing.
2015-10-20 16:01:29,125: radical.pilot       : MainProcess                     : Thread-4       : INFO    : [Callback]: unit unit.000019 state on pilot pilot.0000: Executing.
2015-10-20 16:01:29,124: radical.repex       : MainProcess                     : Thread-1       : ERROR   : Pilot error: [<radical.pilot.logentry.Logentry object at 0x7f76d8399150>, <radical.pilot.logentry.Logentry object at 0x7f76d83ba610>, <radical.pilot.logentry.Logentry object at 0x7f76d83ba6d0>, <radical.pilot.logentry.Logentry object at 0x7f76d83bafd0>, <radical.pilot.logentry.Logentry object at 0x7f76d83ba550>, <radical.pilot.logentry.Logentry object at 0x7f76d83ba590>, <radical.pilot.logentry.Logentry object at 0x7f76d83ba710>, <radical.pilot.logentry.Logentry object at 0x7f76d83baf10>, <radical.pilot.logentry.Logentry object at 0x7f76d83ba690>, <radical.pilot.logentry.Logentry object at 0x7f76d83ba650>]
2015-10-20 16:01:29,125: radical.repex       : MainProcess                     : Thread-1       : ERROR   : RepEx execution FAILED.
2015-10-20 16:01:29,125: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000019' state changed to Executing.
2015-10-20 16:01:29,125: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : pilot manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/home/antontre/ram/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 335, in run
    self.call_callbacks(pilot_id, new_state)
  File "/home/antontre/ram/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/home/antontre/ram/lib/python2.7/site-packages/pilot_kernels/pilot_kernel.py", line 73, in pilot_state_cb
    sys.exit(1)
SystemExit: 1
2015-10-20 16:01:29,127: radical.pilot       : MainProcess                     : Thread-4       : INFO    : [Callback]: unit unit.000021 state on pilot pilot.0000: Allocating.
2015-10-20 16:01:29,127: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : pworker Thread-1 stops   launcher PilotLauncherWorker-1
2015-10-20 16:01:29,128: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000021' state changed to Allocating.
2015-10-20 16:01:29,129: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : launcher PilotLauncherWorker-1 stopping
2015-10-20 16:01:29,130: radical.pilot       : MainProcess                     : Thread-4       : INFO    : [Callback]: unit unit.000020 state on pilot pilot.0000: Allocating.
2015-10-20 16:01:29,130: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000020' state changed to Allocating.
Traceback (most recent call last):
  File "/home/antontre/ram/bin/repex-amber", line 58, in <module>
    pilot_kernel.run_simulation( replicas, pilot_object, session, md_kernel )
  File "/home/antontre/ram/lib/python2.7/site-packages/pilot_kernels/pilot_kernel_pattern_s.py", line 209, in run_simulation
    unit_manager.wait_units()
  File "/home/antontre/ram/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 697, in wait_units
    time.sleep (0.5)
KeyboardInterrupt
2015-10-20 16:01:29,626: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : launcher PilotLauncherWorker-1 stopped
2015-10-20 16:01:29,627: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : pworker Thread-1 stopped launcher PilotLauncherWorker-1
2015-10-20 16:02:31,153: radical.pilot       : MainProcess                     : Thread-4       : INFO    : [Callback]: unit unit.000019 state on pilot pilot.0000: AgentStagingOutputPending.
2015-10-20 16:02:31,155: radical.repex       : MainProcess                     : Thread-4       : INFO    : ComputeUnit 'unit.000019' state changed to AgentStagingOutputPending.

Strangely enough, the simulation continued and CUs were still executing even after this point.
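To check which units still changed state after the pilot was reported as failed, the `radical.repex` lines in a log like the one above can be filtered by timestamp. This is a rough sketch that assumes the line layout shown in this issue; the regular expression and the helper names `parse_state_changes` and `unit_events_after_pilot_failure` are illustrative and not part of RepEx or radical.pilot.

```python
import re
from datetime import datetime

# Matches radical.repex lines of the form:
#   <timestamp>: radical.repex : ... : ComputeUnit 'unit.000018' state changed to Executing.
STATE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}):\s*radical\.repex\s*:"
    r".*?(?P<kind>ComputeUnit|ComputePilot) '(?P<name>[^']+)' "
    r"state changed to (?P<state>\w+)\."
)


def parse_state_changes(lines):
    """Return (timestamp, kind, name, state) tuples for matching log lines."""
    events = []
    for line in lines:
        match = STATE_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append(
                (ts, match.group("kind"), match.group("name"), match.group("state"))
            )
    return events


def unit_events_after_pilot_failure(lines):
    """List (unit, state) changes logged after the first pilot failure."""
    events = parse_state_changes(lines)
    failures = [
        ts for ts, kind, _, state in events
        if kind == "ComputePilot" and state == "Failed"
    ]
    if not failures:
        return []
    first_failure = min(failures)
    return [
        (name, state) for ts, kind, name, state in events
        if kind == "ComputeUnit" and ts > first_failure
    ]
```

Calling `unit_events_after_pilot_failure(open(path))` on a saved log (where `path` is wherever the output was captured) lists the unit transitions that were logged after the failure.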

andre-merzky commented 8 years ago

Guys, could it be that the pilot is simply timing out?

antonst commented 8 years ago

Closing, as this is no longer relevant.