radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html
Other
54 stars 23 forks source link

Pilot is declared dead while it is still running - its CU are killed #444

Closed mturilli closed 9 years ago

mturilli commented 9 years ago

Scenario: 2-stages workflow of the AIMES demo. The single CU of the second stage fails. From the output - weak indicator - it appears that the pilot where the CU is running is canceled before the CU has the opportunity to terminate.

Session: 547b37db20a6417f39f30ffa

From the pilot-547b37e620a6417f39f30ffe AGENT.STDERR :

2014:11:30 07:32:15 47916  MainThread   radical.pilot.agent   : [DEBUG   ] Created launch_script: /home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016/radical_pilot_cu_launch_script-M6AN7Y.sh
2014:11:30 07:32:15 47916  MainThread   radical.pilot.agent   : [INFO    ] Launching task 547b387520a6417f39f31016 via /usr/bin/ssh -o StrictHostKeyChecking=no trestles-4-19 /home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016/radical_pilot_cu_launch_script-M6AN7Y.sh in /home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016
2014:11:30 07:32:35 47911  Thread-1     radical.pilot.agent   : [INFO    ] Received Cancel Pilot command.
2014:11:30 07:32:35 47911  Thread-1     radical.pilot.agent   : [WARNING ] CANCEL received. Terminating.

From the unit-547b387520a6417f39f31016 STDERR:

Killed by signal 15.

From debug output:

2014:11:30 15:30:00 radical.pilot.MainProcess: [INFO    ] Launching ComputePilot {u'state': u'PendingLaunch', u'commands': [], u'description': {u'project': u'unc102', u'resource': u'xsede.trestles', u'queue': None, u'sandbox': None, u'cleanup': True, u'pilot_agent_priv': None, u'access_schema': None, u'memory': None, u'cores': 5, u'runtime': 15}, u'sagajobid': None, u'started': None, u'cores_per_node': None, u'output_transfer_started': None, u'sandbox': u'sftp://trestles.sdsc.xsede.org/home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/', u'submitted': datetime.datetime(2014, 11, 30, 15, 29, 45, 601000), u'output_transfer_finished': None, u'finished': None, u'pilotmanager': u'547b37db20a6417f39f30ffb', u'unitmanager': u'547b37f020a6417f39f31001', u'statehistory': [{u'timestamp': datetime.datetime(2014, 11, 30, 15, 29, 45, 600000), u'state': u'PendingLaunch'}], u'heartbeat': None, u'input_transfer_started': None, u'_id': ObjectId('547b37e620a6417f39f30ffe'), u'input_transfer_finished': None, u'nodes': None, u'log': []}

2014:11:30 15:30:00 radical.pilot.MainProcess: [INFO    ] ComputePilot '547b37e620a6417f39f30ffe' state changed from 'PendingLaunch' to 'Launching'.

2014:11:30 15:30:27 radical.pilot.MainProcess: [INFO    ] ComputePilot '547b37e620a6417f39f30ffe' state changed from 'Launching' to 'PendingActive'.

2014:11:30 15:30:42 radical.pilot.MainProcess: [INFO    ] pilot 547b37e620a6417f39f30ffe *assumed* alive and well (1)

2014:11:30 15:31:18 radical.pilot.MainProcess: [INFO    ] ComputePilot '547b37e620a6417f39f30ffe' state changed from 'PendingActive' to 'Active'.

[some units are scheduled and executed on the pilot]

2014:11:30 15:31:35 radical.pilot.MainProcess: [INFO    ] pilot 547b37e620a6417f39f30ffe *assumed* alive and well (2)

[some more units are scheduled and executed on the pilot]

2014:11:30 15:32:05 radical.pilot.MainProcess: [INFO    ] Creating ComputeUnit sandbox directory sftp://trestles.sdsc.xsede.org/home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe//unit-547b387520a6417f39f31016.

2014:11:30 15:32:05 32569  InputFileTransferWorker-2 saga.ShellFile        : [DEBUG   ] run_sync: mkdir -p / && cd / &&  mkdir -p '/home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016'

2014:11:30 15:32:05 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '547b387520a6417f39f31016' state changed from 'PendingInputStaging' to 'StagingInput'.

2014:11:30 15:32:05 radical.pilot.MainProcess: [DEBUG   ] Transferring input file file://localhost//home/mturilli/AIMES_demo_SC2014/Stage_1_Output/Stage_1_Output_0_1 -> sftp://trestles.sdsc.xsede.org/home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe//unit-547b387520a6417f39f31016/Stage_1_Output/Stage_1_Output_0_1
2014:11:30 15:32:05 radical.pilot.MainProcess: [DEBUG   ] saga.fs.File ('file://localhost//home/mturilli/AIMES_demo_SC2014/Stage_1_Output/Stage_1_Output_0_1')
2014:11:30 15:32:05 32569  InputFileTransferWorker-2 saga.ShellFile        : [DEBUG   ] run_sync: mkdir -p /home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016/Stage_1_Output/
2014:11:30 15:32:05 radical.pilot.MainProcess: [DEBUG   ] write: [   32] [  212] (mput  "//home/mturilli/AIMES_demo_SC2014/Stage_1_Output/Stage_1_Output_0_1" "/home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016/Stage_1_Output/Stage_1_Output_0_1" \n\n)
2014:11:30 15:32:05 radical.pilot.MainProcess: [DEBUG   ] read : [   32] [  216] (mput  "//home/mturilli/AIMES_demo_SC2014/Stage_1_Output/Stage_1_Output_0_1" "/home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016/Stage_1_Output/Stage_1_Output_0_1" \n)
2014:11:30 15:32:05 radical.pilot.MainProcess: [DEBUG   ] read : [   32] [  214] (Uploading //home/mturilli/AIMES_demo_SC2014/Stage_1_Output/Stage_1_Output_0_1 to /home/mturilli/radical.pilot.sandbox/pilot-547b37e620a6417f39f30ffe/unit-547b387520a6417f39f31016/Stage_1_Output/Stage_1_Output_0_1\n)
2014:11:30 15:32:05 radical.pilot.MainProcess: [DEBUG   ] read : [   32] [   79] (//home/mturilli/AIMES_demo_SC2014/Stage_1_Out   0%    0     0.0KB/s   --:-- ETA)
2014:11:30 15:32:05 radical.pilot.MainProcess: [DEBUG   ] read : [   32] [   81] (//home/mturilli/AIMES_demo_SC2014/Stage_1_Out 100%   20KB  19.5KB/s   00:00    \n)
2014:11:30 15:32:06 radical.pilot.MainProcess: [DEBUG   ] read : [   32] [   14] (sftp> \nsftp> )
2014:11:30 15:32:06 radical.pilot.MainProcess: [DEBUG   ] copy done: ['mput', 'Uploading', '//home/mturilli/AIMES_demo_SC2014/Stage_1_Out', 'sftp>']

[Transfer all 20 input files successfully]

2014:11:30 15:32:35 radical.pilot.MainProcess: [INFO    ] Performing periodical health check for 547b37e620a6417f39f30ffe (SAGA job id [pbs+ssh://trestles.sdsc.xsede.org/]-[2360353])

2014:11:30 15:32:35 32569  PilotLauncherWorker-1 saga.PTYShell         : [DEBUG   ] run_sync: unset GREP_OPTIONS; /opt/torque/bin/qstat -f1 2360353 | grep -E -i '(job_state)|(exec_host)|(exit_status)|(ctime)|(start_time)|(comp_time)|(stime)|(qtime)|(mtime)'
2014:11:30 15:32:35 radical.pilot.MainProcess: [DEBUG   ] write: [   42] [  164] (unset GREP_OPTIONS; /opt/torque/bin/qstat -f1 2360353 | grep -E -i '(job_state)|(exec_host)|(exit_status)|(ctime)|(start_time)|(comp_time)|(stime)|(qtime)|(mtime)'\n)
2014:11:30 15:32:35 radical.pilot.MainProcess: [DEBUG   ] read : [   42] [  273] (    job_state = R\n    ctime = Sun Nov 30 07:30:04 2014\n    exec_host = trestles-4-19/0+trestles-4-19/1+trestles-4-19/2+trestles-4-19/3+trestles-4-19/4\n    mtime = Sun Nov 30 07:30:48 2014\n    qtime = Sun Nov 30 07:30:27 2014\n    start_time = Sun Nov 30 07:30:51 2014\n)
2014:11:30 15:32:35 radical.pilot.MainProcess: [DEBUG   ] read : [   42] [   10] (PROMPT-0->)
2014:11:30 15:32:35 radical.pilot.MainProcess: [WARNING ] could not reconnect to pilot for state check (global name 'curr_info' is not defined)
2014:11:30 15:32:35 radical.pilot.MainProcess: [ERROR   ] giving up after 10 attempts
2014:11:30 15:32:35 radical.pilot.MainProcess: [ERROR   ] Could not reconnect to pilot 547b37e620a6417f39f30ffe multiple times - giving up
2014:11:30 15:32:35 radical.pilot.MainProcess: [ERROR   ] pilot 547b37e620a6417f39f30ffe declared dead

2014:11:30 15:32:35 radical.pilot.MainProcess: [INFO    ] ComputePilot '547b37e620a6417f39f30ffe' state changed from 'Active' to 'Failed'.
2014:11:30 15:32:35 radical.pilot.MainProcess: [INFO    ] RUN ComputeUnit '547b387520a6417f39f31016' state changed from 'Executing' to 'Failed'.
2014:11:30 15:32:35 radical.pilot.MainProcess: [INFO    ] [SchedulerCallback]: Computeunit 547b387520a6417f39f31016 changed to Failed
2014:11:30 15:32:35 radical.pilot.MainProcess: [WARNING ] cannot handle unit 547b387520a6417f39f31016 cb for pilot 547b37e620a6417f39f30ffe (pilot is gone)
2014:11:30 15:32:35 radical.pilot.MainProcess: [ERROR   ] error in unit callback for backfiller (u'547b37e620a6417f39f30ffe') - ignored

Full log available on request.

andre-merzky commented 9 years ago

Its likely that this line is responsible:

2014:11:30 15:32:35 radical.pilot.MainProcess: [WARNING ] could not reconnect to pilot for state check (global name 'curr_info' is not defined)

That pointed to an error on the saga level job monitoring thread -- its now fixed in the SAGA devel branch.