radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

OSG support works with 'deceiving' fail messages #956

Closed mturilli closed 8 years ago

mturilli commented 8 years ago

RP 0.38RC1

Running `python getting_started_osg.py osg.xsede-virt-clust` executes the CUs successfully, but on shutdown the pilots are reported as Failed. This breaks analysis code that looks for failed pilots. The client log at shutdown:

2016-01-25 19:00:56,798: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 cancels  pilot  pilot.0000
2016-01-25 19:00:56,799: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : pmgr    pmgr.0000 cancels  pilot  pilot.0001
2016-01-25 19:00:56,823: radical.pilot       : MainProcess                     : MainThread     : INFO    : Sent 'COMMAND_CANCEL_PILOT' command to pilots ['pilot.0000', 'pilot.0001'].
2016-01-25 19:00:56,823: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : delay to actively cancel pilot pilot.0000: state Active
2016-01-25 19:00:56,824: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : delay to actively cancel pilot pilot.0001: state Active
2016-01-25 19:00:59,386: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   itransfer InputFileTransferWorker-1
2016-01-25 19:00:59,386: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-1 stopping
2016-01-25 19:00:59,386: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-1 stopped
2016-01-25 19:00:59,386: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped itransfer InputFileTransferWorker-1
2016-01-25 19:00:59,386: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   itransfer InputFileTransferWorker-2
2016-01-25 19:00:59,386: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-2 stopping
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : itransfer InputFileTransferWorker-2 stopped
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped itransfer InputFileTransferWorker-2
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   otransfer OutputFileTransferWorker-1
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-1 stopping
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-1 stopped
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped otransfer OutputFileTransferWorker-1
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stops   otransfer OutputFileTransferWorker-2
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-2 stopping
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : otransfer OutputFileTransferWorker-2 stopped
2016-01-25 19:00:59,387: radical.pilot       : MainProcess                     : Thread-3       : DEBUG   : uworker Thread-3 stopped otransfer OutputFileTransferWorker-2
2016-01-25 19:01:05,856: radical.pilot       : MainProcess                     : Thread-1       : INFO    : ComputePilot 'pilot.0000' state changed from 'Active' to 'Failed'.
[Callback]: ComputePilot 'pilot.0000' state: Failed.
2016-01-25 19:01:05,856: radical.pilot       : MainProcess                     : Thread-1       : DEBUG   : [SchedulerCallback]: ComputePilot pilot.0000 changed to Failed
2016-01-25 19:01:06,967: radical.pilot       : MainProcess                     : Thread-1       : INFO    : ComputePilot 'pilot.0001' state changed from 'Active' to 'Failed'.
[Callback]: ComputePilot 'pilot.0001' state: Failed.

marksantcroos commented 8 years ago

ACK. Note that this also happens with other resources, so as things stand I would be careful about reading too much into the exit state of the pilot.

mturilli commented 8 years ago

ACK. When using multiple pilots on multiple resources, checking the exit state of each pilot quickly becomes critical. Is this going to be addressed in the next release with the agent refactoring?
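
Until this is fixed, analysis code could tell a pilot that went to Failed after a client-side cancel apart from one that failed on its own. The following is only a rough sketch of that idea as a log-parsing workaround. It assumes the client debug log uses the line formats quoted above, and `classify_pilots` is a hypothetical helper, not part of the RP API.

```python
import re

# Assumption: these patterns mirror the client-side log lines quoted in
# this issue; other RP versions may format them differently.
CANCEL_RE = re.compile(r"pmgr\s+\S+\s+cancels\s+pilot\s+(pilot\.\d+)")
STATE_RE = re.compile(
    r"ComputePilot '(pilot\.\d+)' state changed from '(\w+)' to '(\w+)'"
)


def classify_pilots(log_lines):
    """Return {pilot_id: verdict} for every pilot that reached 'Failed'.

    The verdict is 'failed_after_cancel' if a cancel request for that
    pilot appeared earlier in the log, and 'failed' otherwise.
    (Hypothetical helper for illustration only.)
    """
    cancel_requested = set()
    verdicts = {}
    for line in log_lines:
        match = CANCEL_RE.search(line)
        if match:
            # Remember that the client asked to cancel this pilot.
            cancel_requested.add(match.group(1))
            continue
        match = STATE_RE.search(line)
        if match and match.group(3) == "Failed":
            pilot_id = match.group(1)
            verdicts[pilot_id] = (
                "failed_after_cancel"
                if pilot_id in cancel_requested
                else "failed"
            )
    return verdicts
```

Calling `classify_pilots(open("rp.log"))` (the file name is a placeholder) would return one entry per failed pilot, so that only pilots marked 'failed' need to be treated as real failures.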

andre-merzky commented 8 years ago

The February release targets the client refactoring; the agent refactoring is done. That problem is indeed contained on the agent side, though, and I'll try to address it in February.

ibethune commented 8 years ago

Hi RP team, it would be great if this could be addressed ASAP, since we can't benchmark the ExTASY code until RP can shut down cleanly. We don't mind pulling from a devel branch if your release comes later, but a swift fix would be appreciated.

andre-merzky commented 8 years ago

Ack!

andre-merzky commented 8 years ago

Fix has been merged.