radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Units with no input/output fail #947

Closed mturilli closed 8 years ago

mturilli commented 8 years ago

RP 0.38 installed from pip

Running a bag of 8 tasks (1 core, 1 minute each) on Stampede. The kernel of each task is the skeleton executable. I successfully tested the execution line of each CU on Stampede. The CUs do not fail, and indeed the STDERR of each CU is empty.

When staging directives are specified, the CUs execute successfully.

Example of a CU description:

{
    'kernel': None, 
    'executable': 'task', 
    'name': 'Stage_1_Stage_1_0', 
    'restartable': False, 
    'output_staging': [{'source': u'Stage_1_Output/Stage_1_Output_0_1',
                        'flags': 'CreateParents',
                        'target': u'Stage_1_Output/Stage_1_Output_0_1'}],
    'stdout': None, 
    'pre_exec': [], 
    'mpi': False, 
    'environment': None, 
    'cleanup': True, 
    'arguments': ['serial', '1', '30.0', '65536', '65536', '1', '1', '0', 'Stage_1_Input/Stage_1_Input_0_1', 'Stage_1_Output/Stage_1_Output_0_1', '20000'], 
    'stderr': None,
    'cores': 1, 
    'post_exec': None, 
    'input_staging': [{'source': u'Stage_1_Input/Stage_1_Input_0_1',
                       'flags': 'CreateParents',
                       'target': u'Stage_1_Input/Stage_1_Input_0_1'}]
}

Pilot description:

{
    'queue': None, 
    'resource': 'xsede.stampede', 
    'exit_on_error': False, 
    'project': 'TG-MCB090174', 
    'sandbox': None, 
    'cleanup': True, 
    'access_schema': None, 
    'memory': None, 
    'cores': 8, 
    'runtime': 12
}

The following CUs fail instead:

{
   'kernel': None, 
   'executable': 'task', 
   'name': 'Stage_1_Stage_1_0', 
   'restartable': False, 
   'output_staging': None, 
   'stdout': None, 
   'pre_exec': [], 
   'mpi': False, 
   'environment': None, 
   'cleanup': True, 
   'arguments': ['serial', '1', '30.0', '65536', '65536', '0', '0', '0'], 
   'stderr': None, 
   'cores': 1, 
   'post_exec': None, 
   'input_staging': None
}

With (almost) the same pilot description:

{
    'queue': None, 
    'resource': 'xsede.stampede', 
    'exit_on_error': False, 
    'project': 'TG-MCB090174', 
    'sandbox': None, 
    'cleanup': True, 
    'access_schema': None, 
    'memory': None, 
    'cores': 8, 
    'runtime': 11
}

Note the DEBUG output, where I also print STDOUT and STDERR for each CU, confirming that no error is reported:

2016-01-21 12:20:29,682: radical.pilot       : MainProcess                     : PilotLauncherWorker-1: INFO    : pilot pilot.0000 seems alive and well
2016-01-21 12:20:33,807: radical.pilot       : MainProcess                     : Thread-1       : INFO    : ComputePilot 'pilot.0000' state changed from 'PendingActive' to 'Active'.
2016-01-21 12:20:33,808: radical.pilot       : MainProcess                     : Thread-1       : INFO    : [Callback]: ComputePilot 'pilot.0000' state: Active.
2016-01-21 12:20:48,286: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000003 state on pilot pilot.0000: Executing.
2016-01-21 12:20:48,286: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000002 state on pilot pilot.0000: Executing.
2016-01-21 12:20:48,286: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000001 state on pilot pilot.0000: Executing.
2016-01-21 12:20:48,286: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000000 state on pilot pilot.0000: Executing.
2016-01-21 12:20:48,286: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000007 state on pilot pilot.0000: Executing.
2016-01-21 12:20:48,286: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000006 state on pilot pilot.0000: Executing.
2016-01-21 12:20:48,287: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000005 state on pilot pilot.0000: Executing.
2016-01-21 12:20:48,287: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000004 state on pilot pilot.0000: Executing.
2016-01-21 12:21:18,450: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000003 state on pilot pilot.0000: Failed.
'unit.000003' stderr: .
'unit.000003' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:20,050: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000002 state on pilot pilot.0000: AgentStagingOutputPending.
2016-01-21 12:21:20,050: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000001 state on pilot pilot.0000: AgentStagingOutputPending.
2016-01-21 12:21:20,050: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000000 state on pilot pilot.0000: Failed.
'unit.000000' stderr: .
'unit.000000' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:20,076: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000007 state on pilot pilot.0000: Failed.
'unit.000007' stderr: .
'unit.000007' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:20,100: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000006 state on pilot pilot.0000: Failed.
'unit.000006' stderr: .
'unit.000006' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:20,127: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000005 state on pilot pilot.0000: Failed.
'unit.000005' stderr: .
'unit.000005' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:20,151: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000004 state on pilot pilot.0000: AgentStagingOutputPending.
2016-01-21 12:21:23,170: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000002 state on pilot pilot.0000: Failed.
'unit.000002' stderr: .
'unit.000002' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:23,194: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000001 state on pilot pilot.0000: Failed.
'unit.000001' stderr: .
'unit.000001' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:23,218: radical.pilot       : MainProcess                     : Thread-3       : INFO    : [Callback]: unit unit.000004 state on pilot pilot.0000: Failed.
'unit.000004' stderr: .
'unit.000004' stdout: serial: 9
sleep interval: 30.000000
.
2016-01-21 12:21:23,452: radical.pilot       : MainProcess                     : MainThread     : DEBUG   : session rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024 closing

Andre gained access to my sandbox on Stampede but noticed nothing particularly wrong with the execution. Happy to provide a pointer to the sandbox location if useful.

marksantcroos commented 8 years ago

Fail is a relative term here :-)

What's the return code of the task supposed to be?

I took the liberty of looking into your sandbox (...016821.0012-pilot.0000) and saw:

2016-01-21 06:22:02,261: agent_0.AgentExecutingWatcher_POPEN.0: agent_0.AgentExecutingComponent_POPEN.0: Watcher        : INFO    : Unit unit.000004 has return code 32.

RP rightfully translates that into a Failed state.
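
As a minimal sketch of that translation (this is not RP's actual agent code; the function names and state strings are assumptions made for illustration), a watcher that treats any nonzero return code as a failure could look like this:

```python
import subprocess
import sys


def final_state(return_code):
    """Map a process return code to a unit state (hypothetical helper)."""
    # Convention assumed here: 0 means success, anything else is a failure.
    return "Done" if return_code == 0 else "Failed"


def run_and_classify(argv):
    """Run a command and return (return_code, state)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.returncode, final_state(proc.returncode)


if __name__ == "__main__":
    # A child that prints normally, leaves stderr empty, but exits with 32.
    child = [sys.executable, "-c",
             "import sys; print('serial: 9'); sys.exit(32)"]
    print(run_and_classify(child))
```

The point of the sketch is that an empty STDERR and a sensible STDOUT say nothing about the final state; only the return code does.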

mturilli commented 8 years ago

Here is the radical_pilot_cu_launch_script.sh for a CU:

#!/bin/bash -l

echo script start_script `/work/02855/mturilli/radical.pilot.sandbox/rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024-pilot.0000/gtod` >> /work/02855/mturilli/radical.pilot.sandbox/rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024-pilot.0000/unit.000000/PROF

# Change to working directory for unit
cd /work/02855/mturilli/radical.pilot.sandbox/rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024-pilot.0000/unit.000000
echo script after_cd `/work/02855/mturilli/radical.pilot.sandbox/rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024-pilot.0000/gtod` >> /work/02855/mturilli/radical.pilot.sandbox/rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024-pilot.0000/unit.000000/PROF
# Environment variables
export RP_SESSION_ID=rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_0 RP_SPAWNER_ID=agent_0.AgentExecutingComponent_POPEN.0.child RP_UNIT_ID=unit.000000
# The command to run
task "serial" "1" "30.0" "65536" "65536" "0" "0" "0" 
RETVAL=$?
echo script after_exec `/work/02855/mturilli/radical.pilot.sandbox/rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024-pilot.0000/gtod` >> /work/02855/mturilli/radical.pilot.sandbox/rp.session.Matteos-MacBook-Pro.local.mturilli.016821.0024-pilot.0000/unit.000000/PROF
# Exit the script with the return code from the command
exit $RETVAL

Here is the command executed in my login shell:

$ task "serial" "1" "30.0" "65536" "65536" "0" "0" "0"
serial: 9
sleep interval: 30.000000
$

Both with a pipe and when checking $? after task has exited, the return code is 0.
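
A caveat on the pipe check only, as a hedged sketch: in a POSIX shell without `pipefail`, the status of a pipeline is the status of its last command, so a check through a pipe does not report the return code of the first command. The snippet below assumes a POSIX `sh` is available and uses `(exit 32)` as a stand-in for the task; the helper name is hypothetical.

```python
import subprocess


def shell_status(command):
    """Run a command line through the system shell and return its exit status."""
    return subprocess.run(command, shell=True,
                          stdout=subprocess.DEVNULL).returncode


# Stand-in for the task: a subshell that exits with 32.
direct = shell_status("(exit 32)")
# Same stand-in piped into cat: the pipeline reports cat's status instead.
piped = shell_status("(exit 32) | cat")

print("direct:", direct)
print("piped:", piped)
```

Here `direct` is 32 while `piped` is 0, so only the direct `$?` check is informative about the task itself.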

marksantcroos commented 8 years ago

A quick internet search of "exit code 32" seems to point to FS issues. That would correlate to the empty log files too ...

mturilli commented 8 years ago

Darn (Homer TM)! Is there a way to tell RP to use my home directory, to test whether the issue is with $WORK? Suggestions for different approaches are more than welcome. I wonder, though, why the very same run with some input/output files presents no issues...

marksantcroos commented 8 years ago

I can't explain that difference to be honest, so in that sense it was just a long shot, not an explanation.

mturilli commented 8 years ago

Look at this one for more puzzlement ;)

login2.stampede(24)$ sh
login2.stampede(3)$ task "serial" "1" "30.0" "65536" "65536" "0" "0" "0" 
serial: 9
sleep interval: 30.000000
login2.stampede(4)$ echo $?
32

marksantcroos commented 8 years ago

Ok, so the "task" is broken?

mturilli commented 8 years ago

Well, before it was returning 0. Since I started a new shell, it returns 32. Puzzling.

mturilli commented 8 years ago

The issue was solved by patching task.c from the Skeleton repository, forcing it to exit with code 0 when it finishes executing successfully.