radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Execution hang-up on OSG using feature/split_module_2 #1062

Closed mingtaiha closed 8 years ago

mingtaiha commented 8 years ago

I've been submitting jobs to OSG using the most recent commits of the feature/split_module_2 branch of RP and the feature/osg_optimization branch of SAGA. I submitted a test CU to a pilot to run /bin/date, but after ~10 minutes the script still hangs.

From the log files, the pilot has entered the state ACTIVE_PENDING, which I assume is the same as PendingActive in the previous log format. The umgr hangs in the state UMGR_SCHEDULING (for both the round-robin and the backfilling scheduler). On OSG, the submitted job is shown as running (per condor_q).

Please let me know if there is any other information you would need. I would try to look under radical.pilot.sandbox, but OSG is not very helpful in that regard.

andre-merzky commented 8 years ago

Hi Ming,

you should see the bootstrapper output in the pilot sandbox on the submit host -- it is currently being streamed while the pilot executes. Can you please check what's up, or send it?

Thanks!

mingtaiha commented 8 years ago

Everything seems to be installing successfully, and there isn't anything in the error output. Attached are the files for reference.

Do you have a state diagram of the new implementation?

bootstrap_1.out.txt

andre-merzky commented 8 years ago

Hmm, it seems the pilot completed -- so the state change was not picked up? Can you please send pilot.0000.log.tgz from the pilot sandbox, and also the log from the client side? What code did you run?

Re state model: that is still the set of figures here: https://drive.google.com/drive/u/0/folders/0BxhScNTfghRIfjdWRVMwb054SkJXVzhOTUNWTzM4Q2xHeE5kSE4yQlVfcExleWRrY3NKQlk

mingtaiha commented 8 years ago

Here's the tgz from the pilot sandbox and the JSON file (which I'm guessing is the log from the client side) bundled into one zip. The code I am running is a basic RP script, nothing fancier than the tutorial scripts.

json_log_files.zip

andre-merzky commented 8 years ago

Thanks for the zip! The error there shows that you used a version of RU which was out of sync with RP -- can you update that, please?

With 'client log' I meant the logger output with RADICAL_PILOT_VERBOSE=DEBUG.

mingtaiha commented 8 years ago

This is a bit odd, but I can't seem to run RP with RADICAL_PILOT_VERBOSE=DEBUG and get the logger output.

andre-merzky commented 8 years ago

If you are running one of the examples, you may want to look out for the environment settings near the top of the files ;)
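The hint above is that an example script can pin its own environment settings near the top, silently overriding whatever was exported in the shell. A minimal sketch of that interaction (the variable name is real; the override pattern is a hypothetical illustration, not a quote from any particular example):

```python
import os

# Hypothetical: an example script that sets its own verbosity near the top,
# overriding whatever was exported in the shell before launching it.
os.environ['RADICAL_PILOT_VERBOSE'] = 'INFO'

# Changing (or removing) that line restores the expected debug output:
os.environ['RADICAL_PILOT_VERBOSE'] = 'DEBUG'
print(os.environ['RADICAL_PILOT_VERBOSE'])  # → DEBUG
```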

mingtaiha commented 8 years ago

That's not the problem. I've actually run one of the example scripts, and the only client-side messages I got were at INFO level. Even when I set RADICAL_PILOT_VERBOSE=DEBUG in the example, I do not get debug messages. This problem does not occur on the devel branch, however.

mingtaiha commented 8 years ago

I got the log messages. I combed through all of them, but I didn't see anything that stood out. In pmgr.0000.log, the pilot went to the ACTIVE state before moving to the CANCELLED state within a few seconds. In umgr.000.log, the unit reached AGENT_STAGING_INPUT_PENDING before entering the state FAILED.

Please let me know if there's anything else I can do to help.

andre-merzky commented 8 years ago

Do you see why the unit fails? Could it be that the unit failure triggers session termination and thus pilot cancellation?

mingtaiha commented 8 years ago

There were no problems that I could see. From the callback timestamps, the unit fails almost as quickly as the pilot becomes active. The bootstrap_1.out did not show any installation problems.

Here are the logs if you want to have a look.

rp.session.radical.mingtha.016994.0003.zip

andre-merzky commented 8 years ago

Would you mind also sending the agent logs? Thanks!

mingtaiha commented 8 years ago

Is that possible on OSG? The agent logs are on the remote machine, but I don't think I have a way of going to the remote machine designated to run my script.

andre-merzky commented 8 years ago

From the top of the ticket:

> Hi Ming,
>
> you should see the bootstrapper output in the pilot sandbox on the submit host -- it is currently being streamed while the pilot executes. Can you please check what's up, or send it?

When the pilot manages to terminate cleanly, you might also get some logfiles staged back. Finally, you can in some cases access the remote pilot sandbox via:

http://research.cs.wisc.edu/htcondor/manual/current/condor_ssh_to_job.html

but that only works while the pilot is still alive I believe. So, that is something you can try to do before it disappears.

mingtaiha commented 8 years ago

I managed to condor_ssh_to_job to a node, but I had no idea where to look for the RP sandbox. I landed in a directory where I only saw a .proxy file, which was a set of public/private keys. I tried to use the PID of the job to find the sandbox, but was unsuccessful. I also may have triggered a large number of Permission Denied errors :P It would be helpful to know how and where the sandbox is created on OSG.

I'll keep trying, but I currently don't think I can get you the agent logs.

andre-merzky commented 8 years ago

If you use the condor_ssh_to_job command and it works (i.e. does not give an error), you should end up in the sandbox. So, no logfiles... Hmmm... But you should see the bootstrapper output in the sandbox on the submission host, right?

mingtaiha commented 8 years ago

The main issue is that I can't find the sandbox. Without the sandbox, I don't actually know where to start looking.

andre-merzky commented 8 years ago

Let's rewind: when submitting to OSG, there are three hosts involved: your local machine (which runs the RP client code), the submit host (such as xd-login.opensciencegrid.org if your pilot resource is osg.xsede-virt-clust), and the target node, where OSG will finally execute the agent.

The first one is clear, I think: it has the usual setup in terms of logging etc. The submit host will also have the sandboxes in the usual location ($HOME/radical.pilot.sandbox/<session_id>/<pilot_id>). The target node is tricky: condor will assign the sandbox, and we'll sometimes have access to it via condor_ssh_to_job, and sometimes not.
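For concreteness, a small sketch of where to look on the submit host, following the layout above (the session and pilot IDs below are illustrative placeholders, not the exact ones to use):

```python
import os

# Sandbox layout on the submit host, as described above:
#   $HOME/radical.pilot.sandbox/<session_id>/<pilot_id>
# The IDs here are example values only.
session_id = 'rp.session.radical.mingtha.016994.0003'
pilot_id   = 'pilot.0000'

sandbox = os.path.join(os.path.expanduser('~'),
                       'radical.pilot.sandbox', session_id, pilot_id)
print(sandbox)
```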

Please set the pilot description's cleanup flag to False, so that the sandbox does not get cleaned out after execution. But during execution, you should always see the sandbox.
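A minimal sketch of the relevant pilot description fields, written as a plain dict since attribute names and defaults can differ across RP versions (the resource name is from this thread; the other values are illustrative placeholders):

```python
# Sketch of a pilot description with cleanup disabled, so the remote
# sandbox survives the pilot's execution. Only 'resource' and 'cleanup'
# are taken from the discussion; 'runtime' and 'cores' are placeholders.
pdesc = {
    'resource': 'osg.xsede-virt-clust',  # submits via xd-login.opensciencegrid.org
    'runtime' : 10,                      # minutes; short enough to simply wait out
    'cores'   : 1,
    'cleanup' : False,                   # keep the sandbox after execution
}
print(pdesc['cleanup'])  # → False
```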

If that is still not the case, would you mind sending me the exact script and command line you are using to test? Thanks!

mingtaiha commented 8 years ago

I'll give the script a run with cleanup=False.

In the meantime, here's the script. I have several pilot descriptions hard-coded, but only use the OSG pdesc.

synapse_test.py.txt

EDIT: Grammar

mingtaiha commented 8 years ago

The log files as mentioned.

osg_login_node_logs.zip

andre-merzky commented 8 years ago

Quick note on what was discussed on the call: pilot.cancel() will, at the moment, cause a hard termination of the pilot on OSG, which will not allow it to collect log files and profiles. Please let the pilot terminate on its own to get logfiles, for example by setting its runtime to 10min or something, and, aehm, waiting ... ;)

mingtaiha commented 8 years ago

I can run jobs using RP on feature/split_module_2 (instead of experiment/aimes), and am able to run dummy workloads like which python. However, I do encounter the following error. I don't think it is a problem in the RP or SAGA layer, and I am not sure whether there is anything we can do about it:

PYTHON: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/python
PIP   : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/pip
PYTHON INTERPRETER: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/python
PYTHON_VERSION    : 2.7
VE_MOD_PREFIX     : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages
PIP installer     : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/pip
PIP version       : pip 6.0.8 from /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages (python 2.7)
activated virtenv 
VIRTENV      : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login
VE_MOD_PREFIX: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages
RP_MOD_PREFIX: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/rp_install/lib/python2.7/site-packages
PYTHONPATH   : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages:

# -------------------------------------------------------------------
# 
# update setuptools
# cmd: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/pip install --upgrade setuptools==0.6c11
# 
Collecting setuptools==0.6c11
  Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
marksantcroos commented 8 years ago

> Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/

This looks like a site without outbound connectivity. Which site was it, do you know?
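The repeated "Network is unreachable" retries are consistent with a worker node that has no outbound connectivity. A rough way to probe for that condition independently of pip (the host and port are common HTTPS defaults, not anything OSG-specific):

```python
import socket

def has_outbound_connectivity(host='pypi.org', port=443, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

# On a node like the one in the log above, this would print False.
print(has_outbound_connectivity())
```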

mingtaiha commented 8 years ago

I'm not sure which site it was. I'll keep an eye out. We can close this issue, since the experiment/aimes branch, derived from feature/split_module_2, can execute jobs on OSG.