Closed mingtaiha closed 8 years ago
Hi Ming,
you should see the bootstrapper output in the pilot sandbox on the submit host -- it is currently being streamed while the pilot executes. Can you please check whats up, or send it?
Thanks!
Everything seems to be installing successfully, and there isn't anything in the errors. Attached are the files for reference.
Do you have a state diagram of the new implementation?
Hmm, it seems the pilot completed -- so the state change was not picked up? Can you please send pilot.0000.log.tgz from the pilot sandbox, and also the log from the client side? What code did you run?
Re state model: that is still the set of figures here: https://drive.google.com/drive/u/0/folders/0BxhScNTfghRIfjdWRVMwb054SkJXVzhOTUNWTzM4Q2xHeE5kSE4yQlVfcExleWRrY3NKQlk
Here's the tgz from the pilot sandbox and the JSON file (which I'm guessing is the log from the client side) bundled into one zip. The code I am running is a basic RP script, nothing fancier than the tutorial scripts.
Thanks for the zip! The error there shows that you used a version of RU which was out of sync with RP -- can you update that, please?
With 'client log' I meant the logger output with RADICAL_PILOT_VERBOSE=DEBUG.
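For reference, the variable has to be present in the environment of the client process before RP sets up its loggers; exporting it in the shell before running the script is the usual way. A minimal sketch of that, using a trivial child command as a stand-in for an actual RP script:

```python
import os
import subprocess
import sys

# RADICAL_PILOT_VERBOSE must be set before the RP client process starts,
# e.g. `export RADICAL_PILOT_VERBOSE=DEBUG` in the shell. The same effect
# from Python: pass the variable in the child process environment.
env = dict(os.environ, RADICAL_PILOT_VERBOSE="DEBUG")
out = subprocess.check_output(
    [sys.executable, "-c",
     "import os; print(os.environ['RADICAL_PILOT_VERBOSE'])"],
    env=env)
print(out.decode().strip())
```

Setting the variable inside the script after RP is imported is generally too late, which is one common reason for only seeing INFO-level messages.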
This is a bit odd, but I can't seem to run RP with RADICAL_PILOT_VERBOSE=DEBUG and get the logger output.
If you are running one of the examples, you may want to look out for the environment settings near the top of the files ;)
That's not the problem. I've actually run one of the example scripts, and the only client-side messages I got were INFO level. Even when I set RADICAL_PILOT_VERBOSE=DEBUG in the example, I do not get debug messages. This problem does not occur in the devel branch, however.
I got the log messages. I combed through all of them, but I didn't see anything that stood out. In pmgr.0000.log, the pilot went to the ACTIVE state before moving to the CANCELLED state within a few seconds. In umgr.000.log, the unit went from the AGENT_STAGING_INPUT_PENDING state directly to FAILED.
Please let me know if there's anything else I can do to help.
Do you see why the unit fails? Could it be that the unit failure triggers session termination and thus pilot cancellation?
There were no problems that I could see. From the callback timestamps, the unit fails almost as soon as the pilot becomes active. The bootstrap_1.out did not show any installation problems.
Here's the logs if you want to have a look.
Would you mind also sending the agent logs? Thanks!
Is that possible on OSG? The agent logs are on the remote machine, but I don't think I have a way of going to the remote machine designated to run my script.
From the top of the ticket:
Hi Ming,
you should see the bootstrapper output in the pilot sandbox on the submit host -- it is currently being streamed while the pilot executes. Can you please check whats up, or send it?
When the pilot manages to terminate cleanly, you might also get some logfiles staged back. Finally, you can in some cases access the remote pilot sandbox via:
http://research.cs.wisc.edu/htcondor/manual/current/condor_ssh_to_job.html
but that only works while the pilot is still alive I believe. So, that is something you can try to do before it disappears.
I managed to condor_ssh_to_job to a node, but I had no idea where to look for the RP sandbox. I landed in a directory where I only saw a .proxy file, which was a set of public/private keys. I tried to use the PID of the job to find the sandbox but was unsuccessful. I also may have triggered a large number of Permission Denied errors :P It would be helpful to know how and where the sandbox is created on OSG.
I'll keep trying, but I currently don't think I can get you the agent logs.
If you use the condor_ssh_to_job command and it works (i.e. does not give an error), you should end up in the sandbox. So, no logfiles... Hmmm... But you should see the bootstrapper output in the sandbox on the submission host, right?
The main issue is that I can't find the sandbox. Without the sandbox, I don't actually know where to start looking.
Let's rewind: when submitting to OSG, there are three hosts involved: your local machine (which runs the RP client code), the submit host (such as xd-login.opensciencegrid.org if your pilot resource is osg.xsede-virt-clust), and the target node, where OSG will finally execute the agent.
The first one is clear I think; that has the usual setup in terms of logging etc. The submit host will have the sandboxes in the usual location ($HOME/radical.pilot.sandbox/<session_id>/<pilot_id>). The target node is tricky: condor will assign the sandbox, and we'll sometimes have access to it via condor_ssh_to_job, and sometimes not.
Please set the pilot description's cleanup flag to False, so that the sandbox does not get cleaned out after execution. But during execution, you should always see the sandbox.
If that is still not the case, would you mind sending me the exact script and command line you are using to test? Thanks!
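To make the suggestion above concrete, here is a sketch of the relevant pilot description fields, written as a plain dict purely for illustration (in an actual RP script these would be set as attributes on the pilot description object; check the attribute names against your installed RP version):

```python
# Illustration only: the pilot description fields relevant to this
# ticket, shown as a plain dict rather than the real RP API object.
pdesc = {
    "resource": "osg.xsede-virt-clust",  # pilot resource, as above
    "runtime":  10,                      # runtime in minutes
    "cores":    1,
    "cleanup":  False,                   # keep the remote sandbox
}
print(pdesc["cleanup"])
```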
I'll give the script a run with cleanup=False.
synapse_test.py.txt
In the meantime, here's the script. I have several pilot descriptions hard-coded, but only use the OSG pdesc.
EDIT: Grammar
The log files as mentioned.
Quick note on what was discussed on the call: pilot.cancel() will, at the moment, cause a hard termination of the pilot on OSG, which will not allow it to collect log files and profiles. Please let the pilot terminate on its own to get logfiles, for example by setting its runtime to 10min or something, and, aehm, waiting ... ;)
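The difference between the two termination paths, sketched with a mock object (illustration only; on the real RP pilot handle the calls are `wait()` and `cancel()`):

```python
class MockPilot:
    """Stand-in for an RP pilot handle, to illustrate the two paths."""
    def __init__(self):
        self.logs_staged = False

    def wait(self):
        # Natural termination: the pilot hits its runtime limit, shuts
        # down cleanly, and log files / profiles get staged back.
        self.logs_staged = True
        return "DONE"

    def cancel(self):
        # Hard termination on OSG: condor removes the job immediately,
        # so nothing gets staged back.
        self.logs_staged = False
        return "CANCELED"

pilot = MockPilot()
print(pilot.wait(), pilot.logs_staged)
```

So with a short runtime, waiting for the pilot to finish on its own is the path that preserves the logs.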
I can run jobs using RP, but on feature/split_module_2 instead of experiment/aimes, and am able to run dummy workloads like which python. However, I do encounter the following error. I don't think it is a problem in the RP or SAGA layer, and I'm not sure whether there is anything we can do about it:
PYTHON: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/python
PIP : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/pip
PYTHON INTERPRETER: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/python
PYTHON_VERSION : 2.7
VE_MOD_PREFIX : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages
PIP installer : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/pip
PIP version : pip 6.0.8 from /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages (python 2.7)
activated virtenv
VIRTENV : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login
VE_MOD_PREFIX: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages
RP_MOD_PREFIX: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/rp_install/lib/python2.7/site-packages
PYTHONPATH : /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/lib/python2.7/site-packages:
# -------------------------------------------------------------------
#
# update setuptools
# cmd: /wntmp/condor/execute/dir_3504364/glide_Clo2a4/execute/dir_2988352/ve_xd-login/bin/pip install --upgrade setuptools==0.6c11
#
Collecting setuptools==0.6c11
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(101, 'Network is unreachable'))': /simple/setuptools/
This looks like a site without outbound connectivity. Which site was it, do you know?
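The repeated pip retries with error 101 ('Network is unreachable') point in that direction. A quick probe one could run on a worker node (e.g. via condor_ssh_to_job) to check for outbound connectivity; the host here is just an example endpoint, not anything OSG-specific:

```python
import socket

def has_outbound(host="pypi.org", port=443, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:  # covers unreachable networks, DNS failures, timeouts
        return False

print(has_outbound())
```

On a node without outbound connectivity this prints False, and pip installs during bootstrapping will fail the way shown in the log above.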
I'm not sure which site it was. I'll keep an eye out. We can close this issue, seeing as the experiment/aimes branch, derived from the feature/split_module_2 branch, can execute jobs on OSG.
I've been submitting jobs to OSG using the most recent commits of the feature/split_module_2 branch and the feature/osg_optimization branch on SAGA. I submitted a test CU to a pilot to run /bin/date, but after ~10min the script still had not run. From the log files, the pilot has entered the state ACTIVE_PENDING, which I assume is the same as PendingActive from the previous log format. The umgr hangs in the state UMGR_SCHEDULING (for both the round-robin and the backfilling scheduler). On OSG, the submitted job is shown as running (using the command condor_q).
Please let me know if there is any other information you would need. I would try to look under radical.pilot.sandbox, but OSG is not very helpful in that regard.