jdakka / radical.htbac

0 stars 0 forks source link

RP stalls after submit #1

Open kristofarkas opened 6 years ago

kristofarkas commented 6 years ago

When submitting a job, the script just stalls after the submit 1 pilot(s) message. I haven't experienced this before, but now it happens every time. The same script worked before.

The logs: The Running... is printed out right before calling AppManager.run().

TIES pipeline has 6 stages. Tasks counts: [65, 65, 65, 65, 5, 1]
TIES pipeline has 6 stages. Tasks counts: [65, 65, 65, 65, 5, 1]
Running on 4160 cores.
2017-12-06 09:08:49,246: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.14 |Anaconda, Inc.| (default, Oct 5 2017, 02:28:52) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
2017-12-06 09:08:49,246: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    :                      pid: 32705
2017-12-06 09:08:49,246: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-12-06 09:08:49,249: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource Manager initialized
2017-12-06 09:08:49,250: radical.entk.resource_manager: MainProcess                     : MainThread     : DEBUG   : Validating resource description
2017-12-06 09:08:49,250: radical.entk.resource_manager: MainProcess                     : MainThread     : INFO    : Resource description validated
2017-12-06 09:08:49,250: radical.entk.resource_manager: MainProcess                     : MainThread     : DEBUG   : Populating resource manager object
2017-12-06 09:08:49,250: radical.entk.resource_manager: MainProcess                     : MainThread     : DEBUG   : Resource manager population successful
2017-12-06 09:08:49,251: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.14 |Anaconda, Inc.| (default, Oct 5 2017, 02:28:52) [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
2017-12-06 09:08:49,251: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    :                      pid: 32705
2017-12-06 09:08:49,251: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-12-06 09:08:49,254: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Application Manager initialized
2017-12-06 09:08:49,254: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Workflow assigned to Application Manager
Running...
2017-12-06 09:08:49,254: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Setting up RabbitMQ system
2017-12-06 09:08:49,255: radical.entk.appmanager: MainProcess                     : MainThread     : DEBUG   : Setting up mq connection and channel
2017-12-06 09:08:49,264: radical.entk.appmanager: MainProcess                     : MainThread     : DEBUG   : Connection and channel setup successful
2017-12-06 09:08:49,264: radical.entk.appmanager: MainProcess                     : MainThread     : DEBUG   : Setting up all exchanges and queues
2017-12-06 09:08:49,535: radical.entk.appmanager: MainProcess                     : MainThread     : DEBUG   : All exchanges and queues are setup
2017-12-06 09:08:49,535: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Starting resource request submission
new session: [rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017506.0011]\
database   : [mongodb://radical:fg*2GT3^eB@crick.chem.ucl.ac.uk:27017/admin]  ok
create pilot manager                                                          ok
submit 1 pilot(s)

Could this be related to the RabbitMQ or MongoDB server? Or what else?

andre-merzky commented 6 years ago

Can you please attach the session sandbox (./rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017506.0011/) from the client side, ie. from the directory you run your script in? Thanks!

kristofarkas commented 6 years ago

rp.session.tar.gz

andre-merzky commented 6 years ago

I think the session gets hung up trying to connect to BlueWaters, and that call seems to never return -- see end of rp.session.julian.chem.ucl.ac.uk.kristof.farkas-pall.017506.0011.log. Does gsissh get onto the machine from command line? Maybe your gsi proxy is expired? We use pretty long timeouts nowadays for the connection setup - but if that never returns I would consider that an error, and would deserve a ticket in radical.pilot. If so, please rerun with the following settings:

export RADICAL_SAGA_PTY_VERBOSE=DEBUG
export RADICAL_SAGA_PTY_LOG_TGT=pty.log

This will create an additional log on the GSI connection layer - please attach that if this continues to be a problem.

Thanks!

kristofarkas commented 6 years ago

A restart of the computer seems to have fixed the problem. The certificates are all set up correctly btw and gsissh worked from the terminal.

andre-merzky commented 6 years ago

Hmm, we really don't want to end up in a mode were we'd recommend a reboot to fix RP problems :-P If you happen to stumble over this again, please send an PTY log (as described above), and I'll track it down. Either way though, I am glad you are back in business :)

kristofarkas commented 6 years ago

The pty.log from a stalled session:

2017-12-06 10:54:59,103: radical.saga.pty    : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.14 |Anaconda, Inc.| (default, Oct 5 2017, 0
2017-12-06 10:54:59,104: radical.saga.pty    : MainProcess                     : MainThread     : INFO    :                      pid: 9484
2017-12-06 10:54:59,104: radical.saga.pty    : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2017-12-06 10:54:59,104: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : PTYShell init <saga.utils.pty_shell.PTYShell object at 0x114c286d0>
2017-12-06 10:54:59,108: radical.saga.pty    : MainProcess                     : MainThread     : INFO    : PTY prompt pattern: [\$#%>\]]\s*$
2017-12-06 10:54:59,109: radical.saga.pty    : MainProcess                     : MainThread     : INFO    : ssh copy  mode set to 'sftp'
2017-12-06 10:54:59,109: radical.saga.pty    : MainProcess                     : MainThread     : INFO    : ssh share mode set to 'auto'
2017-12-06 10:54:59,109: radical.saga.pty    : MainProcess                     : MainThread     : INFO    : ssh timeout    set to '2000.0'
2017-12-06 10:54:59,111: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : SSH Connection M_FLAGS: -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_kr
2017-12-06 10:54:59,111: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : SSH Connection S_FLAGS: -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_kr
2017-12-06 10:54:59,111: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : open master pty for [ssh] [bw.ncsa.illinois.edu] kristof.farkas-pall: /usr/bi
2017-12-06 10:54:59,112: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : PTYProcess init <saga.utils.pty_process.PTYProcess object at 0x114c58d50>
2017-12-06 10:54:59,113: radical.saga.pty    : MainProcess                     : MainThread     : INFO    : running: /usr/bin/env TERM=vt100 X509_USER_PROXY=/tmp/x509up_u501 /Library/Gl
2017-12-06 10:54:59,118: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   51] ( export PROMPT_COMMAND='' PS1='$' ; set prompt='$'\n)
2017-12-06 10:54:59,249: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : read : [  200] [   36] (/Library/Globus/etc/ssh not found.\n)
2017-12-06 10:55:00,128: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:00,128: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 1\n)
2017-12-06 10:55:01,139: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:01,139: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 2\n)
2017-12-06 10:55:02,146: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:02,146: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 3\n)
2017-12-06 10:55:03,156: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:03,156: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 4\n)
2017-12-06 10:55:04,158: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:04,158: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 5\n)
2017-12-06 10:55:05,159: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:05,160: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 6\n)
2017-12-06 10:55:06,168: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:06,169: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 7\n)
2017-12-06 10:55:07,170: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:07,170: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 8\n)
2017-12-06 10:55:08,173: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:08,173: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   28] ( printf 'HELLO_%d_SAGA\n' 9\n)
2017-12-06 10:55:09,185: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
2017-12-06 10:55:09,185: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   29] ( printf 'HELLO_%d_SAGA\n' 10\n)
2017-12-06 10:55:10,197: radical.saga.pty    : MainProcess                     : MainThread     : DEBUG   : write: [  200] [   69] ( export PROMPT_COMMAND='' PS1='$' > /dev/null 2>&1 ||
kristofarkas commented 6 years ago

These scripts are also running in the background; can that have any influence on this? Like this: python script.py > output 2>&1 & disown?

andre-merzky commented 6 years ago

the log is not complete, unfortunately (the lines are really long I'm afraid, and are truncated above).

Running in the background should not make a difference - but the output file may contain some additional messages.

Thanks!

kristofarkas commented 6 years ago

Sorry about that. Here is the full log: pty.log

This problem now happens after the first successful run (after a restart of the computer).

Update: deleting the certificate from /tmp/ and getting a new one with myproxy-logon also solves the problem.