radical-cybertools / htbac

High throughput binding affinity calculator
MIT License

ssh tunnel disconnection on Blue Waters #18

Open jdakka opened 6 years ago

jdakka commented 6 years ago

Request/issue: "Long queue times, typically at least a day - previous experience suggests frequent RCT disconnects"

  1. RADICAL-SAGA is the RCT layer responsible for acquiring an ssh prompt on the Blue Waters login node. Set the following environment variable before submitting: export SAGA_PTY_SSH_TIMEOUT=2000

  2. While the job is in the queue, SAGA will periodically check on the job. @andre-merzky, what possible sources of ssh disconnection exist at this stage?

  3. Once the pilot is active, there can be issues where the client loses connection to the agent. @andre-merzky, is this still a possible issue?
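For step 1, the same setting can be applied from inside the Python script that launches the workflow. This is a minimal sketch, assuming the variable is read from the process environment when the ssh connection is set up, so it has to be in place before any RCT module is imported or any session is created. Only the variable name and value come from this issue; everything else is illustrative.

```python
import os

# Equivalent of `export SAGA_PTY_SSH_TIMEOUT=2000` for this process and
# every child process it starts. Set it before importing or starting
# anything from the RADICAL stack (assumption: the value is read from
# the environment at connection setup).
os.environ["SAGA_PTY_SSH_TIMEOUT"] = "2000"

# Sanity check that the value is visible and parses as an integer.
timeout = int(os.environ["SAGA_PTY_SSH_TIMEOUT"])
print(f"SAGA_PTY_SSH_TIMEOUT = {timeout}")
```

Setting the variable in the shell with `export` before starting the script has the same effect and avoids any dependence on import order.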

Note: if you stop the job with a keyboard interrupt, press Ctrl-C only once. Pressing Ctrl-C repeatedly kills the process that is trying to terminate the job cleanly.
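To illustrate why a second Ctrl-C is harmful, here is a generic sketch of how a driver script could protect its own shutdown path. It is not how RCT handles interrupts internally; `cleanup` and `run` are hypothetical names, and `cleanup` only stands in for whatever cancels the pilot and closes the session.

```python
import signal
import time


def cleanup():
    """Hypothetical stand-in for the real shutdown (cancel pilot, close session)."""
    time.sleep(0.1)
    return "terminated cleanly"


def run(workload):
    """Run workload(); on the first Ctrl-C, shut down without further interruption."""
    try:
        workload()
    except KeyboardInterrupt:
        # First Ctrl-C: ignore any further SIGINT while cleaning up, so
        # that repeated Ctrl-C cannot kill the cleanup half way through.
        previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
        try:
            return cleanup()
        finally:
            # Restore the previous handler once cleanup has finished.
            signal.signal(signal.SIGINT, previous)
    return "finished"
```

Without a guard like this, each additional Ctrl-C raises another `KeyboardInterrupt` inside the cleanup code, which is the failure described in the note above. Note that `signal.signal` can only be called from the main thread.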

andre-merzky commented 6 years ago

Note that the SAGA_PTY_SSH_TIMEOUT is only relevant to the ssh handshake - once the ssh connection is up, the value has no effect.

@andre-merzky, what possible sources of ssh disconnection exist at this stage?

Two main sources: network errors (rare) and system settings (frequent). Most unix systems limit the lifetime of an ssh session to 12 or 24 hours; I believe this is the limit we are hitting in scenarios like yours. We have also seen ssh connections drop on bad networks, and those drops are just as fatal, but that is usually only a problem on laptops (which can run out of power, get closed and suspended, etc.).

Once the pilot is active, there can be issues where the client loses connection to the agent. @andre-merzky, is this still a possible issue?

The ssh connection is active over the whole lifetime of the application, in order to watch the pilot state.

Please note that we have two types of connections: ssh and MongoDB. The latter is less prone to interruptions and timeouts, but they can happen there too.
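For discussion only, the following sketch shows what a state watcher that tolerates short connection drops could look like. It does not describe how RADICAL-Pilot or SAGA currently behave: `watch_state`, `get_state`, the state names, and the retry parameters are all hypothetical.

```python
import time


def watch_state(get_state, final_states=("DONE", "FAILED", "CANCELED"),
                interval=0.0, max_failures=3):
    """Poll get_state() until it returns a final state.

    Up to max_failures consecutive ConnectionErrors are tolerated, with
    a growing pause between attempts; one more failure is re-raised.
    """
    failures = 0
    while True:
        try:
            state = get_state()
            failures = 0                      # connection is healthy again
        except ConnectionError:
            failures += 1
            if failures > max_failures:
                raise                         # give up: treat the drop as fatal
            time.sleep(interval * 2 ** failures)
            continue
        if state in final_states:
            return state
        time.sleep(interval)
```

A watcher like this only helps with transient drops. It does nothing about a session lifetime limit enforced by the remote system, where the connection would have to be re-established from scratch.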

mturilli commented 6 years ago

I think we should first establish whether the requirement is to run pilots for more than 12 hours. If so, we would need to discuss new functionality for RP, and this would become a (set of) feature request(s).

jdakka commented 6 years ago