Open jdakka opened 6 years ago
Note that SAGA_PTY_SSH_TIMEOUT is only relevant to the ssh handshake: once the ssh connection is up, the value has no effect.
@andre-merzky, what possible sources of ssh disconnection exist at this stage?
Two main sources: network errors (rare) and system settings (frequent). Most unix systems limit the lifetime of an ssh session to 12 or 24 hours; I believe this is the limit we are hitting in scenarios like yours. We have also seen ssh connections drop on bad networks, and those drops are just as fatal, but this is usually only a problem on laptops (which can run out of power, get closed and suspended, etc.).
Once the pilot is active, there can be issues where the client loses connection to the agent. @andre-merzky, is this still a possible issue?
The ssh connection is active over the whole lifetime of the application, in order to watch the pilot state.
Please note that we have two types of connections: ssh and MongoDB. The latter is less prone to interruptions and timeouts, but those can happen too.
I think we should first understand whether the requirement is to run pilots for more than 12 hours. If it is, we would need to discuss new functionality for RP, and this would become a (set of) feature request(s).
Request/issue: "Long queue times, typically at least a day - previous experience suggests frequent RCT disconnects"
RADICAL-SAGA is the RCT layer responsible for acquiring an ssh prompt to the login node of Blue Waters. Make sure to enable the following environment variable before submitting:
export SAGA_PTY_SSH_TIMEOUT=2000
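If the workflow is started from a Python script rather than an interactive shell, the same setting could be applied from within the script. The following is only a minimal sketch: `set_ssh_timeout` is a hypothetical helper, not part of RADICAL-SAGA, and it assumes the variable is read from the process environment when the ssh connection is opened, so it should be called before any RCT module is imported. The thread does not state the unit of the value, so the sketch passes it through unchanged.

```python
import os


def set_ssh_timeout(value=2000, env=None):
    """Set SAGA_PTY_SSH_TIMEOUT unless it is already defined.

    Hypothetical helper: `env` defaults to the process environment and
    can be replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    # Respect a value already exported in the shell (assumption: the
    # shell setting should win over the script default).
    env.setdefault("SAGA_PTY_SSH_TIMEOUT", str(value))
    return env["SAGA_PTY_SSH_TIMEOUT"]
```

Calling `set_ssh_timeout()` at the top of the script, before the RCT imports, mirrors the `export` line above. As noted elsewhere in this thread, the value only affects the ssh handshake.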
While the job is in the queue, SAGA will periodically check on the job. @andre-merzky, what possible sources of ssh disconnection exist at this stage?
Once the pilot is active, there can be issues where the client loses connection to the agent. @andre-merzky, is this still a possible issue?
** Note: if the job is killed via keyboard interrupt, make sure to hit ctrl-C only once. Hitting ctrl-C multiple times will kill the process that is attempting to terminate the job properly.
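For user scripts that wrap a run, one way to guard against repeated interrupts is to ignore further SIGINT signals while cleanup is in progress. This is a generic Python sketch, not a description of how RCT handles signals internally; `run_with_single_interrupt`, `work`, and `cleanup` are hypothetical names, and the function has to be called from the main thread because Python only allows signal handlers to be changed there.

```python
import signal


def run_with_single_interrupt(work, cleanup):
    """Run work(); on ctrl-C, run cleanup() while ignoring further ctrl-C."""
    try:
        return work()
    except KeyboardInterrupt:
        # Ignore additional SIGINTs so that repeated ctrl-C cannot kill
        # the cleanup step halfway through.
        previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
        try:
            cleanup()
        finally:
            # Restore the earlier handler (None means it was not set
            # from Python and cannot be restored this way).
            if previous is not None:
                signal.signal(signal.SIGINT, previous)
        # Re-raise so the caller still sees the interrupt.
        raise
```

Here `work` would be the function that submits and waits for the job, and `cleanup` the function that cancels it; both are placeholders for whatever the application already uses.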