@sanjay-saxena - thanks for writing this up! Yeah, I think we should set the ssh keep-alive to whatever the idle timeout value is plus some factor (thinking 60 seconds). If culling is not enabled (i.e., idle timeout is `0`), then we should probably use a `maxint` value.
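A minimal sketch of that policy, assuming a hypothetical helper (`calc_keepalive_interval` is illustrative, not an existing EG function):

```python
import sys

def calc_keepalive_interval(cull_idle_timeout, padding=60):
    """Derive an SSH keep-alive interval from EG's culling setting.

    If culling is disabled (cull_idle_timeout == 0), return a huge value
    so the tunnels effectively never time out.
    """
    if cull_idle_timeout <= 0:
        return sys.maxsize  # Python 3 stand-in for "maxint"
    return cull_idle_timeout + padding
```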
I was wondering if there is an API that helps retrieve the values of command-line options that were passed to Enterprise Gateway (EG). For example, if EG was started with `--MappingKernelManager.cull_idle_timeout=300`, do we (KG/EG) have APIs to retrieve the values of command-line options?
@sanjay-saxena - The `cull_idle_timeout` value is a class variable on `MappingKernelManager`. This class is a superclass (a couple of levels away) of `RemoteMappingKernelManager` - which is what ultimately creates `KernelManager` instances (or `RemoteKernelManager` instances in our case). As a result, you should be able to access `cull_idle_timeout` in `processproxy.py` via `self.kernel_manager.parent.cull_idle_timeout`.
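For example, something along these lines should work inside a process proxy (the class shown is a simplified stand-in; only the `self.kernel_manager.parent.cull_idle_timeout` access path comes from the discussion above):

```python
import sys

class RemoteProcessProxy:
    """Simplified stand-in for the process proxy in processproxy.py."""

    def __init__(self, kernel_manager):
        # kernel_manager.parent is the MappingKernelManager (or
        # RemoteMappingKernelManager) that created this kernel's manager.
        self.kernel_manager = kernel_manager

    def ssh_keepalive_interval(self, padding=60):
        cull_idle_timeout = self.kernel_manager.parent.cull_idle_timeout
        if cull_idle_timeout <= 0:  # culling disabled
            return sys.maxsize
        return cull_idle_timeout + padding
```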
One of the things that we need to ensure here is whether all six SSH connections need the `ServerAliveInterval` option specified. When the user starts interacting with a notebook, do all six SSH connections now become active (meaning sending/receiving data)? I am worried that if any one of the connections does not become active, then it may get disconnected while the other SSH connections chug along. Note that I have not observed this behavior; my thoughts are based on a purely hypothetical but conceivable scenario.

Do we send a heartbeat (ping/pong) periodically on each of the connections to let the underlying layers know that the connection is alive? Do we have a timer on each connection that gets reset every time there is traffic?
@sanjay-saxena good point. Looking at the Jupyter messaging protocol, it looks like all but one port is frequently involved - either constantly (like the heartbeat port) or via cell actions (like stdin, iopub, and shell). The control port, on the other hand, could expire the keep-alive if the user doesn't shut down or restart the kernel (or perform whatever else goes over that port) within the culling timeout. Perhaps this is why the jupyter console app doesn't tunnel the control port (although it could also be because they don't provide that level of functionality via the console).

Soooo, it seems like we might want to bring back the `timeout` parameter with values of `cull_idle_timeout` + some constant for all ports but the control port, and `maxint` for control. In addition, we'll want to treat the "comm port" we've introduced for remote interrupts the same as the control port.
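A sketch of that per-port policy (the port names and the helper function are illustrative; only the control/comm exemption and the `cull_idle_timeout` + constant formula come from the discussion above):

```python
import sys

# Ports that only see traffic on rare operations (shutdown, restart,
# remote interrupt), so an activity-based keep-alive could expire them
# even while the kernel is healthy.
LOW_TRAFFIC_PORTS = {"control", "comm"}

def tunnel_timeout(port_name, cull_idle_timeout, padding=60):
    """Pick an SSH tunnel timeout for the given kernel port."""
    if port_name in LOW_TRAFFIC_PORTS or cull_idle_timeout <= 0:
        return sys.maxsize  # effectively never expire
    return cull_idle_timeout + padding
```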
EG supports the notion of keepalive by means of the configuration parameter `--MappingKernelManager.cull_idle_timeout`. EG recommends setting this configuration parameter to 12 hours in a production environment.

SSH supports the notion of keepalive by means of `ClientAliveInterval` (on the server -- `/etc/ssh/sshd_config`) and `ServerAliveInterval` (on the client -- `/etc/ssh/ssh_config`). By default, these parameters are not set. Since SSH runs on top of TCP, I am assuming that it falls back to TCP's keepalive characteristics. On RedHat 7.3 based machines, the default TCP keepalive is configured as shown below:
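These are the relevant sysctl settings, shown here with the stock defaults that the next paragraph refers to:

```
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
```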
Note that `net.ipv4.tcp_keepalive_time` is in seconds. This means that if there is no activity between the client and the server for 2 hours (7200 seconds), TCP will initiate 9 probes that are 75 seconds apart; if the other side does not respond to any of those probes, TCP will drop the connection. In other words, a fully idle connection survives at most 7200 + 9 × 75 = 7875 seconds, roughly 2 hours and 11 minutes.
This ticket is to investigate SSH's keepalive mechanism and see how it works with EG's keepalive mechanism. Based on the findings, this may involve code changes as well as documentation changes. As far as I see, there are two main objectives here:

1. SSH connections should not get terminated due to inactivity while the Kernel is still running. This means that the Kernel's `cull_idle_timeout` must be greater than or equal to the keep-alive interval of the individual SSH connections (one way to set that interval is sketched after this list).
2. When there is any user activity in the notebook, all six SSH connections should be able to recognize it and reset their timeouts. This is needed to ensure that all six SSH connections stay intact and we don't end up in a situation where some of them get torn down because there was no traffic on them.
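For reference, a minimal sketch of setting a per-connection keep-alive when opening a tunnel with the stock `ssh` client. `ServerAliveInterval` and `ServerAliveCountMax` are standard OpenSSH client options, but the helper and its arguments are placeholders, not necessarily how EG establishes its tunnels:

```python
import subprocess

def open_tunnel(local_port, remote_port, remote_host, keepalive_secs):
    """Open one SSH local-forward tunnel with a client-side keep-alive.

    ServerAliveInterval makes the client send a protocol-level probe
    after `keepalive_secs` seconds of inactivity; after
    ServerAliveCountMax unanswered probes, the connection is closed.
    """
    cmd = [
        "ssh",
        "-o", f"ServerAliveInterval={keepalive_secs}",
        "-o", "ServerAliveCountMax=3",
        "-N",  # no remote command; port forwarding only
        "-L", f"{local_port}:localhost:{remote_port}",
        remote_host,
    ]
    return subprocess.Popen(cmd)
```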