jupyter-server / enterprise_gateway

A lightweight, multi-tenant, scalable and secure gateway that enables Jupyter Notebooks to share resources across distributed clusters such as Apache Spark, Kubernetes and others.
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/

Investigate SSH keep-alive using ServerAliveInterval and the overlap with CullingTimeout #207

Closed sanjay-saxena closed 7 years ago

sanjay-saxena commented 7 years ago

EG supports the notion of keep-alive by means of the configuration parameter --MappingKernelManager.cull_idle_timeout. EG recommends setting this parameter to 12 hours in a production environment.
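For example, an illustrative invocation (12 hours is 43200 seconds; the command name assumes EG's standard launcher):

```
jupyter enterprisegateway --MappingKernelManager.cull_idle_timeout=43200
```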

SSH supports the notion of keep-alive by means of ClientAliveInterval (on the server, in /etc/ssh/sshd_config) and ServerAliveInterval (on the client, in /etc/ssh/ssh_config). By default, these parameters are not set. Since SSH runs on top of TCP, I am assuming that it falls back to TCP's keepalive characteristics. On RedHat 7.3 based machines, the default TCP keepalive is configured as shown below:

```
[root@hostname .ssh]# sysctl --all | grep keepalive
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
```

Note that net.ipv4.tcp_keepalive_time is in seconds. This means that if there is no activity between the client and the server for 2 hours (7200 seconds), TCP will send up to 9 probes spaced 75 seconds apart, and if the other side does not respond to any of those probes, TCP will drop the connection. In other words, a completely idle connection is dropped after roughly 7200 + 9 × 75 = 7875 seconds, or about 2 hours and 11 minutes.
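For reference, this is roughly what the SSH-level knobs look like when set explicitly (illustrative values, not recommendations):

```
# /etc/ssh/ssh_config (client side): probe the server after 60s of
# inactivity; disconnect after 3 consecutive unanswered probes.
ServerAliveInterval 60
ServerAliveCountMax 3

# /etc/ssh/sshd_config (server side): the server-to-client equivalent.
ClientAliveInterval 60
ClientAliveCountMax 3
```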

This ticket is to investigate SSH's keep-alive mechanism and see how it interacts with EG's culling timeout. Based on the findings, this may involve code changes as well as documentation changes. As far as I can see, there are two main objectives here:

  1. SSH connections should not get terminated due to inactivity while the Kernel is still running. This means that the keep-alive/inactivity timeout of the individual SSH connections must be greater than or equal to the Kernel's cull_idle_timeout (see the sketch after this list).

  2. When there is any user activity in the notebook, all six SSH connections should be able to recognize it and reset their timeouts. This is needed to ensure that all six SSH connections stay intact and we don't end up in a situation where some of them get torn down because there was no traffic on them.
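For what it's worth, here is a minimal sketch of enabling an SSH-level keep-alive with paramiko (the host, user, and 60-second interval are illustrative placeholders, not EG's actual values):

```python
import paramiko

# Illustrative only: open an SSH connection and enable protocol-level
# keep-alives so an otherwise idle tunnel is not silently dropped.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("remote-host", username="elyra")  # placeholder host/user

transport = client.get_transport()
transport.set_keepalive(60)  # send a keep-alive packet after 60s of inactivity
```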

kevin-bates commented 7 years ago

@sanjay-saxena - thanks for writing this up! Yeah, I think we should set the SSH keep-alive to whatever the idle timeout value is plus some factor (thinking 60 seconds). If culling is not enabled (i.e., the idle timeout is 0), then we should probably use a maxint value.
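A sketch of that policy (the helper name and 60-second padding are placeholders, not settled values):

```python
import sys

def ssh_keepalive_timeout(cull_idle_timeout: int, padding: int = 60) -> int:
    """Derive an SSH keep-alive timeout from EG's culling timeout (hypothetical helper)."""
    if cull_idle_timeout <= 0:  # culling disabled
        return sys.maxsize      # effectively never time out
    return cull_idle_timeout + padding
```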

sanjay-saxena commented 7 years ago

I was wondering if there is an API that helps retrieve the values of command-line options that were passed to Enterprise Gateway (EG). For example, if EG was started with --MappingKernelManager.cull_idle_timeout=300, do we (KG/EG) have an API to retrieve that value?

kevin-bates commented 7 years ago

@sanjay-saxena - The cull_idle_timeout value is a class variable on MappingKernelManager. This class is a superclass (a couple of levels up) of RemoteMappingKernelManager, which is what ultimately creates KernelManager instances (or RemoteKernelManager instances in our case). As a result, you should be able to access cull_idle_timeout in processproxy.py via self.kernel_manager.parent.cull_idle_timeout.
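A minimal sketch of that access path (the class here is an illustrative stand-in, not EG's actual code):

```python
class ExampleProcessProxy:
    """Stand-in for a process proxy defined in processproxy.py."""

    def __init__(self, kernel_manager):
        # kernel_manager is the per-kernel (Remote)KernelManager; its parent
        # is the (Remote)MappingKernelManager that owns cull_idle_timeout.
        self.kernel_manager = kernel_manager

    def idle_timeout(self) -> int:
        return self.kernel_manager.parent.cull_idle_timeout
```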

sanjay-saxena commented 7 years ago

One of the things we need to determine here is whether all six SSH connections need the ServerAliveInterval option specified. When the user starts interacting with a notebook, do all six SSH connections become active (meaning sending/receiving data)? I am worried that if any one of the connections does not become active, it may get disconnected while the other SSH connections chug along. Note that I have not observed this behavior; my thoughts are based on a purely hypothetical but conceivable scenario.

Do we send a heartbeat (ping/pong) periodically on each of the connections to let the underlying layers know that the connection is alive? Do we have a timer on each connection that gets reset every time there is traffic?

kevin-bates commented 7 years ago

@sanjay-saxena good point. Looking at the Jupyter messaging protocol, it looks like all but one port is frequently involved, either constantly (like the heartbeat port) or via cell actions (like stdin, iopub, and shell). The control port, on the other hand, could exceed the keep-alive if the user doesn't shut down or restart the kernel (or trigger whatever else goes over that port) within the culling timeout. Perhaps this is why the jupyter console app doesn't tunnel the control port (although it could also be because they don't provide that level of functionality via the console).

Soooo, it seems like we might want to bring back the timeout parameter, with a value of cull_idle_timeout + some constant for all ports but the control port, and maxint for control. In addition, we'll want to treat the "comm port" we've introduced for remote interrupts the same as the control port.
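Roughly, the per-port policy might look like this (port names follow the Jupyter messaging protocol plus our comm port; the values are illustrative):

```python
import sys

CULL_IDLE_TIMEOUT = 43200  # e.g., the recommended 12 hours, in seconds
PADDING = 60               # the "some constant" above

# Hypothetical mapping of tunneled ports to keep-alive timeouts: busy ports
# expire a bit after the kernel would have been culled anyway, while the
# rarely used control and comm ports effectively never expire.
TUNNEL_TIMEOUTS = {
    "shell":   CULL_IDLE_TIMEOUT + PADDING,
    "iopub":   CULL_IDLE_TIMEOUT + PADDING,
    "stdin":   CULL_IDLE_TIMEOUT + PADDING,
    "hb":      CULL_IDLE_TIMEOUT + PADDING,
    "control": sys.maxsize,
    "comm":    sys.maxsize,
}
```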