Use keepalives - Githubissues

Note: This issue requires only a documentation change. Code does not need to be changed.

Something in the path to our database server, and/or the database engine itself, kills jobs that run longer than 120 minutes. This is the default keepalive time in both Linux and the database engine. One way to address this problem is to add keepalives on the client side at an interval shorter than the default (or whatever they happen to be set to be).

The PostgreSQL documentation for Database Connection Control Functions contains information on this. Note the following; this is a client-side setting, not a database engine setting:

The following functions deal with making a connection to a PostgreSQL backend server.

The documentation lists the functions and the parameters that can be passed to them. These parameters are listed in section 34.1.2, and include the following keepalive parameters.

keepalives Controls whether client-side TCP keepalives are used. The default value is 1, meaning on, but you can change this to 0, meaning off, if keepalives are not wanted. This parameter is ignored for connections made via a Unix-domain socket.

keepalives_idle Controls the number of seconds of inactivity after which TCP should send a keepalive message to the server. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPIDLE or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_interval Controls the number of seconds after which a TCP keepalive message that is not acknowledged by the server should be retransmitted. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPINTVL or an equivalent socket option is available, and on Windows; on other systems, it has no effect.

keepalives_count Controls the number of TCP keepalives that can be lost before the client's connection to the server is considered dead. A value of zero uses the system default. This parameter is ignored for connections made via a Unix-domain socket, or if keepalives are disabled. It is only supported on systems where TCP_KEEPCNT or an equivalent socket option is available; on other systems, it has no effect.

By setting these parameters to values that send keepalives at intervals shorter than the critical interval of 7200 s = 120 min, we can prevent the connection from being dropped.

Further, these parameters can be set in the database connection strings, which enables all clients to be set up appropriately without code change (which would be impossible with psql, for example). For example:

postgresql://user:password@db.pcic.uvic.ca/crmp?keepalives=1&keepalives_idle=300&keepalives_interval=300&keepalives_count=9

The 300 s (5 min) interval is very conservative, but given the minimal overhead for keepalives, it should not cause problems.

pacificclimate / pycds

Use keepalives #147