saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/

[BUG] tcp_keepalive not keeping TCP alive #57085

Open cruscio opened 4 years ago

cruscio commented 4 years ago

Description

Using the TCP transport, the tcp_keepalive* settings do not keep the publish connection alive when a network device with client/server idle timeouts sits between the minion and the master.
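For reference, a minimal sketch (not Salt's actual transport code) of the standard Linux socket options these settings are expected to translate to:

# Illustration only: the Linux socket options that tcp_keepalive,
# tcp_keepalive_idle, and tcp_keepalive_intvl are meant to drive.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # tcp_keepalive: True
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # tcp_keepalive_idle: 30
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # tcp_keepalive_intvl: 30

Keepalive probes are zero-length TCP segments, so a device that tracks idle time at the proxy/application layer (as HAProxy does below) can still tear the session down even while probes are being answered.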

Setup

The commands below set up a master and a minion, with HAProxy between them, in Docker containers. First, create a shared Docker network:

docker network create tcp_timeout

Salt Master

Start an Ubuntu:18.04 container. Install, configure, and start a salt-master:

docker run --rm -it --name salt-master --network tcp_timeout ubuntu:18.04 /bin/bash
#In the docker container
export DEBIAN_FRONTEND=noninteractive \
&& export PYTHONIOENCODING=utf8 \
&& apt-get update && apt-get install -y --no-install-recommends apt-utils && apt-get install -y gnupg wget openssl \
&& wget -O - https://repo.saltstack.com/py3/ubuntu/18.04/amd64/latest/SALTSTACK-GPG-KEY.pub | apt-key add - \
&& echo 'deb http://repo.saltstack.com/py3/ubuntu/18.04/amd64/3000 bionic main' | tee /etc/apt/sources.list.d/saltstack.list \
&& apt-get update && apt-get install -y salt-master python3-m2crypto \
&& echo "auto_accept: True" > /etc/salt/master.d/auto_accept.conf \
&& echo "transport: tcp"                   > /etc/salt/master.d/transport_tls.conf \
&& echo "tcp_keepalive: True"             >> /etc/salt/master.d/transport_tls.conf \
&& echo "tcp_keepalive_idle: 30"          >> /etc/salt/master.d/transport_tls.conf \
&& echo "tcp_keepalive_intvl: 30"         >> /etc/salt/master.d/transport_tls.conf \
&& /usr/bin/salt-master -l debug

HAProxy (simulating a network device with a TCP client timeout)

Write a configuration file to proxy salt-master TCP streams. Start an HAProxy container with that config.

cat << EOF > ./haproxy.cfg
global
    log stdout local0 info

defaults
    log global
    option tcplog
    timeout client          120s
    timeout server          120s
    timeout connect         120s

resolvers docker_dns
    nameserver dns1 127.0.0.11:53
    accepted_payload_size 8192

frontend publish_frontend
    bind    0.0.0.0:4505
    default_backend publish_backend

frontend return_frontend
    bind    0.0.0.0:4506
    default_backend return_backend

backend publish_backend
    mode   tcp
    server salt-master salt-master:4505 check resolvers docker_dns

backend return_backend
    mode   tcp
    server socm-001 salt-master:4506 check resolvers docker_dns
EOF
docker run --rm -it --name haproxy --network tcp_timeout -v ${PWD}/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg haproxy:2.1.4
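If running HAProxy is inconvenient, a rough asyncio stand-in for the same idle-timeout behavior is sketched below. It is a hypothetical helper, not part of the original reproduction: it only proxies the publish port 4505 for brevity and assumes the container hostnames used above.

# idle_proxy.py -- hypothetical HAProxy stand-in: forwards the publish port and
# drops the session after 120s with no payload in either direction. Bare TCP
# keepalive probes carry no payload, so they never reset this timer.
import asyncio

IDLE_TIMEOUT = 120                              # mirrors "timeout client" / "timeout server"
MASTER_HOST, MASTER_PORT = "salt-master", 4505  # publish port only, for brevity

async def pump(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    try:
        while True:
            data = await asyncio.wait_for(reader.read(65536), timeout=IDLE_TIMEOUT)
            if not data:             # peer closed the connection
                break
            writer.write(data)
            await writer.drain()
    except asyncio.TimeoutError:
        pass                         # idle too long: drop the session, like HAProxy's cD/sD
    finally:
        writer.close()

async def handle(client_reader, client_writer):
    upstream_reader, upstream_writer = await asyncio.open_connection(MASTER_HOST, MASTER_PORT)
    await asyncio.gather(
        pump(client_reader, upstream_writer),
        pump(upstream_reader, client_writer),
        return_exceptions=True,
    )

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 4505)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())

Either way, the observed behavior is the same: the publish connection is dropped once the idle timeout expires.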

Salt Minion

Start an Ubuntu:18.04 container. Install, configure, and start a salt-minion pointed at the HAProxy container as its master:

docker run --rm -it --name salt-minion --network tcp_timeout ubuntu:18.04 /bin/bash
#In the docker container
export DEBIAN_FRONTEND=noninteractive \
&& export PYTHONIOENCODING=utf8 \
&& apt-get update && apt-get install -y --no-install-recommends apt-utils && apt-get install -y gnupg wget openssl \
&& wget -O - https://repo.saltstack.com/py3/ubuntu/18.04/amd64/latest/SALTSTACK-GPG-KEY.pub | apt-key add - \
&& echo 'deb http://repo.saltstack.com/py3/ubuntu/18.04/amd64/3000 bionic main' | tee /etc/apt/sources.list.d/saltstack.list \
&& apt-get update && apt-get install -y salt-minion python3-m2crypto \
&& echo "master: haproxy" > /etc/salt/minion.d/master.conf \
&& echo "transport: tcp"                   > /etc/salt/minion.d/transport_tls.conf \
&& echo "tcp_keepalive: True"             >> /etc/salt/minion.d/transport_tls.conf \
&& echo "tcp_keepalive_idle: 30"          >> /etc/salt/minion.d/transport_tls.conf \
&& echo "tcp_keepalive_intvl: 30"         >> /etc/salt/minion.d/transport_tls.conf \
&& echo "ping_interval: 1"                >> /etc/salt/minion.d/transport_tls.conf \
&& echo "#ping_interval 1 = 60s"          >> /etc/salt/minion.d/transport_tls.conf \
&& /usr/bin/salt-minion --log-level info --log-file-level debug

Steps to Reproduce the behavior

  1. Start the master, proxy, and minion containers
  2. Wait 2 minutes
  3. The network device (HAProxy) closes the publish connection after its 120s idle timeout, despite the 30s tcp_keepalive and the 60s ping_interval

HAProxy logs either cD or sD as the session termination flags when it closes the connection; in both cases, no data crossed the connection for the duration of the timeout.

https://www.haproxy.org/download/2.1/doc/configuration.txt
cD   The client did not send nor acknowledge any data for as long as the
     "timeout client" delay. This is often caused by network failures on
     the client side, or the client simply leaving the net uncleanly.
--
sD   The server did not send nor acknowledge any data for as long as the
     "timeout server" setting during the data phase. This is often caused
     by too short timeouts on L4 equipment before the server (firewalls,
     load-balancers, ...), as well as keep-alive sessions maintained
     between the client and the server expiring first on haproxy.

Expected behavior

The documentation (master / minion) indicates these settings are intended for 'messy network environments with misbehaving firewalls', but they do not appear to help in this scenario.

One workaround would be to run test.ping at a higher frequency than the network timeout, but this has a not-insignificant CPU cost on Windows minions, where a separate minion process has to be spawned for every job (it is particularly costly in messy Windows antivirus environments).

Expected: There should be a keepalive setting that sends some data across the TCP Transport's publish channel in a way that doesn't require the minion to fork/spawn a worker.
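For illustration only, a generic sketch of the kind of application-level heartbeat being asked for (this is not Salt's transport API or wire protocol): writing even a tiny payload on the established connection resets intermediary idle timers, while kernel keepalive probes may not.

# Generic illustration -- not Salt's transport code. An application-level
# heartbeat writes real payload bytes on the existing connection, which resets
# idle timers on intermediate devices without forking a worker process.
import socket
import threading
import time

HEARTBEAT_INTERVAL = 60  # seconds; must be shorter than the device's idle timeout

def start_heartbeat(sock: socket.socket) -> threading.Thread:
    """Periodically send a tiny placeholder frame on an already-connected socket."""

    def loop() -> None:
        while True:
            time.sleep(HEARTBEAT_INTERVAL)
            try:
                # A real transport would send a well-formed "ping" message its
                # peer knows to discard; b"\x00" is only a stand-in here.
                sock.sendall(b"\x00")
            except OSError:
                return  # connection is gone; leave reconnection to the caller

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread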

Versions Report

Salt Master

Salt Version:
           Salt: 3000.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: 2.0.3
      gitpython: 2.1.8
         Jinja2: 2.10
        libgit2: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.2
          smmap: 2.0.3
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 18.04 bionic
         locale: utf8
        machine: x86_64
        release: 4.19.84-microsoft-standard
         system: Linux
        version: Ubuntu 18.04 bionic

Salt Minion

Salt Version:
           Salt: 3000.2

Dependency Versions:
           cffi: Not Installed
       cherrypy: Not Installed
       dateutil: 2.6.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.10
        libgit2: Not Installed
       M2Crypto: 0.31.0
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.5.6
   mysql-python: Not Installed
      pycparser: Not Installed
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.6.9 (default, Apr 18 2020, 01:56:04)
   python-gnupg: 0.4.1
         PyYAML: 3.12
          PyZMQ: 16.0.2
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.2.5

System Versions:
           dist: Ubuntu 18.04 bionic
         locale: utf8
        machine: x86_64
        release: 4.19.84-microsoft-standard
         system: Linux
        version: Ubuntu 18.04 bionic

Additional context

Salt Minion Log
root@af322572498d:/# /usr/bin/salt-minion --log-level info --log-file-level debug
[INFO    ] Setting up the Salt Minion "af322572498d"
[INFO    ] Starting up the Salt Minion
[INFO    ] Starting pull socket on /var/run/salt/minion/minion_event_44b4e7fb2e_pull.ipc
[INFO    ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
[INFO    ] Creating minion process manager
[INFO    ] Generating keys: /etc/salt/pki/minion
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Although 'dmidecode' was found in path, the current user cannot execute it. Grains output might not be accurate.
[INFO    ] Executing command ['date', '+%z'] in directory '/root'
[INFO    ] Added new job __mine_interval to scheduler
[INFO    ] Added mine.update to scheduler
[INFO    ] Minion is starting as user 'root'
[INFO    ] Minion is ready to receive requests!
[INFO    ] Running scheduled job: __mine_interval
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Connection to master haproxy lost
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Connection to master haproxy re-established
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Connection to master haproxy lost
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Connection to master haproxy re-established
[INFO    ] Exception caught in SaltMessageClient.close: IOLoop is already running
[INFO    ] Connection to master haproxy lost
[WARNING ] /usr/lib/python3/dist-packages/salt/transport/tcp.py:1076: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  log.warn('TCP Message Client encountered an exception %r', exc)

[WARNING ] TCP Message Client encountered an exception StreamClosedError('Stream is closed',)
^C[WARNING ] Minion received a SIGINT. Exiting.
[INFO    ] Shutting down the Salt Minion

HAProxy Log
└─▪ docker run --rm -it --name haproxy --network tcp_timeout -v ${PWD}/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg haproxy:2.1.4
<133>May  5 12:43:04 haproxy[1]: Proxy publish_frontend started.
<133>May  5 12:43:04 haproxy[1]: Proxy return_frontend started.
<133>May  5 12:43:04 haproxy[1]: Proxy publish_backend started.
<133>May  5 12:43:04 haproxy[1]: Proxy return_backend started.
[NOTICE] 125/124304 (1) : New worker #1 (6) forked
<134>May  5 12:43:47 haproxy[6]: 172.18.0.3:37578 [05/May/2020:12:43:47.207] return_frontend return_backend/socm-001 1/-1/0 0 -- 1/1/0/0/0 0/0
<134>May  5 12:43:47 haproxy[6]: 172.18.0.3:37582 [05/May/2020:12:43:47.340] return_frontend return_backend/socm-001 1/0/43 1030 -- 1/1/0/0/0 0/0
<134>May  5 12:43:47 haproxy[6]: 172.18.0.3:37590 [05/May/2020:12:43:47.418] return_frontend return_backend/socm-001 1/2/200 355 -- 2/1/0/0/0 0/0
<134>May  5 12:43:47 haproxy[6]: 172.18.0.3:37596 [05/May/2020:12:43:47.927] return_frontend return_backend/socm-001 1/0/9 84 -- 2/1/0/0/0 0/0
<134>May  5 12:43:48 haproxy[6]: 172.18.0.3:37600 [05/May/2020:12:43:47.944] return_frontend return_backend/socm-001 1/0/63 84 -- 2/1/0/0/0 0/0
<134>May  5 12:44:48 haproxy[6]: 172.18.0.3:37724 [05/May/2020:12:44:48.021] return_frontend return_backend/socm-001 1/1/16 84 -- 2/1/0/0/0 0/0
<134>May  5 12:45:47 haproxy[6]: 172.18.0.3:59514 [05/May/2020:12:43:47.387] publish_frontend publish_backend/salt-master 1/0/120002 0 sD 1/1/0/0/0 0/0
<134>May  5 12:45:47 haproxy[6]: 172.18.0.3:37850 [05/May/2020:12:45:47.407] return_frontend return_backend/socm-001 1/0/77 1295 -- 2/1/0/0/0 0/0
<134>May  5 12:45:47 haproxy[6]: 172.18.0.3:37854 [05/May/2020:12:45:47.499] return_frontend return_backend/socm-001 1/0/18 84 -- 2/1/0/0/0 0/0
<134>May  5 12:45:48 haproxy[6]: 172.18.0.3:37858 [05/May/2020:12:45:48.019] return_frontend return_backend/socm-001 1/0/6 84 -- 2/1/0/0/0 0/0
<134>May  5 12:46:48 haproxy[6]: 172.18.0.3:37982 [05/May/2020:12:46:48.025] return_frontend return_backend/socm-001 1/1/13 84 -- 2/1/0/0/0 0/0
<134>May  5 12:47:47 haproxy[6]: 172.18.0.3:59774 [05/May/2020:12:45:47.393] publish_frontend publish_backend/salt-master 1/0/120093 0 cD 1/1/0/0/0 0/0
<134>May  5 12:47:47 haproxy[6]: 172.18.0.3:38110 [05/May/2020:12:47:47.509] return_frontend return_backend/socm-001 1/0/43 1295 -- 2/1/0/0/0 0/0
<134>May  5 12:47:47 haproxy[6]: 172.18.0.3:38114 [05/May/2020:12:47:47.564] return_frontend return_backend/socm-001 1/1/40 84 -- 2/1/0/0/0 0/0
<134>May  5 12:47:48 haproxy[6]: 172.18.0.3:38118 [05/May/2020:12:47:48.026] return_frontend return_backend/socm-001 1/2/84 84 -- 2/1/0/0/0 0/0
^C[WARNING] 125/124844 (1) : Exiting Master process...
[ALERT] 125/124844 (1) : Current worker #1 (6) exited with code 130 (Interrupt)
[WARNING] 125/124844 (1) : All workers exited. Exiting... (130)
waynew commented 4 years ago

:+1: Thanks for the report @cruscio! I'll take a look at this and see if I can reproduce. Your setup instructions look straightforward, so I shouldn't have any issues there.

waynew commented 4 years ago

Yep, I see exactly this same behavior. Thanks for such a detailed report!

wegenerbenjamin commented 3 years ago

Is there an update on this issue? We ran into the same problem, but with publish.runner. It also looks like the timeout settings provided to the publish.* modules are not propagated to the underlying transport.tcp client. As a workaround, we modified the publish.* modules to pass the timeout through, i.e.

ret = channel.send(load, timeout=timeout)

instead of

ret = channel.send(load)

with load being a dict containing the key tmo with the timeout value.

Vaelatern commented 5 months ago

Starlink kills TCP connections after 20 seconds without activity.