saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0
14.22k stars 5.49k forks source link

[BUG] [3006.9] `TCPReqClient` is leaking connections to master port 4506 #67076

Open yiip87 opened 19 hours ago

yiip87 commented 19 hours ago

Description Over time, a salt minion can accumulate a lot of connections to the master on port 4506:

lsof -ni :4506 ``` COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME /opt/salt 484470 root 17u IPv4 6186861 0t0 TCP 87.XXX.XXX.XXX:52452->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 24u IPv4 6186867 0t0 TCP 87.XXX.XXX.XXX:52460->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 25u IPv4 6186868 0t0 TCP 87.XXX.XXX.XXX:52472->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 26u IPv4 6194226 0t0 TCP 87.XXX.XXX.XXX:52714->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 27u IPv4 6778844 0t0 TCP 87.XXX.XXX.XXX:40964->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 28u IPv4 6352325 0t0 TCP 87.XXX.XXX.XXX:35408->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 29u IPv4 6778845 0t0 TCP 87.XXX.XXX.XXX:40968->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 30u IPv4 6974213 0t0 TCP 87.XXX.XXX.XXX:46170->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 31u IPv4 6977733 0t0 TCP 87.XXX.XXX.XXX:46172->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 32u IPv4 7174154 0t0 TCP 87.XXX.XXX.XXX:57334->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 33u IPv4 6352610 0t0 TCP 87.XXX.XXX.XXX:35418->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 34u IPv4 6191663 0t0 TCP 87.XXX.XXX.XXX:52698->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 35u IPv4 7174692 0t0 TCP 87.XXX.XXX.XXX:57336->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 36u IPv4 8416860 0t0 TCP 87.XXX.XXX.XXX:39858->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 37u IPv4 7364717 0t0 TCP 87.XXX.XXX.XXX:44166->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 38u IPv4 6493059 0t0 TCP 87.XXX.XXX.XXX:56894->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 39u IPv4 6495901 0t0 TCP 87.XXX.XXX.XXX:56902->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 40u IPv4 8419406 0t0 TCP 87.XXX.XXX.XXX:39864->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 41u IPv4 7363063 0t0 TCP 87.XXX.XXX.XXX:44186->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 42u IPv4 8660855 0t0 TCP 87.XXX.XXX.XXX:36052->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 43u IPv4 7550338 0t0 TCP 87.XXX.XXX.XXX:49082->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 44u IPv4 8665928 0t0 TCP 87.XXX.XXX.XXX:36062->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 49u IPv4 7553362 0t0 TCP 87.XXX.XXX.XXX:49086->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 50u IPv4 7809973 0t0 TCP 87.XXX.XXX.XXX:40708->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 51u IPv4 7815404 0t0 TCP 87.XXX.XXX.XXX:40710->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 55u IPv4 8209315 0t0 TCP 87.XXX.XXX.XXX:42400->10.XXX.XXX.5:4506 (ESTABLISHED) /opt/salt 484470 root 57u IPv4 8210883 0t0 TCP 87.XXX.XXX.XXX:42410->10.XXX.XXX.5:4506 (ESTABLISHED) ```

With up to 3000 minions per master, this leads to oom_killer invocations on the master, killing the master completely or partial ReqServer MWorker processes.

After a minion restart, the connections go down to 5, but build up again over time. On our previous version (3000.8) only one connection to port 4506 was open at all times.

On the master, single ReqServer workers can get huge:

excerpt from top ``` top - 18:02:48 up 1 day, 22:12, 1 user, load average: 2.29, 2.09, 1.95 Tasks: 287 total, 2 running, 284 sleeping, 0 stopped, 1 zombie %Cpu(s): 1.4 us, 0.7 sy, 0.0 ni, 94.6 id, 3.2 wa, 0.0 hi, 0.0 si, 0.0 st GiB Mem : 30.9 total, 0.4 free, 21.8 used, 9.1 buff/cache GiB Swap: 14.9 total, 0.3 free, 14.6 used. 9.1 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 2199 salt 20 0 6100.3m 2.3g 5.8m S 22.9 7.4 10,57 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-10 2188 salt 20 0 4367.6m 2.2g 5.8m S 0.0 7.2 328:30.73 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-1 2207 salt 20 0 5829.6m 2.1g 5.8m S 4.0 6.9 9,35 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-18 2200 salt 20 0 5525.8m 2.0g 5.9m S 1.0 6.4 8,22 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-11 2201 salt 20 0 5151.1m 1.8g 5.8m S 3.7 5.8 6,56 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-12 2205 salt 20 0 3619.7m 1.3g 5.8m S 0.0 4.3 254:38.61 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-16 2202 salt 20 0 2913.4m 979.8m 5.9m S 0.0 3.1 205:37.18 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-13 2107 salt 20 0 3812.7m 839.1m 5.0m S 0.0 2.6 0:37.68 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master PubServerChannel._publish_daemon 2198 salt 20 0 2274.6m 819.6m 5.7m S 0.0 2.6 155:13.57 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-9 2203 salt 20 0 1782.1m 597.9m 5.7m S 0.0 1.9 129:52.60 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-14 2193 salt 20 0 1255.6m 410.9m 5.5m S 0.0 1.3 79:11.94 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-5 2208 salt 20 0 1478.6m 408.7m 5.8m S 0.0 1.3 82:23.37 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-19 2206 salt 20 0 1180.7m 330.7m 5.4m S 0.0 1.0 58:19.45 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-17 2195 salt 20 0 1085.7m 328.9m 5.5m S 0.0 1.0 53:19.51 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-7 2187 salt 20 0 1217.6m 317.1m 5.4m S 0.0 1.0 62:02.60 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-0 2194 salt 20 0 1101.8m 304.7m 5.6m S 0.0 1.0 56:01.66 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-6 2189 salt 20 0 1076.1m 294.7m 5.5m S 0.0 0.9 49:39.19 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-2 2197 salt 20 0 1147.0m 292.2m 5.5m S 0.0 0.9 43:41.30 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-8 2204 salt 20 0 1052.0m 286.9m 5.5m S 0.0 0.9 55:27.10 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-15 2191 salt 20 0 1064.6m 282.2m 5.4m S 0.0 0.9 55:14.47 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-3 2192 salt 20 0 1048.0m 276.7m 5.5m S 0.0 0.9 60:56.03 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-4 2109 salt 20 0 1417.4m 192.8m 4.3m S 0.0 0.6 19:39.64 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Engine(salt.loaded.int.engines.reactor) Reactor-3:1 2108 salt 20 0 300.5m 102.0m 3.3m S 0.3 0.3 5:17.45 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher 377163 salt 20 0 369.2m 55.4m 5.8m R 18.3 0.2 4:07.55 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Maintenance ```

Setup

Minion config (directories excluded) ```yaml acceptance_wait_time: 60 auth_timeout: 30 auth_tries: 2 keysize: 2048 log_level: error log_level_logfile: info master: - 10.XXX.XXX.5 - 10.XXX.XXX.6 - 10.XXX.XXX.7 - 10.XXX.XXX.8 - 10.XXX.XXX.9 - 10.XXX.XXX.10 - 10.XXX.XXX.11 - 10.XXX.XXX.12 master_alive_interval: 240 master_failback: true master_failback_interval: 1800 master_type: failover random_master: true random_reauth_delay: 270 recon_default: 1000 recon_max: 199000 recon_randomize: true retry_dns: 0 transport: tcp verify_master_pubkey_sign: true ```
Master config (directories excluded) ```yaml interface: 10.XXX.XXX.5 transport: tcp keep_jobs_seconds: 86400 loop_interval: 60 job_cache: True minion_data_cache: True max_minions: 3000 con_cache: True worker_threads: 20 master_sign_pubkey: True master_pubkey_signature: pubkey_signature master_use_pubkey_signature: True rotate_aes_key: False pillar_includes_override_sls: True log_level: info log_level_logfile: info ```

Steps to Reproduce the behavior

Rght after the start of the minion, at least 3 connections to the master port 4506 are already open:

~ # lsof -ni :4506
COMMAND    PID USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
/opt/salt 5197 root   17u  IPv4 237399288      0t0  TCP 87.XXX.XXX.XXX:39310->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 5197 root   24u  IPv4 237395968      0t0  TCP 87.XXX.XXX.XXX:39326->10.XXX.XXX.5:4506 (ESTABLISHED)
/opt/salt 5197 root   25u  IPv4 237399295      0t0  TCP 87.XXX.XXX.XXX:39334->10.XXX.XXX.5:4506 (ESTABLISHED)

If you run schedules that execute states, there might be more open connections. Check the connections again after a day or a few days depending on the activity on your servers (salt-call from the minion, schedules on the minion, etc.). The connection should increase. On our systems we have five schedules with varying intervals (180s ± 60s -> Most frequent, 14400s ± 2700s -> least frequent), all of which had return_job: True which is the default.

Expected behavior The minion has one permanent connection to master:4505 and on demand connections to master:4506

Versions Report

salt --versions-report (Minion) (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ```yaml Salt Version: Salt: 3006.9 Python Version: Python: 3.10.14 (main, Jun 26 2024, 11:44:37) [GCC 11.2.0] Dependency Versions: cffi: 1.14.6 cherrypy: 18.6.1 cryptography: 42.0.5 dateutil: 2.8.1 docker-py: Not Installed gitdb: Not Installed gitpython: Not Installed Jinja2: 3.1.4 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.19.1 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 6.0.1 PyZMQ: 23.2.0 relenv: 0.17.0 smmap: Not Installed timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4 System Versions: dist: debian 12 bookworm locale: utf-8 machine: x86_64 release: 6.1.114-xenu system: Linux version: Debian GNU/Linux 12 bookworm ```
salt --versions-report (Master) (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ```yaml Salt Version: Salt: 3006.9 Python Version: Python: 3.10.14 (main, Jun 26 2024, 11:44:37) [GCC 11.2.0] Dependency Versions: cffi: 1.14.6 cherrypy: unknown cryptography: 42.0.5 dateutil: 2.8.1 docker-py: Not Installed gitdb: Not Installed gitpython: Not Installed Jinja2: 3.1.4 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.19.1 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 6.0.1 PyZMQ: 23.2.0 relenv: 0.17.0 smmap: Not Installed timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4 System Versions: dist: debian 12 bookworm locale: utf-8 machine: x86_64 release: 6.1.0-28-amd64 system: Linux version: Debian GNU/Linux 12 bookworm ```

Additional context There were multiple attempts at fixing this, but those mostly seemed to address ZeroMQ, e.g Bug #61837 and PRs #65508, #65061, #65247 and #65559

yiip87 commented 18 hours ago

Current situation on one of our masters:

root@salt01:~# ./concache-cli.py
=> Got 3000 connected minions from ConCache

root@salt01:~# ./ckminions-cli.py
=> Got 3000 connected minions from CkMinions

root@salt01:~# lsof -ni :4506 | wc -l
24587

I would expect a maximum of 3000 connections to 4506, which would be one TCPReqClient connection per minion.