Open tjyang opened 5 months ago
@KiyoIchikawa, do you still have this issue ?
We tested on Windows minions and Linux minions. The Windows minions seem to be working better, but both are getting the following error(s) while applying a state.
Linux Minion (OEL 8)
2024-03-28 13:32:25,967 [salt.utils.extmods:155 ][ERROR ][1375064] Failed to sync modules module: File client timed out after 60 seconds
Windows (Windows Server 2016 Datacenter)
2024-03-28 12:50:34,269 [salt.loaded.int.beacons.adb:23 ][ERROR ][14288] Unable to load adb beacon: adb is missing.
2024-03-28 12:50:34,285 [salt.loaded.int.beacons.aix_account:23 ][ERROR ][14288] Unable to load aix_account beacon: Only available on AIX systems.
2024-03-28 12:50:34,332 [salt.loaded.int.beacons.avahi_announce:62 ][ERROR ][14288] Unable to load avahi_announce beacon: The 'python-avahi' dependency is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.bonjour_announce:31 ][ERROR ][14288] Unable to load bonjour_announce beacon: pybonjour library is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.btmp:134 ][ERROR ][14288] Unable to load btmp beacon: /var/log/btmp does not exist.
2024-03-28 12:50:34,519 [salt.loaded.int.beacons.glxinfo:23 ][ERROR ][14288] Unable to load glxinfo beacon: glxinfo is missing.
2024-03-28 12:50:37,862 [salt.loaded.int.beacons.haproxy:24 ][ERROR ][14288] Unable to load haproxy beacon: haproxy.get_sessions is missing.
2024-03-28 12:50:37,877 [salt.loaded.int.beacons.inotify:48 ][ERROR ][14288] Unable to load inotify beacon: pyinotify library is missing.
2024-03-28 12:50:37,904 [salt.loaded.int.beacons.journald:26 ][ERROR ][14288] Unable to load journald beacon: systemd library is missing.
2024-03-28 12:50:37,921 [salt.loaded.int.beacons.load:20 ][ERROR ][14288] Unable to load load beacon: Not available for Windows systems.
2024-03-28 12:50:37,931 [salt.loaded.int.beacons.log_beacon:37 ][ERROR ][14288] Unable to load log beacon: Not available for Windows systems or when regex library is missing.
2024-03-28 12:50:38,016 [salt.loaded.int.beacons.network_settings:86 ][ERROR ][14288] Unable to load network_settings beacon: pyroute2 library is missing
2024-03-28 12:50:38,079 [salt.loaded.int.beacons.sensehat:26 ][ERROR ][14288] Unable to load sensehat beacon: sensehat.get_pressure is missing.
2024-03-28 12:50:38,094 [salt.loaded.int.beacons.sh:24 ][ERROR ][14288] Unable to load sh beacon: strace is missing.
2024-03-28 12:50:38,141 [salt.loaded.int.beacons.smartos_imgadm:42 ][ERROR ][14288] Unable to load imgadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,173 [salt.loaded.int.beacons.smartos_vmadm:42 ][ERROR ][14288] Unable to load vmadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,204 [salt.loaded.int.beacons.telegram_bot_msg:30 ][ERROR ][14288] Unable to load telegram_bot_msg beacon: telegram library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.twilio_txt_msg:31 ][ERROR ][14288] Unable to load twilio_txt_msg beacon: twilio library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.watchdog:77 ][ERROR ][14288] Unable to load watchdog beacon: watchdog library is missing.
2024-03-28 12:50:38,251 [salt.loaded.int.beacons.wtmp:163 ][ERROR ][14288] Unable to load wtmp beacon: /var/log/wtmp does not exist.
2024-03-28 12:51:37,367 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync states module: File client timed out after 60 seconds
2024-03-28 12:53:55,975 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync output module: message authentication failed
2024-03-28 12:55:05,277 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync proxy module: File client timed out after 60 seconds
2024-03-28 12:56:14,469 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync serializers module: File client timed out after 60 seconds
The Linux minion keeps getting timeouts; I have not increased the log level on the master yet.
@KiyoIchikawa, thanks for the update. Your case above doesn't fit the issue I am facing here. When the salt-master needs a restart, it can't even issue a command to itself (`sudo salt saltmaster01 test.ping` fails), and yet `sudo salt-call test.ping` run on saltmaster01 works.
Here is more info from when salt commands can't be sent out.
[me@salt01 ~]$ date ;nc -zv salt01 4505;nc -zv salt01 4506
Fri Apr 5 11:09:23 CDT 2024
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 100.65.184.160:4506.
Ncat: 0 bytes sent, 0 bytes received in 0.02 seconds.
[me@salt01 ~]$
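The manual `nc` checks above can be scripted so the exact moment 4505 stops accepting connections gets logged. A minimal sketch (the `probe` helper and the default `salt01` hostname are my own assumptions, not part of Salt); it uses bash's `/dev/tcp` so it works even where ncat is not installed:

```shell
#!/usr/bin/env bash
# Probe the master's publish (4505) and request (4506) ports and print a
# timestamped OK/FAIL line for each, mirroring the nc checks above.
MASTER="${1:-salt01}"   # hostname is an assumption; pass yours as $1

probe() {
  local host=$1 port=$2
  # Attempt a TCP connect via bash's /dev/tcp with a 5-second timeout.
  if timeout 5 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "$(date '+%F %T') ${host}:${port} OK"
  else
    echo "$(date '+%F %T') ${host}:${port} FAIL"
  fi
}

probe "$MASTER" 4505
probe "$MASTER" 4506
```

Running this from cron every minute and appending to a log makes it easy to correlate the first FAIL on 4505 with master restarts or events in the master log.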
# ip address masked
tcp 1001 0 0.0.0.0:4505 0.0.0.0:* LISTEN 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.43.33:40224 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.116.28:38806 ESTABLISHED 2443253/salt-master
<snipped>
tcp 11 0 1.6.1.0:4505 1.4.81.5:60514 CLOSE_WAIT -
tcp 11 0 1.6.1.0:4505 1.4.81.5:40492 CLOSE_WAIT -
tcp 0 0 1.6.1.0:4505 1.4.44.3:52474 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.5.176.67:43538 ESTABLISHED 2443253/salt-master
tcp 11 0 1.6.1.0:4505 1.4.81.5:49436 CLOSE_WAIT -
tcp 0 0 1.6.1.0:4505 1.4.226.19:36712 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.5.178.42:35596 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.11.76:48254 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.29.8:40660 ESTABLISHED 2443253/salt-master
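The `CLOSE_WAIT` rows with no owning process are the interesting part of the netstat output above: they suggest the master stopped reading from (or leaked) sockets on 4505. A small helper to tally connection states on the publish port over time (a sketch; `count_states` is my own name, and the filter syntax assumes iproute2's `ss`, whose state column comes first):

```shell
# Tally TCP states from `ss -tan`-style output on stdin (header row skipped).
count_states() {
  awk 'NR > 1 { states[$1]++ } END { for (s in states) print s, states[s] }'
}

# Summarize connections on the publish port; a CLOSE-WAIT count that grows
# between runs would point at the master no longer servicing its sockets.
if command -v ss >/dev/null 2>&1; then
  ss -tan 'sport = :4505' | count_states
fi
```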
Small comment: we're facing the exact same issue since upgrading from 3006.7. At some point the master stops working and cannot send anything to the minions.
Restarting it doesn't fix anything, as it cannot bind to its port:
[salt.transport.tcp:1405][ERROR ][1609] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None
EDIT:
@compi-tom,
We're already running a 3-node multi-master setup. Restarting the master service doesn't solve the issue, as it cannot bind to port 4505 ==> reboot.
I'm considering downgrading to 3006.7
I think I tried downgrading to 3006.7 once and it didn't help this issue, so I decided to use 3007.0 again. If the downgrade approach is adopted, the minions need to be downgraded as well. The interesting thing is that the good 3007 salt-master (the one that got cloned) now has the same port 4505 malfunction; a salt-master stop/start brings port 4505 back up. There are many settings in the /etc/salt/master config file; it is time for me to study and understand the ZeroMQ settings.
I am following this PR: https://github.com/saltstack/salt/pull/66335. I don't really know the details, but this looks like the publish port issue that #66335 is trying to fix.
[me@salt01 srv]$ egrep '286129|MEM' salt-2024-04-24-06\:56.txt
PID STARTED TIME %MEM RSS SZ VSZ SIZE
286129 01:00:02 00:00:01 1.6 273528 185245 740980 273156
286129 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
[me@salt01 srv]$
[me@salt01 srv]$ nc -zv ilclsalt01 4505
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
[me@salt01 srv]$
[me@salt01 srv]$ egrep 'DATEH|memory|pgrep' Makefile
DATEH = $(shell date +"%Y-%m-%d-%R")
memory:
ps -o pid,start,time,%mem,rss,sz,vsz,size --sort -rss -p $$(pgrep -f salt-master) | head > salt-$(DATEH).txt
pgrep -af salt-master |grep -v grep >> salt-$(DATEH).txt
[me@salt01 srv]$
[me@salt01 srv]$ egrep '1\.6|MEM' salt-2024-04-25-06\:35.txt
PID STARTED TIME %MEM RSS SZ VSZ SIZE
896442 01:00:02 00:00:01 1.6 263128 182305 729220 261396 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
[me@salt01 srv]$
- This is the memory footprint two hours after salt-master was restarted.
[me@salt01 srv]$ cat salt-2024-04-25-09\:06.txt
PID STARTED TIME %MEM RSS SZ VSZ SIZE
1083486 06:39:33 00:00:01 0.8 131836 277801 1111204 213052
1083479 06:39:33 00:00:02 0.7 118252 165078 660312 128200
1083489 06:39:33 00:00:17 0.5 82504 212527 850108 124024
1083480 06:39:33 00:00:01 0.5 81732 137755 551020 83196
1083501 06:39:33 00:00:19 0.5 81552 212266 849064 122900
1083490 06:39:33 00:00:17 0.5 81484 212239 848956 122908
1083503 06:39:33 00:00:19 0.5 81184 211934 847736 121652
1083505 06:39:33 00:00:18 0.5 81148 212192 848768 122640
1083492 06:39:33 00:00:18 0.5 80784 211915 847660 121340
1083287 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master MainProcess
1083479 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master PubServerChannel._publish_daemon
1083480 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
1083483 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventMonitor
1083485 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer ReqServer_ProcessManager
1083486 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue
1083488 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-0
1083489 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-1
1083490 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-2
1083492 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-3
1083501 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-4
1083503 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-5
1083504 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-6
1083505 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-7
1125202 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master FileserverUpdate
1125292 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Maintenance
[me@salt01 srv]$
My 2nd salt-master failed to test.ping itself, and restarting salt-master doesn't work anymore; I had to reboot the RL8 OS. This definitely smells like kernel network resource depletion.
[me@salt02 ~]$ sudo salt salt02 test.ping
salt02:
True
[me@salt02 ~]$ uptime
06:11:19 up 4 min, 2 users, load average: 0.66, 0.29, 0.12
[me@salt02 ~]$
- Current ZeroMQ settings in /etc/salt/master. I need to read https://libzmq.readthedocs.io/en/latest/
[me@salt02 ~]$ sudo egrep "^pub_hwm|^ipc_write_buffer|^salt_event_pub_hwm|^event_publisher_pub_hwm" /etc/salt/master
pub_hwm: 10000
ipc_write_buffer: 'dynamic'
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000
[me@salt02 ~]$
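For anyone else studying these, the grepped values correspond to the following /etc/salt/master entries; this is an annotated copy of the same settings (values are the ones shown above, not recommendations, and the brief comments are my own reading of the Salt docs):

```yaml
# ZeroMQ high-water marks in /etc/salt/master.
# HWM = max messages queued on a socket before ZeroMQ starts dropping/blocking.
pub_hwm: 10000                    # HWM on the publisher interface (port 4505)
ipc_write_buffer: 'dynamic'       # size the IPC write buffer from available memory
salt_event_pub_hwm: 128000        # HWM for the SaltEvent pub socket
event_publisher_pub_hwm: 64000    # HWM for the EventPublisher pub socket
```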
After downgrading from 3007 to 3006.8, port 4505 on my salt-master servers is now stable and they are able to send out salt commands.
I've done the same, as this error was preventing our app from automatically adding VMs to cope with the load. Also, I didn't want to set up a daily restart to "bypass" the issue.
@compi-tom, thanks for the confirmation. Hopefully when 3007 turns from STS into LTS, this issue can be acked by the core team and resolved.
Any update on this issue? I had two servers working as multi-master nodes through multiple 3006 versions; I just upgraded to 3007.1 and I get this problem on only one of the servers.
[DEBUG ] PubClient connecting to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG ] PubClient conencted to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG ] Subscriber at connected
[DEBUG ] tcp stream to closed, unable to recv
We also had to downgrade our master to 3006.8 to resolve this issue. 3007.0 and 3007.1 did not work and we had to restart salt-master multiple times every day.
No logs are written at log level WARNING. The master just refuses to connect to any minion (even itself). salt-call works without issues.
Hello,
Today I updated from the latest 3005 to 3007.1 and got a lot of messages in the master log file:
2024-08-25 10:10:10,192 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:10,192 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
2024-08-25 10:10:10,193 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:10,193 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
2024-08-25 10:10:15,199 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:15,199 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
2024-08-25 10:10:15,200 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:15,201 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
It seems 3007.1 is not stable and we cannot use it, because communication with the minions does not work (we have about 2500 minions connected to the master). After rolling back to the latest 3006, everything works okay.
Just for information: the problem still exists.
Any fix around?
See also #66288 and potentially #66715. There are real issues on 3007+ with minion communications, and very little in terms of response to all of them.
Description
The 3007 master does not respond to salt commands unless the master is restarted, but it does answer salt-call run from the minion. The timeout issue leaves no log entries in /var/log/salt/master with the log file level set to debug. Currently trying the trace level.
Setup
(Please provide relevant configs and/or SLS files; be sure to remove sensitive info. There is no general set-up of Salt. Please be as specific as possible and give set-up details.)
Steps to Reproduce the behavior
$ sudo salt minion-on-saltmaster test.ping
Expected behavior
The salt-master should answer the test.ping command quickly. Instead, it times out.
Screenshots
cmd line
Summary
# of minions targeted: 1
# of minions returned: 0
# of minions that did not return: 1
# of minions with errors: 0
ERROR: Minions returned with non-zero exit code
[me@salt01 ~]$
2024-03-28 02:36:29,940 [salt.loader.lazy ][DEBUG ] The functions from module 'no_return' are being loaded by dir() on the loaded module
2024-03-28 02:36:29,940 [salt.utils.lazy ][DEBUG ] LazyLoaded no_return.output
2024-03-28 02:36:29,941 [salt.output ][TRACE ] data = {'salt01': 'Minion did not return. [No response]\nThe minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:\n\nsalt-run jobs.lookup_jid 20240328073609814061'}
Mar 27 13:09:15 salt01 salt-master[2162920]: [ERROR ] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None
Mar 27 14:00:04 salt01 salt-master[2163029]: [ERROR ] Requested method not exposed: _ext_nodes
Mar 27 15:00:04 isalt01 salt-master[2163274]: [ERROR ] Requested method not exposed: _ext_nodes
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
```yaml
Salt Version:
          Salt: 3007.0

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: unknown
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: rocky 8.9 Green Obsidian
        locale: utf-8
       machine: x86_64
       release: 4.18.0-513.18.1.el8_9.x86_64
        system: Linux
       version: Rocky Linux 8.9 Green Obsidian
```