saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14.1k stars 5.47k forks source link

[BUG] 3007 master 4505 publish port failed to respond to salt commands unless restart #66282

Open tjyang opened 5 months ago

tjyang commented 5 months ago

Description

3007 master not responding salt commands unless restarting the master. But it does answer salt-call ran from minion The timeout issue does not have log in /var/log/salt/master file with log file level set to debug. Currently trying the trace level.

log_level_logfile: trace

Setup (Please provide relevant configs and/or SLS files (be sure to remove sensitive info. There is no general set-up of Salt.)

Please be as specific as possible and give set-up details.

Steps to Reproduce the behavior $sudo salt minion-on-saltmaster test.ping

Expected behavior salt-master should answer the test.ping command quickly. In stead,it timeout.

Screenshots


Summary

of minions targeted: 1

of minions returned: 0

of minions that did not return: 1

of minions with errors: 0


ERROR: Minions returned with non-zero exit code [me@salt01 ~]$

- and this is trace log in /etc/salt/master when a minion not responding.

2024-03-28 02:36:29,940 [salt.loader.lazy ][DEBUG ] The functions from module 'no_return' are being loaded by dir() on the loaded module 2024-03-28 02:36:29,940 [salt.utils.lazy ][DEBUG ] LazyLoaded no_return.output 2024-03-28 02:36:29,941 [salt.output ][TRACE ] data = {'salt01': 'Minion did not return. [No response]\nThe minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:\n\nsalt-run jobs.lookup_jid 20240328073609814061'}

- salt-master status

Mar 27 13:09:15 salt01 salt-master[2162920]: [ERROR ] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None Mar 27 14:00:04 salt01 salt-master[2163029]: [ERROR ] Requested method not exposed: _ext_nodes Mar 27 15:00:04 isalt01 salt-master[2163274]: [ERROR ] Requested method not exposed: _ext_nodes

Mar 28 03:00:04 salt01 salt-master[2163147]: [ERROR ] Requested method not exposed: _ext_nodes Mar 28 04:00:07 salt01 salt-master[2163282]: [ERROR ] Requested method not exposed: _ext_nodes ``` **Versions Report**
salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ```yaml Salt Version: Salt: 3007.0 Python Version: Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0] Dependency Versions: cffi: 1.16.0 cherrypy: unknown dateutil: 2.8.2 docker-py: Not Installed gitdb: Not Installed gitpython: Not Installed Jinja2: 3.1.3 libgit2: Not Installed looseversion: 1.3.0 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.7 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 23.1 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.19.1 pygit2: Not Installed python-gnupg: 0.5.2 PyYAML: 6.0.1 PyZMQ: 25.1.2 relenv: 0.15.1 smmap: Not Installed timelib: 0.3.0 Tornado: 6.3.3 ZMQ: 4.3.4 Salt Package Information: Package Type: onedir System Versions: dist: rocky 8.9 Green Obsidian locale: utf-8 machine: x86_64 release: 4.18.0-513.18.1.el8_9.x86_64 system: Linux version: Rocky Linux 8.9 Green Obsidian ```
**Additional context** Add any other context about the problem here.
tjyang commented 5 months ago

@KiyoIchikawa, do you still have this issue ?

tjyang commented 5 months ago
/var/log/salt/master log with trace level enabled: after running sudo salt salt01 test.ping command it just time out on itsef , the master ``` 2024-03-28 08:33:40,794 [salt.utils.event ][TRACE ] _get_event() waited 0 seconds and received nothing 2024-03-28 08:33:40,804 [salt.utils.event ][TRACE ] Get event. tag: (salt/job|syndic/.*)/20240328133320851221 2024-03-28 08:33:40,886 [salt.utils.event ][TRACE ] Get event. tag: salt/job/20240328133325868511 2024-03-28 08:33:40,886 [salt.utils.event ][TRACE ] _get_event() waited 0 seconds and received nothing 2024-03-28 08:33:40,887 [salt.client ][DEBUG ] return event: {'salt01': {'failed': True}} 2024-03-28 08:33:40,892 [salt.loader.lazy ][DEBUG ] The functions from module 'localfs' are being loaded by dir() on the loaded module 2024-03-28 08:33:40,893 [salt.utils.lazy ][DEBUG ] LazyLoaded localfs.init_kwargs 2024-03-28 08:33:40,980 [salt.loader.lazy ][DEBUG ] The functions from module 'localfs' are being loaded by dir() on the loaded module 2024-03-28 08:33:40,980 [salt.utils.lazy ][DEBUG ] LazyLoaded localfs.init_kwargs 2024-03-28 08:33:40,982 [salt.loader.lazy ][DEBUG ] The functions from module 'no_return' are being loaded by dir() on the loaded module 2024-03-28 08:33:40,982 [salt.utils.lazy ][DEBUG ] LazyLoaded no_return.output 2024-03-28 08:33:40,982 [salt.output ][TRACE ] data = {'salt01': 'Minion did not return. [No response]\nThe minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:\n\nsalt-run jobs.lookup_jid 20240328133320851221'} 2024-03-28 08:33:40,982 [salt.transport.tcp][DEBUG ] tcp stream to closed, unable to recv 2024-03-28 08:33:41,973 [salt.utils.process][TRACE ] Process manager iteration 2024-03-28 08:33:43,754 [salt.utils.process][TRACE ] Process manager iteration 2024-03-28 08:33:51,983 [salt.utils.process][TRACE ] Process manager iteration 2024-03-28 08:33:53,765 [salt.utils.process][TRACE ] Process manager iteratio ```
/var/log/salt/master log with trace level enabled: after running sudo salt-call test.ping command. salt master is answering querying. ``` 2024-03-28 08:43:33,533 [salt.crypt ][DEBUG ] salt.crypt.get_rsa_key: Loading private key 2024-03-28 08:43:33,533 [salt.crypt ][DEBUG ] salt.crypt.sign_message: Signing message. 2024-03-28 08:43:33,533 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'salt/auth\n\n��resultãact�accept�id�salt01�pub�\x01�-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9xxxxx-masked--Z7HIgb+x+dyYNDe\nb3RFpYx7QFxzHmShDutLMyjfAR26npS8fsN1+nHcoXTkHC6kE2OjC0yc/rp5Crr6\nIQIDAQAB\n-----END PUBLIC KEY-----�_stamp�2024-03-28T13:43:33.532394' 2024-03-28 08:43:33,533 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload 2024-03-28 08:43:33,533 [salt.master ][TRACE ] Ignore tag salt/auth 2024-03-28 08:43:33,581 [salt.master ][TRACE ] AES payload received with command _pillar 2024-03-28 08:43:33,581 [salt.pillar ][DEBUG ] Determining pillar cache 2024-03-28 08:43:33,600 [salt.loader.lazy ][DEBUG ] The functions from module 'roots' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,600 [salt.utils.lazy ][DEBUG ] LazyLoaded roots.envs 2024-03-28 08:43:33,602 [salt.loader.lazy ][DEBUG ] The functions from module 's3fs' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,603 [salt.utils.lazy ][DEBUG ] Could not LazyLoad roots.init: 'roots.init' is not available. 2024-03-28 08:43:33,623 [salt.loader.lazy ][DEBUG ] The functions from module 'jinja' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,623 [salt.utils.lazy ][DEBUG ] LazyLoaded jinja.render 2024-03-28 08:43:33,624 [salt.loader.lazy ][DEBUG ] The functions from module 'yaml' are being loaded by dir() on the loaded module 2024-03-28 08:43:33,624 [salt.utils.lazy ][DEBUG ] LazyLoaded yaml.render 2024-03-28 08:43:33,631 [salt.master ][TRACE ] Master function call _pillar took 0.05068349838256836 seconds 2024-03-28 08:43:33,632 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'minion/refresh/salt01\n\n��Minion data cache refresh�salt01�_stamp�2024-03-28T13:43:33.630948' 2024-03-28 08:43:33,632 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload 2024-03-28 08:43:33,632 [salt.master ][TRACE ] Ignore tag minion/refresh/salt01 2024-03-28 08:43:33,785 [salt.master ][TRACE ] AES payload received with command _return 2024-03-28 08:43:33,787 [salt.loaded.int.returner.local_cache][DEBUG ] Adding minions for job 20240328134333786284: ['salt01'] 2024-03-28 08:43:33,788 [salt.utils.job ][INFO ] Got return from salt01 for job 20240328134333786284 2024-03-28 08:43:33,789 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'salt/job/20240328134333786284/ret/salt01\n\n��cmd�_return�id�salt01�jid�20240328134333786284�returnçretcode\x00�fun�test.ping�fun_args��arg��tgt_type�glob�tgt�salt01�_stamp�2024-03-28T13:43:33.788313' 2024-03-28 08:43:33,789 [salt.loaded.int.returner.local_cache][DEBUG ] Reading minion list from /var/cache/salt/master/jobs/1b/c46d64efe0c85f68fb92d7c9137b70db8237f991f39268f4018c423a3aa009/.minions.p 2024-03-28 08:43:33,789 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload 2024-03-28 08:43:33,789 [salt.master ][TRACE ] Ignore tag salt/job/20240328134333786284/ret/salt01 2024-03-28 08:43:33,789 [salt.master ][TRACE ] Master function call _return took 0.003786802291870117 seconds 2024-03-28 08:43:34,247 [salt.utils.process][TRACE ] Process manager iteration ```
KiyoIchikawa commented 5 months ago

We tested on Windows minions and Linux minions. The Windows minions seem to be working better, but they both seem to be getting the following error(s) while applying a state.test.

Linux Minion (OEL 8)

2024-03-28 13:32:25,967 [salt.utils.extmods:155 ][ERROR   ][1375064] Failed to sync modules module: File client timed out after 60 seconds

Windows (Windows Server 2016 Datacenter)

2024-03-28 12:50:34,269 [salt.loaded.int.beacons.adb:23  ][ERROR   ][14288] Unable to load adb beacon: adb is missing.
2024-03-28 12:50:34,285 [salt.loaded.int.beacons.aix_account:23  ][ERROR   ][14288] Unable to load aix_account beacon: Only available on AIX systems.
2024-03-28 12:50:34,332 [salt.loaded.int.beacons.avahi_announce:62  ][ERROR   ][14288] Unable to load avahi_announce beacon: The 'python-avahi' dependency is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.bonjour_announce:31  ][ERROR   ][14288] Unable to load bonjour_announce beacon: pybonjour library is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.btmp:134 ][ERROR   ][14288] Unable to load btmp beacon: /var/log/btmp does not exist.
2024-03-28 12:50:34,519 [salt.loaded.int.beacons.glxinfo:23  ][ERROR   ][14288] Unable to load glxinfo beacon: glxinfo is missing.
2024-03-28 12:50:37,862 [salt.loaded.int.beacons.haproxy:24  ][ERROR   ][14288] Unable to load haproxy beacon: haproxy.get_sessions is missing.
2024-03-28 12:50:37,877 [salt.loaded.int.beacons.inotify:48  ][ERROR   ][14288] Unable to load inotify beacon: pyinotify library is missing.
2024-03-28 12:50:37,904 [salt.loaded.int.beacons.journald:26  ][ERROR   ][14288] Unable to load journald beacon: systemd library is missing.
2024-03-28 12:50:37,921 [salt.loaded.int.beacons.load:20  ][ERROR   ][14288] Unable to load load beacon: Not available for Windows systems.
2024-03-28 12:50:37,931 [salt.loaded.int.beacons.log_beacon:37  ][ERROR   ][14288] Unable to load log beacon: Not available for Windows systems or when regex library is missing.
2024-03-28 12:50:38,016 [salt.loaded.int.beacons.network_settings:86  ][ERROR   ][14288] Unable to load network_settings beacon: pyroute2 library is missing
2024-03-28 12:50:38,079 [salt.loaded.int.beacons.sensehat:26  ][ERROR   ][14288] Unable to load sensehat beacon: sensehat.get_pressure is missing.
2024-03-28 12:50:38,094 [salt.loaded.int.beacons.sh:24  ][ERROR   ][14288] Unable to load sh beacon: strace is missing.
2024-03-28 12:50:38,141 [salt.loaded.int.beacons.smartos_imgadm:42  ][ERROR   ][14288] Unable to load imgadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,173 [salt.loaded.int.beacons.smartos_vmadm:42  ][ERROR   ][14288] Unable to load vmadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,204 [salt.loaded.int.beacons.telegram_bot_msg:30  ][ERROR   ][14288] Unable to load telegram_bot_msg beacon: telegram library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.twilio_txt_msg:31  ][ERROR   ][14288] Unable to load twilio_txt_msg beacon: twilio library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.watchdog:77  ][ERROR   ][14288] Unable to load watchdog beacon: watchdog library is missing.
2024-03-28 12:50:38,251 [salt.loaded.int.beacons.wtmp:163 ][ERROR   ][14288] Unable to load wtmp beacon: /var/log/wtmp does not exist.
2024-03-28 12:51:37,367 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync states module: File client timed out after 60 seconds
2024-03-28 12:53:55,975 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync output module: message authentication failed
2024-03-28 12:55:05,277 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync proxy module: File client timed out after 60 seconds
2024-03-28 12:56:14,469 [salt.utils.extmods:155 ][ERROR   ][19124] Failed to sync serializers module: File client timed out after 60 seconds

The Linux minion keeps getting timeouts, I have not increased the log-level on the master yet.

tjyang commented 5 months ago

@KiyoIchikawa , thanks for the update. your case above doesn't fit the issue I am facing here. When salt-master needed restart, it can't even issue command itself(sudo salt saltmaster01 test.ping" . And yet "saltmaster01$sudo salt-call test.ping" works.

tjyang commented 5 months ago

Here is more info when salt commands can't be sent out.

compi-tom commented 5 months ago

Small comment, we're facing the exact same issue since upgrading from 3006.7. At some point the master stops working and cannot send anything to the minions.

Restarting it doesn't fix anything as it cannot bind to its port:

[salt.transport.tcp:1405][ERROR ][1609] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None

EDIT:

tjyang commented 5 months ago

@compi-tom,

compi-tom commented 5 months ago

We're already running a 3 nodes multi-masters. Restarting the master service doesn't solve the issue as it cannot bind to the 4505 port ==> reboot.

I'm considering downgrading to 3006.7

tjyang commented 5 months ago

I think I tried downgrade to 3006.7 once and it didn't help this issue so I decided use 3007.0 again. if downloading to older version approach is adopted the minion need to be downgraded also. Interesting thing is that good 3007 salt-master(one that got cloned) now has same port 4505 malfunction issue. a salt-master stop/start bring up the port 4505. They are many setting in /etc/salt/master config file. it is time for me to study and understand the meaning zeromq settings.

tjyang commented 5 months ago

I am following this PR: https://github.com/saltstack/salt/pull/66335 , not really know the details. but this issue looks like a publish port issue that 66335 trying to fix.

tjyang commented 5 months ago
tjyang commented 5 months ago

- This is memory footprints after salt-master got restarted after 2 hours.

[me@salt01 srv]$ cat salt-2024-04-25-09\:06.txt PID STARTED TIME %MEM RSS SZ VSZ SIZE 1083486 06:39:33 00:00:01 0.8 131836 277801 1111204 213052 1083479 06:39:33 00:00:02 0.7 118252 165078 660312 128200 1083489 06:39:33 00:00:17 0.5 82504 212527 850108 124024 1083480 06:39:33 00:00:01 0.5 81732 137755 551020 83196 1083501 06:39:33 00:00:19 0.5 81552 212266 849064 122900 1083490 06:39:33 00:00:17 0.5 81484 212239 848956 122908 1083503 06:39:33 00:00:19 0.5 81184 211934 847736 121652 1083505 06:39:33 00:00:18 0.5 81148 212192 848768 122640 1083492 06:39:33 00:00:18 0.5 80784 211915 847660 121340 1083287 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master MainProcess 1083479 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master PubServerChannel._publish_daemon 1083480 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher 1083483 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventMonitor 1083485 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer ReqServer_ProcessManager 1083486 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue 1083488 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-0 1083489 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-1 1083490 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-2 1083492 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-3 1083501 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-4 1083503 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-5 1083504 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-6 1083505 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-7 1125202 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master FileserverUpdate 1125292 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Maintenance [me@salt01 srv]$

tjyang commented 4 months ago

my 2nd salt-master failed to test.ping itself, restart salt-master doesn't work anymore. I had to reboot the RL8 OS. this definitely smell like kernel network resource depletion.


Summary

of minions targeted: 1

of minions returned: 1

of minions that did not return: 0

of minions with errors: 0


[me@salt02 ~]$ uptime 06:11:19 up 4 min, 2 users, load average: 0.66, 0.29, 0.12 [me@salt02 ~]$

- curren zeromq setting in /etc/salt/master, Need to read https://libzmq.readthedocs.io/en/latest/

[me@salt02 ~]$ sudo egrep "^pub_hwm|^ipc_write_buffer|^salt_event_pub_hwm|^event_publisher_pub_hwm" /etc/salt/master pub_hwm: 10000 ipc_write_buffer: 'dynamic' salt_event_pub_hwm: 128000 event_publisher_pub_hwm: 64000 [me@salt02 ~]$

tjyang commented 4 months ago

After downgrade from 3007 to 3006.8 version, my salt-master servers, the 4505 port is now stable and able to send out salt commands.

compi-tom commented 4 months ago

After downgrade from 3007 to 3006.8 version, my salt-master servers, the 4505 port is now stable and able to send out salt commands.

I've done the same as this error was preventing our app to automatically add VMs to cope with the load. Also, I didn't want to set a daily restart to "bypass" the issue.

tjyang commented 4 months ago

@compi-tom, Thanks for the confirmation. Hopeful when 3007 turned into LTS from STS. This issue can be acked by core team and resolved.

amalaguti commented 3 months ago

Any update on this issue ? I had two servers that were working as multimasters nodes thru multiple 3006 versions, just upgraded to 3007.1 and I get this problem in one of the servers only

[DEBUG   ] PubClient connecting to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG   ] PubClient conencted to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG   ] Subscriber at  connected
[DEBUG   ] tcp stream to  closed, unable to recv
JonasKrausch commented 3 months ago

We also had to downgrade our master to 3006.8 to resolve this issue. 3007.0 and 3007.1 did not work and we had to restart salt-master multiple times every day.

No logs are written with Loglevel WARNING. Master just refuses to connect to any minion (even itself). Salt-Call works without issues.

gotsmth commented 3 weeks ago

Hello,

Today i made update from latest 3005 to 3007.1 and i got a lot of messages in master log file: `2024-08-25 10:10:10,192 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:10,192 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected

2024-08-25 10:10:10,193 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:10,193 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected

2024-08-25 10:10:15,199 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:15,199 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected

2024-08-25 10:10:15,200 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv

2024-08-25 10:10:15,201 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected `

It seems 3007.1 is not stable and we cannot use it because communication with minions does not work (we have about 2500 minions connected to the master) After rollback to latest 3006 all works okay.

Just for information that problem still exists.

Any fix around?

afletch commented 3 weeks ago

See also #66288 and potentially #66715 There are real issues on 3007+ with minion communications and very little in terms of response to all these.