saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/

Intermittent ZMQ issues #66288

Open xorinzor opened 5 months ago

xorinzor commented 5 months ago

Description

We have two different VMs, each running a separate salt-master for a different environment, and recently upgraded to 3007 (onedir). At random, journalctl -xeu salt-master shows the error below, after which the /usr/bin/salt-master ReqServer ReqServer_ProcessManager RequestServer.zmq_device process uses 100% CPU and the salt-master processes no more events (salt-run state.event pretty=True displays nothing).

Mar 27 20:37:41 saltmaster1 salt-master[3357285]: [ERROR   ] Exception in request handler
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: Traceback (most recent call last):
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:     request = await asyncio.wait_for(self._socket.recv(), 0.3)
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:     return fut.result()
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:     result = recv(**kwargs)
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: zmq.error.Again: Resource temporarily unavailable
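For what it's worth, zmq.error.Again on its own is just pyzmq reporting that a non-blocking receive found nothing to read. Here is a minimal, Salt-independent sketch that raises the same exception class (it does not reproduce the hang, it only illustrates where the error comes from):

```python
import zmq

# Standalone illustration of the exception class seen in the traceback above:
# a non-blocking receive on a socket with no pending message raises
# zmq.error.Again ("Resource temporarily unavailable"). Nothing ever sends
# to this socket, so the recv() always comes back empty-handed.
ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)
sock.bind_to_random_port("tcp://127.0.0.1")

try:
    sock.recv(flags=zmq.NOBLOCK)
except zmq.error.Again as exc:
    print("zmq.error.Again raised as expected:", exc)
finally:
    sock.close()
    ctx.term()
```

So the EAGAIN itself looks like ordinary non-blocking ZMQ behaviour; the part that breaks things for us is the zmq_device process spinning at 100% CPU and the event bus going silent afterwards.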

strace on the ReqServer_ProcessManager RequestServer.zmq_device process shows this over and over again:

getpid()                                = 95422
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 95422
poll([{fd=46, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 95422
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 95422
poll([{fd=46, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 95422
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 95422
poll([{fd=46, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)

Executing service salt-master stop then shows this:

Mar 27 20:37:41 saltmaster1 salt-master[3357285]: zmq.error.Again: Resource temporarily unavailable
Mar 28 09:36:02 saltmaster1 systemd[1]: Stopping The Salt Master Server...
-- Subject: A stop job for unit salt-master.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit salt-master.service has begun execution.
--
-- The job identifier is 119404.
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: [ERROR   ] Future exception was never retrieved
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: future: <Future finished exception=ImportError("cannot import name 'wait' from partially initialized module 'multiprocessing.connection' (most likely due to a circular import) (/opt/saltstack/salt/lib/python3.10/multiprocessing/connection.py)")>
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: Traceback (most recent call last):
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/tornado/gen.py", line 234, in wrapper
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     yielded = ctx_run(next, result)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 632, in run
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     time.sleep(10)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1028, in _handle_signals
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.destroy(signum)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1089, in destroy
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.process_manager.kill_children()
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 704, in kill_children
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     p_map["Process"].join(0)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 149, in join
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     res = self._popen.wait(timeout)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/popen_fork.py", line 39, in wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     from multiprocessing.connection import wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1028, in _handle_signals
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.destroy(signum)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1089, in destroy
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.process_manager.kill_children()
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 704, in kill_children
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     p_map["Process"].join(0)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 149, in join
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     res = self._popen.wait(timeout)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/popen_fork.py", line 39, in wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     from multiprocessing.connection import wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: ImportError: cannot import name 'wait' from partially initialized module 'multiprocessing.connection' (most likely due to a circular import) (/opt/saltstack/salt/lib/python3.10/multiprocessing/connection.py)
Mar 28 09:36:07 saltmaster1 salt-master[3357204]: [WARNING ] Master received a SIGTERM. Exiting.
Mar 28 09:36:07 saltmaster1 salt-master[3357204]: The salt master is shutdown. Master received a SIGTERM. Exited.
Mar 28 09:36:07 saltmaster1 systemd[1]: salt-master.service: Succeeded.

Looking around the server, nothing stands out as being wrong, and the information on the Salt troubleshooting page unfortunately did not help resolve the issue.
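One extra data point for anyone triaging this: besides salt-run state.event, the event bus can also be watched from Salt's Python API. Below is a minimal sketch along the lines of the documented event-system usage; the config path and 30-second timeout are my assumptions for a default master install. On a healthy master it prints events as they arrive; on a master in the state described above it should just time out and print None, consistent with salt-run showing nothing.

```python
import salt.config
import salt.utils.event

# Assumes the default master config location; adjust for your install.
opts = salt.config.client_config("/etc/salt/master")

event_bus = salt.utils.event.get_event(
    "master",
    sock_dir=opts["sock_dir"],
    opts=opts,
    listen=True,
)

# Waits up to 30 seconds for the next event on the master event bus.
print(event_bus.get_event(wait=30, full=True))
```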

Setup


Versions Report

salt --versions-report:

```yaml
Salt Version:
          Salt: 3007.0

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.8.0
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: 1.6.4
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.12.1
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: ubuntu 20.04.6 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-173-generic
        system: Linux
       version: Ubuntu 20.04.6 focal
```


welcome[bot] commented 5 months ago

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here's some information that may help as you continue your Salt journey. Please be sure to review our Code of Conduct, and check out some of our community resources.

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar. If you have additional questions, email us at saltproject@vmware.com. We’re glad you’ve joined our community and look forward to doing awesome things with you!

xorinzor commented 5 months ago

Tried stopping both the salt-minion and salt-master services and clearing all Python cache files (/opt/saltstack/salt/lib/**/__pycache__/*.pyc) before starting the services again.

The exception still keeps appearing in the salt-master journal, although so far we have not observed the 100% CPU problem or minion connectivity issues again.

HG00 commented 5 months ago

Sorry for the me-too, but me too. We've got around 500 minions on this server, and it now stops responding almost daily with this same error in the log. We don't see 100% CPU usage when it stops. Restarting the service gets things going again for a while:

[salt.transport.zeromq:572 ][ERROR   ][3352122] Exception in request handler
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
    request = await asyncio.wait_for(self._socket.recv(), 0.3)
  File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
    result = recv(**kwargs)
  File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

Infra: VMware Ubuntu 20.04 guest, 16 GB, 2 cores

Versions:

Salt Version:
          Salt: 3007.0

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: unknown
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: ubuntu 20.04.6 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-173-generic
        system: Linux
       version: Ubuntu 20.04.6 focal

Shout if there's any more info that would be helpful.

xorinzor commented 5 months ago

Happened again today. This time it's the /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue process causing problems.

strace shows a loop:

getpid()                                = 138187
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 138187
poll([{fd=47, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 138187
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 138187
poll([{fd=47, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 138187
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 138187
poll([{fd=47, events=POLLIN}], 1, 0)    = 0 (Timeout)

The file descriptors point to Unix stream sockets:

COMMAND      PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
/opt/salt 138187 root   47u     unix 0xffff8c57afd7d980      0t0 3514615 type=STREAM
/opt/salt 138187 root   49u     unix 0xffff8c57afd7e640      0t0 3514617 type=STREAM
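In case it helps anyone gather the same data without lsof, the descriptor targets can also be read straight from procfs (run as root). A small sketch; the PID is the one from the strace output above, substitute your own:

```python
import os

PID = 138187  # the stuck MWorkerQueue process from the strace output above

fd_dir = f"/proc/{PID}/fd"
for fd in sorted(os.listdir(fd_dir), key=int):
    try:
        # Unix sockets show up as "socket:[<inode>]"; the inode can be matched
        # against `ss -xp` or /proc/net/unix to identify the peer.
        target = os.readlink(os.path.join(fd_dir, fd))
    except OSError:
        target = "(gone)"
    print(f"{fd:>4} -> {target}")
```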
mshields-frtservices commented 4 months ago

I'm having the same issue. Running v3007.0-0 on master and minions. Masters are on Rocky Linux 8 and minions are a mix of RL8 and CentOS 7.

jsansone-pw commented 4 months ago

We are also seeing this problem, running v3007.0 on Ubuntu 22.04. The Salt master seems to run fine for a while and then locks up with the same errors as in the original issue. Restarting the salt-master service fixes the problem, but it seems to occur again within 24 hours. However, we are not seeing the 100% CPU usage problem; in fact, CPU usage remains very low on our master.

jsansone-pw commented 4 months ago

Rolled back our minions and master to 3006.8 and the issue stopped occurring. I suspect this has something to do with the TCP transport changes implemented with ZMQ in 3007.x.

forever765 commented 1 month ago

I'm having the same problem.

henri9813 commented 1 month ago

Hello,

Same problem on my side.

We are on Rocky Linux 9 with version 3005.4.

I have to restart salt-master every day (or even more often, depending on the day).

dmurphy18 commented 1 month ago

@henri9813 Salt 3005.4 is EOL; I recommend upgrading to a currently supported version of Salt, 3006.9 or 3007.1. @dwoz This might interest you regarding the ZeroMQ problems introduced with 3007.x.

henri9813 commented 1 month ago

Sorry, I misread the dnf info; I'm on version 3007.1.

clayoster commented 2 weeks ago

> We are also seeing this problem, running v3007.0 on Ubuntu 22.04. Salt Master seems to run fine for a while and then locks up with same errors as seen in original issue. Restarting the salt-master service fixes the problem but it seems to occur again within 24 hours. However we are not seeing 100% CPU usage problem in fact CPU usage remains very low on our master.

Do you have vulnerability scanning going on in your environments? I had the same errors showing up in my logs before my masters would become unresponsive, but CPU usage didn't spike. I was able to correlate the logs with when vulnerability scans were occurring and prodding at the masters' ports.

More detail in my issue: #66519
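For anyone who wants to test that correlation in a lab, a crude way to imitate a scanner prodding the master ports is simply to open plain TCP connections to 4505/4506 and send bytes that are not a valid ZMQ greeting. This is only a sketch of the idea: the hostname is a placeholder, and I can't promise it reproduces the hang, so point it at a test master, never production.

```python
import socket

MASTER = "saltmaster-test.example.com"  # placeholder; use a test master only

for port in (4505, 4506):
    try:
        with socket.create_connection((MASTER, port), timeout=5) as s:
            # Send something that is not a valid ZMQ greeting, roughly what a
            # generic port scanner probing the service might do.
            s.sendall(b"GET / HTTP/1.1\r\nHost: scanner-test\r\n\r\n")
            s.settimeout(2)
            try:
                print(port, "reply:", s.recv(1024))
            except socket.timeout:
                print(port, "no reply before timeout")
    except OSError as exc:
        print(port, "connection failed:", exc)
```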

jsansone-pw commented 2 weeks ago

@clayoster That's really interesting. Yes we do, and we also use the same software you mentioned in your issue (Tenable Nessus vulnerability scanning). This would make a lot of sense, as it always seemed like the Salt master froze around a certain time of day (usually in the early morning). Did you find any way to prevent the issue? I am wondering if blocking connections from the scanning systems to those ports would do it.

xorinzor commented 2 weeks ago

@clayoster we don't have anything like that running.

For us it also appeared to happen quite randomly, even while actively monitoring the server and its journals.

We have since downgraded and have not had any issues.

henri9813 commented 2 weeks ago

In my case, my instance is public, so it's possible it has been scanned by a random bot.

clayoster commented 1 week ago

@jsansone-pw - Yes, that did successfully prevent the issue for me. I used a local firewall (nftables) to block connections to TCP 4505-4506 from the scan engine IPs, and that has kept my masters stable for over a month.