xorinzor opened this issue 7 months ago
Tried stopping both the salt-minion and salt-master services and clearing all compiled Python cache files (`/opt/saltstack/salt/lib/**/__pycache__/*.pyc`) before starting the services again. The exception still keeps appearing in the salt-master journal, although so far we have not observed the 100% CPU problem or the minion connectivity issues anymore.
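For reference, the cache clearing amounted to something like this: a minimal sketch assuming the default onedir layout under /opt/saltstack/salt, run with both services stopped first.

```python
import pathlib
import shutil

# Assumes the standard onedir install location mentioned above; adjust if yours differs.
SALT_LIB = pathlib.Path("/opt/saltstack/salt/lib")

# Remove every __pycache__ directory under the onedir lib tree. The .pyc files
# are regenerated automatically the next time salt-master / salt-minion start.
for cache_dir in SALT_LIB.rglob("__pycache__"):
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir, ignore_errors=True)
```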
Sorry for the me-too, but me too. We've got around 500 minions on this server, and it now stops responding almost daily with this same error in the log. We don't see 100% CPU usage when it's in this state. Restarting the service gets things going again for a while:
```
[salt.transport.zeromq:572 ][ERROR ][3352122] Exception in request handler
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
    request = await asyncio.wait_for(self._socket.recv(), 0.3)
  File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
    result = recv(**kwargs)
  File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
```
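For context on that last line: zmq.error.Again is what pyzmq raises when a recv in non-blocking mode finds no message waiting (EAGAIN); as far as I can tell, the async wrapper in zmq/_future.py ends up doing a non-blocking recv once it believes the socket is readable. A standalone sketch (not Salt's code; port and socket type are arbitrary) that produces the same exception:

```python
import zmq

# Not Salt's code: just a bound socket with nothing queued. A non-blocking
# recv() then raises zmq.error.Again ("Resource temporarily unavailable"),
# the same exception shown in the traceback above.
ctx = zmq.Context()
sock = ctx.socket(zmq.ROUTER)        # socket type chosen only for illustration
sock.bind("tcp://127.0.0.1:5555")    # arbitrary local port

try:
    sock.recv(flags=zmq.NOBLOCK)     # nothing has connected or sent anything
except zmq.Again as exc:
    print("EAGAIN:", exc)
finally:
    sock.close(linger=0)
    ctx.term()
```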
Infra: VMware Ubuntu 20.04 guest, 16 GB RAM, 2 cores
Versions:
```yaml
Salt Version:
  Salt: 3007.0

Python Version:
  Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.16.0
  cherrypy: unknown
  dateutil: 2.8.2
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.3
  libgit2: Not Installed
  looseversion: 1.3.0
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 23.1
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.19.1
  pygit2: Not Installed
  python-gnupg: 0.5.2
  PyYAML: 6.0.1
  PyZMQ: 25.1.2
  relenv: 0.15.1
  smmap: Not Installed
  timelib: 0.3.0
  Tornado: 6.3.3
  ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
  dist: ubuntu 20.04.6 focal
  locale: utf-8
  machine: x86_64
  release: 5.4.0-173-generic
  system: Linux
  version: Ubuntu 20.04.6 focal
```
Shout if there's any more info that would be helpful
Happened again today. This time it's the `/opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue` process causing problems. `strace` shows a loop:
```
getpid() = 138187
poll([{fd=49, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=47, events=POLLIN}], 1, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=49, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=47, events=POLLIN}], 1, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=49, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=47, events=POLLIN}], 1, 0) = 0 (Timeout)
```
The file descriptors point to unix stream sockets:
```
COMMAND      PID USER FD  TYPE  DEVICE             SIZE/OFF NODE    NAME
/opt/salt 138187 root 47u unix  0xffff8c57afd7d980 0t0      3514615 type=STREAM
/opt/salt 138187 root 49u unix  0xffff8c57afd7e640 0t0      3514617 type=STREAM
```
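For what it's worth, that strace output is the signature of a loop that keeps polling with a zero timeout instead of blocking: every `poll(..., 0)` returns immediately with nothing ready, and if nothing else in the loop sleeps, a core gets pinned, as in the original report. A stdlib-only sketch that produces the same syscall pattern under strace (the socket pairs are just placeholders, not Salt's descriptors):

```python
import select
import socket

# Placeholder fds: two idle unix stream socket pairs, registered for POLLIN.
a, _a = socket.socketpair()
b, _b = socket.socketpair()

poller = select.poll()
poller.register(a, select.POLLIN)
poller.register(b, select.POLLIN)

# Each poll(0) returns immediately with no ready descriptors, so the loop
# spins; attaching `strace -e trace=poll -p <pid>` shows repeated
# "poll(..., 0) = 0 (Timeout)" lines, just like above.
for _ in range(10_000_000):          # bounded here; the real process never stops
    if poller.poll(0):               # 0 ms timeout -> never blocks
        break
```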
I'm having the same issue. Running v3007.0-0 on master and minions. Masters are on Rocky Linux 8 and minions are a mix of RL8 and CentOS 7.
We are also seeing this problem, running v3007.0 on Ubuntu 22.04. The Salt master seems to run fine for a while and then locks up with the same errors as in the original issue. Restarting the salt-master service fixes the problem, but it seems to occur again within 24 hours. However, we are not seeing the 100% CPU usage problem; in fact, CPU usage remains very low on our master.
Rolled back our minions and master to 3006.8 and the issue stopped occurring. I suspect this has something to do with the TCP/ZMQ transport changes introduced in 3007.x.
I have the same problem. When the master is upgraded from v3006.5 to v3007.1, it will most likely no longer be able to contact the minions: `salt '*' test.ping` returns `Minion did not return. [Not connected]`.
Hello, same problem on my side. We are on Rocky Linux 9 with version 3005.4. I had to restart salt-master every day (or even more often, depending on the day).
@henri9813 Salt 3005.4 is EOL; I recommend upgrading to a currently supported version of Salt, 3006.9 or 3007.1. @dwoz This might interest you with regard to the ZeroMQ problems introduced in 3007.x.
Sorry, I misread the dnf info; I'm on version 3007.1.
> We are also seeing this problem, running v3007.0 on Ubuntu 22.04. The Salt master seems to run fine for a while and then locks up with the same errors as in the original issue. Restarting the salt-master service fixes the problem, but it seems to occur again within 24 hours. However, we are not seeing the 100% CPU usage problem; in fact, CPU usage remains very low on our master.
Do you have vulnerability scanning going on in your environments? I had the same errors showing up in my logs before my masters would become unresponsive but CPU usage didn't spike. I was able to correlate the logs with when vulnerability scans were occurring and prodding at the master's ports.
More detail in my issue: #66519
@clayoster That's really interesting; yes, we do, and we also have the same software you mentioned in your issue (Tenable Nessus vulnerability scanning). This would also make a lot of sense, as it always seemed like the Salt master froze around a certain time of day (usually early morning). Did you find any way to prevent the issue? I am wondering if blocking connections from the scanning systems to those ports would do it.
@clayoster We don't have anything like that running. For us it also appeared to happen really randomly, even while actively monitoring the server and its journals. Currently we have downgraded and have not had any issues since.
In my case, my instance is public, so it's possible it has been scanned by a random bot.
@jsansone-pw - Yes, that did successfully prevent the issue for me. I used a local firewall (nftables) to block connections to TCP 4505-4506 from the scan engine IPs, and that kept my masters stable for over a month.
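A side note for anyone who wants to test the scanner theory against a non-production master: a scanner hitting 4505/4506 is essentially just an ordinary TCP client connecting and sending bytes that are not ZeroMQ's wire protocol. A rough, unverified sketch of that kind of probe (the hostname is a placeholder, and this is not a confirmed reproducer of the hang):

```python
import socket

# Placeholder target; point this at a throwaway test master, never production.
MASTER = "salt-master.example.internal"

# Open a plain TCP connection to the ReqServer port and send a payload that is
# not part of the ZMTP handshake, roughly what a port/vulnerability scanner does.
with socket.create_connection((MASTER, 4506), timeout=5) as conn:
    conn.sendall(b"GET / HTTP/1.0\r\n\r\n")
    try:
        print(conn.recv(1024) or b"(connection closed)")
    except socket.timeout:
        print("no response before timeout")
```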
Description
We have 2 different VMs, each running a separate salt-master for different environments, and we recently upgraded to 3007 (onedir). At random, `journalctl -xeu salt-master` will show the error below, after which the `/usr/bin/salt-master ReqServer ReqServer_ProcessManager RequestServer.zmq_device` process will use 100% CPU and no more events are processed by the salt-master (`salt-run state.event pretty=True` displays nothing). `strace` on the `ReqServer_ProcessManager RequestServer.zmq_device` process shows this over and over again:

Executing `service salt-master stop` then shows this:

Looking around the server, there is nothing that stands out as being wrong. Using the information provided on the SaltStack troubleshooting page unfortunately did not help resolve the issue.
Setup
Versions Report
`salt --versions-report`
```yaml
Salt Version:
  Salt: 3007.0

Python Version:
  Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.14.6
  cherrypy: 18.8.0
  dateutil: 2.8.2
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.3
  libgit2: 1.6.4
  looseversion: 1.3.0
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 23.1
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.19.1
  pygit2: 1.12.1
  python-gnupg: 0.5.2
  PyYAML: 6.0.1
  PyZMQ: 25.1.2
  relenv: 0.15.1
  smmap: Not Installed
  timelib: 0.3.0
  Tornado: 6.3.3
  ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
  dist: ubuntu 20.04.6 focal
  locale: utf-8
  machine: x86_64
  release: 5.4.0-173-generic
  system: Linux
  version: Ubuntu 20.04.6 focal
```