saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[BUG] intermittent connection between master and minion #65265

Open qianguih opened 12 months ago

qianguih commented 12 months ago

Description: I am seeing a weird connection issue in my Salt setup. There are ~30 minions registered with the master. For a few of them, the master could no longer connect after a while. `salt '*' test.ping` failed with the following error message:

    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20230920213139507242

Here are a few observations:

Setup (relevant configs and/or SLS files, with sensitive info removed):

file_roots: base:

deploy: cmd.run:


Please be as specific as possible and give set-up details.

- [x] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [x] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [ ] onedir packaging
- [x] used bootstrap to install


**Versions Report**
<details><summary>salt --versions-report</summary>

```yaml
Salt Version:
          Salt: 3006.3

Python Version:
        Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: Not Installed
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: alpine 3.14.6 
        locale: utf-8
       machine: x86_64
       release: 5.11.0-1022-aws
        system: Linux
       version: Alpine Linux 3.14.6 

```
</details>


brettgilmer commented 9 months ago

Hi - I am seeing this same issue. I am also seeing it on the periodic pings configured by the "ping_interval" minion configuration parameter. Running Salt 3006.5 on both minion and master.
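For reference, `ping_interval` is a documented minion option, expressed in minutes, with 0 (the default) disabling it. A minimal sketch of the setting mentioned above, using an illustrative value:

```yaml
# /etc/salt/minion -- sketch only; ping_interval is in minutes, 0 (the default) disables it
ping_interval: 5
```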

darkpixel commented 9 months ago

It seems worse on 3006.5 with a Linux master managing Windows minions.

brettgilmer commented 8 months ago

This is affecting many of our endpoints. I can get them to re-establish communication by restarting the minion or the master, but they lose communication again.

brettgilmer commented 8 months ago

Restarting the saltmaster seems to fix the issue for all minions, for a while, but the issue will return after about 12 hours on a different seemingly random selection of minions.

raddessi commented 7 months ago

I seem to have a very similar issue with 3006.x but in my case restarting the master does not have any effect and only a minion restart resolves the issue.

Another oddity is that I can see in the minion logs that the minion is still receiving commands from the master and is able to execute them just fine but the master seemingly never receives the response data. If I issue a salt '*' service.restart salt-minion from the master all of the minions receive the command and restart and pop back up just fine and then communication will work for probably another 12 hours or so.

I don't recall having this issue on 3005.x, but I have not downgraded that far yet. So far both 3006.5 and 3006.4 minions have the problem for me. I'll try to run a tcpdump if I have time.

ReubenM commented 7 months ago

I am encountering similar issues. Everything is 3006.5.

I've spent two days thinking I broke something in some recent changes I made, but I've found that the minions' jobs are succeeding; they just time out trying to communicate back to the master. I'm thinking this may be related to concurrency + load. I use this for testing environment automation, and during tests I have concurrent jobs fired off by the scheduler for test data collection. And that is where the issues start to show up in the logs. When this happens, the minions seem to try to re-send the data, which just compounds the problem. The logs on the master show that it is getting the messages, because it is flagging duplicate messages, but something seems to be getting lost processing the return data.

The traces all look the same and seem to indicate something is getting dropped in concurrency-related code:

```
2024-01-29 15:22:57,215 [salt.master      :1924][ERROR   ][115353] Error in function minion_pub:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1910, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 338, in send
    ret = yield self._uncrypted_transfer(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 309, in _uncrypted_transfer
    ret = yield self.transport.send(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 909, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 589, in send
    recv = yield future
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 387, in run_job
    pub_data = self.pub(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 1913, in pub
    raise SaltReqTimeoutError(
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1918, in run_func
    ret = getattr(self, func)(load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1839, in minion_pub
    return self.masterapi.minion_pub(clear_load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/daemons/masterapi.py", line 952, in minion_pub
    ret["jid"] = self.local.cmd_async(**pub_load)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 494, in cmd_async
    pub_data = self.run_job(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/client/__init__.py", line 409, in run_job
    raise SaltClientError(general_exception)
```
darkpixel commented 7 months ago

I just discovered something.

At any random time I might have 25-50 minions that don't appear to respond to jobs. They may or may not respond to test.ping, but definitely won't respond to something like state.sls somestate.

...but they ARE actually listening to the master.

So my workflow is stupidly:

```
salt '*' minion.restart
# Wait for every minion to return that it failed to respond
salt '*' state.sls somestate
```
raddessi commented 7 months ago

@darkpixel Yes, I have found the same thing and have the same workflow. Something just gets stuck and responses get lost somewhere. In my experience the minions are always still receiving events, though, just as you say.

raddessi commented 6 months ago

This seems to still be an issue on 3006.7 when both minion and master are the same version

darkpixel commented 6 months ago

3007.0 is...worse?

Woke up to all ~600 minions in an environment being offline.

salt '*' test.ping failed for every minion.

The log showed returns from every minion, but the master spit out Minion did not return: [Not connected] for every single one.

Restarted the salt-master service, got distracted for ~15 minutes, ran another test.ping and several hundred failed to respond.

Used Cluster SSH to connect to every machine I can reach across the internet and restarted the salt-minion service, and I'm down to a mix of ~60 (Windows, Linux, and BSD) that don't respond and I can't reach. Maybe 10 of them are 3006.7.

I'd love to test/switch to a different transport like websockets that would probably be more stable, but it appears to be "all or nothing". If I switch to websockets on the master, it looks like every minion will disconnect unless I also update them to use websockets...and if I update them to use websockets and something breaks, I'm going to have to spend the next month trying to get access to hosts to fix salt-minion.
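For anyone weighing that experiment, a minimal sketch of what the switch would involve. Salt 3007 added a websocket transport; the config option below is the long-standing `transport` setting, and `ws` is assumed to be the value that selects the new transport (check the 3007 transport docs before relying on it). The same value has to be set on the master and on every minion, which is why it is all or nothing:

```yaml
# /etc/salt/master -- sketch only, not verified against the 3007 docs; every minion
# needs the matching "transport: ws" line too, or it drops off when the master switches
transport: ws
```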

darkpixel commented 5 months ago

It just happened on my master which is 3007.0...

I was running a highstate on a minion that involves certificate signing, and it refused to generate the certificate, with no error messages in the salt master log.

I tried restarting the salt master, no dice.

About 10 minutes later I decided to restart the salt master's minion...and suddenly certificate signing worked.

The minion on the master wasn't communicating with the master...locally...on the same box...

gregorg commented 5 months ago

> It just happened on my master which is 3007.0...

Try some ZMQ tuning; I did it on my 3006.4 (the latest really stable version):

```yaml
# The number of salt-master worker threads that process commands
# and replies from minions and the Salt API
# Default: 5
# Recommendation: 1 worker thread per 200 minions, max 1.5x cpu cores
# 24x1.5 = 36, should handle 7200 minions
worker_threads: 96

# The listen queue size / backlog
# Default: 1000
# Recommendation: 1000-5000
zmq_backlog: 2000

# The publisher interface ZeroMQPubServerChannel
# Default: 1000
# Recommendation: 10000-100000
pub_hwm: 50000

# Default: 100
# Recommendation: 100-500
thread_pool: 200

max_open_files: 65535
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000
```
sasidharjetb commented 5 months ago

Where do I need to add this, @gregorg, in the salt master, and how do we need to add it?

gregorg commented 5 months ago

> Where do I need to add this, @gregorg, in the salt master, and how do we need to add it?

Add this in /etc/salt/master and restart salt-master.

sasidharjetb commented 5 months ago

(screenshot attached)

sasidharjetb commented 5 months ago

We upgraded Salt to 3006.4 on the master and 20 minions, out of which 10 minions are not upgraded. Will this solve my issue?

gregorg commented 5 months ago

This is not a support ticket, look at salt master logs.

darkpixel commented 5 months ago

I tried those settings @gregorg. It's been intermittent for the last three days... and this morning 100% of my minions are offline (even the local one on the salt master).

If I connect to a box with a minion, the service is running, and I can run state.highstate locally and everything works properly.

Restarting the master brings everything online.

There's nothing that appears unusual in the master log. I can even see minions reporting their results if I do something like salt '*' test.ping, but all I get back is Minion did not return. [Not connected].

I'd love to switch to a potentially more reliable transport, but it looks like Salt can only have one transport active at a time...so if I enable something like websockets it looks like all my minions will be knocked offline until I reconfigure them.

darkpixel commented 5 months ago

I just noticed an interesting log entry on the master. A bunch of my minions weren't talking again, even though the log had a ton of lines of "Got return from..."

So I restarted salt-master and noticed this in the log:

```
2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140336] MWorkerQueue under PID 2140336 is closing
2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140337] MWorkerQueue under PID 2140337 is closing
2024-04-07 14:40:30,306 [salt.transport.zeromq:477 ][INFO    ][2140318] MWorkerQueue under PID 2140318 is closing
2024-04-07 14:40:30,307 [salt.transport.zeromq:477 ][INFO    ][2140341] MWorkerQueue under PID 2140341 is closing
2024-04-07 14:40:30,307 [salt.transport.zeromq:477 ][INFO    ][2140339] MWorkerQueue under PID 2140339 is closing
2024-04-07 14:40:30,310 [salt.transport.zeromq:477 ][INFO    ][2140335] MWorkerQueue under PID 2140335 is closing
2024-04-07 14:40:30,312 [salt.transport.zeromq:477 ][INFO    ][2140319] MWorkerQueue under PID 2140319 is closing
2024-04-07 14:40:30,312 [salt.transport.zeromq:477 ][INFO    ][2140338] MWorkerQueue under PID 2140338 is closing
2024-04-07 14:40:30,316 [salt.transport.zeromq:477 ][INFO    ][2140320] MWorkerQueue under PID 2140320 is closing
2024-04-07 14:40:30,335 [salt.transport.zeromq:477 ][INFO    ][2140343] MWorkerQueue under PID 2140343 is closing
2024-04-07 14:40:30,360 [salt.transport.zeromq:477 ][INFO    ][2140333] MWorkerQueue under PID 2140333 is closing
2024-04-07 14:40:31,307 [salt.utils.process:745 ][INFO    ][2140315] Some processes failed to respect the KILL signal: Process: <Process name='MWorkerQueue' pid=2140316 parent=2140315 started> (Pid: 2140316)
2024-04-07 14:40:31,308 [salt.utils.process:752 ][INFO    ][2140315] kill_children retries left: 3
2024-04-07 14:40:31,334 [salt.utils.parsers:1061][WARNING ][2140139] Master received a SIGTERM. Exiting.
2024-04-07 14:40:31,334 [salt.cli.daemons :99  ][INFO    ][2140139] The Salt Master is shut down
2024-04-07 14:40:32,241 [salt.cli.daemons :83  ][INFO    ][2407186] Setting up the Salt Master
```

Specifically this:

```
2024-04-07 14:40:31,307 [salt.utils.process:745 ][INFO    ][2140315] Some processes failed to respect the KILL signal: Process: <Process name='MWorkerQueue' pid=2140316 parent=2140315 started> (Pid: 2140316)
```

Maybe something's hanging the MWorkerQueue?

amalaguti commented 5 months ago

> It just happened on my master which is 3007.0...
>
> Try some zmq tuning, I did it on my 3006.4 (latest really stable version): …

Any improvement with these settings?

darkpixel commented 5 months ago

I didn't use those exact settings because my master is smaller and has fewer minions. 8 cores * 1.5 threads per core = 12 threads = 2,400 minions (I only have ~700 on this test box).
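Roughly, a sketch of what those scaled-down numbers could look like in /etc/salt/master. The worker_threads value follows the arithmetic above; the other two lines simply reuse @gregorg's suggested settings and are assumptions, not values anyone in this thread confirmed using verbatim:

```yaml
# /etc/salt/master -- sketch for an 8-core master with ~700 minions
worker_threads: 12    # 8 cores x 1.5, per the guideline quoted above
zmq_backlog: 2000     # reused from the tuning comment above (assumption)
pub_hwm: 50000        # reused from the tuning comment above (assumption)
```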

It's no longer dropping all the minions every few hours...it's more like once or twice a week.

darkpixel commented 4 months ago

Also, I'm not sure if this is related or not, but it seems to be in the same vein--communication between the minions and master is pretty unreliable.

```
root@salt:~# salt 'US*' state.sls win.apps
'str' object has no attribute 'pop'

root@salt:~# salt 'US*' state.sls win.apps
US-REDACTED-54:
    Data failed to compile:
----------
    The function "state.sls" is running as PID 7340 and was started at 2024, May 02 03:22:05.301817 with jid 20240502032205301817
'str' object has no attribute 'pop'

root@salt:~# service salt-master restart
root@salt:~# service salt-minion restart
# wait 30 seconds or so for minions to reconnect
root@salt:~# salt 'US*' state.sls win.apps
<snip lots of output>
-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 428
# of minions returned: 405
# of minions that did not return: 23
# of minions with errors: 6
-------------------------------------------
ERROR: Minions returned with non-zero exit code
root@salt:~#
```
raddessi commented 4 months ago

Checking back in here, I think this is actually resolved for me once I got all my minions to 3007.0. I've removed all restart cron jobs and the minions appear to have been stable for days now. Is anyone else still having issues with 3007.0 minions?

> communication between the minions and master is pretty unreliable

Yeah... this may still be an issue for me as well. I'm not sure yet. I noticed some odd things last night in testing, but it could be unrelated. I definitely don't get the 'str' object has no attribute 'pop' error or any other error, but sometimes minions do not return in time.

Rosamaha1 commented 4 months ago

Hi all,

I have the same problem on 3007.0. Master-to-minion connectivity always fails; the other way is working fine. I did some tuning of the master conf but it didn't help!

tomm144 commented 4 months ago

We are currently encountering the same issue (salt-master and minions are both 3006.7). A ping to one minion causes high load on the master, and the minion becomes "unavailable", i.e. the master cannot receive the answer. The minion log shows this message (multiple times):

May 21 09:56:28 salt-minion[276145]: [ERROR ] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20240521075610312606', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2024-05-21T07:56:10.447647', 'nonce': ''} request

darkpixel commented 3 months ago

3007.1 is completely dead for me. Under 3006.7, minions would slowly become unavailable over a day or two until I ran a minion.restart. Even though all commands to the minions returned "Minion did not return. [Not connected]", they actually were connected and restarted themselves and started communicating properly.

Now under 3007.1 (skipped 3007.0 because it was pretty well b0rked for Windows minions), minions disconnect after a few minutes.

If I restart the salt master and issue a command, I'm good. If I restart the salt master and wait ~5 minutes, all the minions are offline and won't come back with a minion.restart, only restarting the salt master.

The salt master logs show a non-stop stream of "Authentication requested from" and "Authentication accepted from" messages. Typically I would get those messages right after restarting the 3006.7 master or after issuing a command like salt '*' test.ping, but they'd settle down when nothing was going on.

Now I'm getting 10-15 per second non-stop.

Using the minion on the master, I can view the logs and verify the minion doesn't receive the minion.restart command--but there are also no errors about communication issues with the master.

Even stranger, I can connect out to a minion and manually run state.highstate and it works perfectly fine. No issues communicating with the master there....just receiving commands I guess.

darkpixel commented 3 months ago

Hmm...I noticed something interesting and potentially significant. I saw the con_cache setting in the config file that defaults to false. I figured I would turn it on and see what happened.

After restarting the master, I get lots of entries like this in the log:

2024-05-28 02:39:17,506 [salt.utils.master:780 ][INFO    ][806466] ConCache 299 entries in cache

It sits there and counts up (if I'm idle or issuing a command like test.ping) until it hits about 300-400 entries in the cache... then, with no warnings or errors in the log, it resets and starts counting up again in the middle of the flood of "Authentication requested from" and "Authentication accepted from" messages.
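For anyone else experimenting with this, a minimal sketch of the setting being described; con_cache is a documented master option and defaults to False:

```yaml
# /etc/salt/master -- sketch; enables the minion connection cache described above
con_cache: True
```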

darkpixel commented 3 months ago

I downgraded the master and the minion running on the master to 3006.8 and semi-reliable connectivity appears to have been restored. All the minions are still running 3007.1 and appear to be working fine.

con_cache was probably a red herring as it keeps dropping back to 0 and counting back up constantly.

Rosamaha1 commented 3 months ago

> I downgraded the master and the minion running on the master to 3006.8 and semi-reliable connectivity appears to have been restored. All the minions are still running 3007.1 and appear to be working fine.
>
> con_cache was probably a red herring as it keeps dropping back to 0 and counting back up constantly.

I can confirm! 3007.0 for the salt master was a total mess for me as well! On the minion side the version is working fine. I also downgraded my salt master to 3006.8 and everything is working fine now. Hopefully a stable 3007 release will arrive soon!

frenkye commented 3 months ago

Same issue: upgraded the master to 3007.1 and minions are dropping like flies. The crazy part is that when I run tcpdump on a minion and run salt 'minion' test.ping, it doesn't even show a single packet on the minion; the connection is totally fucked. In netstat on the master you can see ESTABLISHED and CLOSE_WAIT connections to 4505 from the minion.

salt-call from the minion works OK, but it bypasses the connection between master and minion, since it creates its own connection for that call.

With the latest releases Salt is becoming more and more unusable in a prod environment:

- 3007.1 - minions don't work
- 3007.0 - Windows fucked
- 3006.6 - grains don't work
- 3005 - salt-ssh was semi-working

So it is very uncomfortable upgrading/downgrading when you need a specific feature to work.

PS: Downgrading the master to 3006.8 with 3007.x minions is working OK. A 3007.x master is not working as expected.

pholbrook commented 3 months ago

I had a recent very frustrating experience with the lack of stability in the 3007.0 master. I've been attempting to upgrade one environment from 3001 to 3007, with about 1250 minions all running 3000. I was unable to get a master to run stably on 3007.0. We had to fall back to the 3006.8 master to complete the upgrade.

Our Salt setup in this particular environment is very minimal: we're only using Salt for its remote execution facilities - we aren't running states at all. (In fact, what we do have defined for states and pillar are likely broken; we see errors about them in the logs, but it's irrelevant to how we use Salt in this case.)

Our environment is based on RHEL 7 under VMware ESX VMs. We run the master in a container. Initially we attempted to replace the 3001 master with the 3007.0 master, but every time we brought up the master, it proved unable to talk to the minions. We saw the key requests come in and get auto approved, and as others have said, we were able to execute commands on the minion VMs via salt-call, but pings or any other commands from the master failed, and we had to bring the 3007 master back up.

After several attempts at upgrading the master whole cloth, we instead embarked on a strategy of setting up a separate VM with the 3007.0 master, and then using the 3001 master to update the minion config to attach to both masters. We started with small batches, first successfully attaching minions to both. We then attempted to use the 3007.0 master via cmd.run_bg to update the minion config to only talk to 3007, and then to use yum upgrade to update the minions to 3007.0 as well. I tried using the -b and --batch-wait parameters to space out the upgrades, but kept running into points where the batch would fail because the 3007 master was completely unable to talk to any of the minions. This happened with as few as 300-400 minions attached to the 3007 master. We found that somewhere above 900 minions we were unable to get the master to start successfully at all.

Though minions checked in, the master was unable to communicate with them. For a time it seemed that stopping the container and completely removing the /var/cache/salt directory helped, but the relief was brief: as soon as we started doing anything with multiple minions, the communication blackout would fall.

We learned that restarting the master would allow connection to the minions for perhaps an hour, but then the master would lapse into being unable to talk to any minions. Minions could still communicate with masters, and from their point of view, they never lost the connection. For that matter, there was nothing in the master logs about any issues, either - just a failure to communicate.

Even though we'd never had any issues with our 3001 master and minion configs, I went through the advice about scaling Salt and made adjustments to the master VM's resources, taking it from 6 GB RAM and 4 processors to 8 GB/8 processors. I also tried various combinations of tuning parameters in both the minion and master configs, but in the end, though I was able to bring up the 3007.0 master with as many as 960 minions, the master was not stable - within an hour of starting, it would lose connection and stay that way for hours or days, and then mysteriously reconnect. After increasing the VM resources on the 3007 master, I never saw any evidence that the VM itself was under any serious load.

After 6 separate attempts over dozens of hours to get my 1250 minions upgraded, switching to a 3006.8 master is what did the trick. As soon as I brought up the 3006.8 master in place of my 3007.0 master, it was able to communicate with the already updated 3007.0 minions, and I was able to move the last few hundred VMs off my 3001 master and upgrade them to 3007.0 without any issues. That was my UAT environment, but now at least I have confidence that I should be able to complete my ~1100 minion production environment in a far more expeditious way.

The whole experience was rather sobering. In the next four months I'm going to have to update a far older version of Salt in four different environments. Those environments rely completely on Salt for all configuration from kickstarting to Kubernetes configuration. I think the dual master strategy could be helpful, but at this point, I'm definitely going to target 3006.8 rather than 3007 unless and until I hear these issues have been fixed.

darkpixel commented 3 months ago

> This happened with as few as 300-400 minions attached to the 3007 master. We found that somewhere above 900 minions we were unable to get the master to start successfully at all.

I've been testing in one of my environments with ~512 minions and my experience is similar.

> Though minions checked in, the master was unable to communicate with them. For a time it seemed that stopping the container and completely removing the /var/cache/salt directory helped, but the relief was brief: as soon as we started doing anything with multiple minions, the communication blackout would fall.

Yup. Since upgrading my two big experiences have been:

> We learned that restarting the master would allow connection to the minions for perhaps an hour, but then the master would lapse into being unable to talk to any minions. Minions could still communicate with masters, and from their point of view, they never lost the connection. For that matter, there was nothing in the master logs about any issues, either - just a failure to communicate.

I thought the tcp_keepalive_idle minion setting might fix it. I changed it from the default of 300 down to 60. Huge mistake. The moment that setting hit all the minions, the CPU load on my master (8 cores and 16 GB RAM) spiked up to ~15 and there was zero communication with any minion--not even locally using salt-call. I reverted the setting and tried to run salt '*' state.sls salt.minion and nothing responded. I tried applying it locally via salt-call -l info state.sls salt.minion and it just sat there hung trying to talk to the master. A minute later the master itself started timing out. Restarts didn't fix it. I ended up blocking the two ports used by Salt in the firewall, and immediately the local salt-call command started running great and the CPU load came down. I had to adjust the firewall for an hour, slowly letting in new IPs and running salt '*' state.sls salt.minion to get them to stop trampling the master.
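For reference, a sketch of the minion keepalive options involved, shown with what should be their documented defaults (worth verifying against your release before touching them, given the experience above):

```yaml
# /etc/salt/minion -- sketch; documented defaults, shown here only for orientation
tcp_keepalive: True
tcp_keepalive_idle: 300     # the value lowered to 60 above, which backfired
tcp_keepalive_cnt: -1
tcp_keepalive_intvl: -1
```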

> I think the dual master strategy could be helpful

I don't think having two masters that "just can't even" will help. ;)

> Those environments rely completely on Salt for all configuration from kickstarting to Kubernetes configuration.

I feel you. We've invested a ton of time over the past ~10 years automating systems using Saltstack. I love using it. ...but over the lifetime of the 3006.x releases until now, we've had one disastrous release after another. There's always something critical or important broken in Salt that causes problems in our environment...and honestly, it's usually something related to the transport. ZMQ seems to be the wrong choice of transports in Salt.

I'd start testing and switch to websockets in a heartbeat... but at the moment it looks like you can only have one transport active on your master. So you either switch everything all at once, or you don't. And unfortunately the installers don't really support choosing a transport, e.g. msiexec /i salt-package-for-windows.msi MASTER=salt.example.tld uses ZMQ. You don't really have an option unless you want to somehow deploy the config to the box ahead of time and use the CUSTOM_CONFIG option to point to that config file, or you "bootstrap it" via salt-call -l info --master salt.example.tld state.sls salt.minion using the salt formula.

Not being an expert on Salt transports... I wish multi-transport was an option, or maybe an option to ditch ZeroMQ and use a more reliably implemented queuing system like RabbitMQ.

darkpixel commented 3 months ago

I just noticed #61656...

pholbrook commented 2 months ago

> I think the dual master strategy could be helpful
>
> I don't think having two masters that "just can't even" will help. ;)

You have a fair point. I was thinking about the recently introduced master cluster concept. The write-up is brief; at least some of the information the master currently keeps in /var/cache/salt (at least on RHEL systems) is apparently moved into a shared file system. Presumably at least the load back from minions would be distributed, but given that the bug for me is that the master loses touch with the minions while the minions can still communicate with the master, I suspect you're right: this probably won't help.

It's also a lot more complicated. I saw a table showing that, at least in theory, you could handle up to 8,000 minions on a master if you had more resources. We're both far from that number.

I'm going to attempt to upgrade my production saltmaster and ~1,000 minions early tomorrow morning using the 3006.8 master. I'll report back on how it went.

darkpixel commented 2 months ago

> I'm going to attempt to upgrade my production saltmaster and ~1,000 minions early tomorrow morning using the 3006.8 master. I'll report back on how it went.

Good luck. My minions have been on 3007.1 for a while now, but I had to revert the master to 3006.8 because of all the issues... but this weekend I attempted to upgrade again. It was pretty disastrous.

The master basically becomes unresponsive and authentication times out after anywhere from a few minutes to 15 minutes. All automation stops, and I have to restart the master to get things working again.

I've tried with the default config (all the tweaking and tuning options removed) and with the various configuration tweaks suggested above and over the years.

From all the documentation, an 8-core 16 GB RAM server should easily be able to handle ~600 minions.

In 3006.8 it sorta appears to be able to handle it with config tweaks. The load average is typically around 4-5, and memory is almost entirely exhausted. The ReqServer MWorker processes are always consuming a lot of CPU time.

In 3007.1 the load average hovers around 3 and only ~1.5 GB of RAM is used, but the ReqServer MWorker processes sit mostly idle. Nothing interesting in the master logs--it's all the minions that keep logging timeouts to the master when sending returns or failing to authenticate.

I can pretty reliably bork the master by connecting to ~20 minions using clusterssh and simultaneously executing salt-call -l info state.highstate. Authentication times out for all of them and they fail to do their job. Restarting the master and re-executing the command will occasionally work, but leaves the master borked again.

darkpixel commented 2 months ago

Well, shoot. More inconclusive experimenting. I finally said "screw it--we'll see if this is a resource issue"... and I spun up a VM with 32 cores and 256 GB of memory.

I then CSSH'd out to 56 minions and ran salt-call -l debug state.highstate. Instead of 100% of them failing to authenticate to the master due to auth timeouts... only ~30% of them failed. The rest chugged away applying states... right up until 8 hung on the x509 state. After 15 minutes they hadn't budged, so I restarted the salt-minion in charge of handling certs. 5 minutes later, still nothing. So I CTRL+C'd the hung minions and tried again. They still hung. Funny thing is, all 56 minions needed certs renewed; it's just those 5 that didn't.

Then I decided to target ~500 minions. They all responded to pings. Then I tried a saltutil.refresh_pillar. That worked too. Then I tried a saltutil.sync_all and the master timed out. The CPU load was 0.1 on the master the entire time.

Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

And the minions frequently say:

salt.exceptions.SaltClientError: Unable to sign_in to master: Attempt to authenticate with the salt master failed with timeout error

or

2024-06-24 19:19:42,357 [salt.minion      ][minion][ERROR   ] Timeout encountered while sending {'id': 'USWQWITOFC01', 'cmd': '_minion_event', 'pretag': None, 'tok': b'\x0b\x05--snip---\xd0\x85T', 'data': 'ping', 'tag': 'minion_ping', '_stamp': '2024-06-25T02:18:53.372283', 'nonce': '4c550af2fd314128b35106e4d32162b9'} request

I tried a few more times and it kept timing out. I restarted it and it started working again.

Not wanting to spend $3.50/hr on a mostly working salt master, I scaled it back down to 16 GB RAM and 8 cores.

So it could be resources in some cases, but it still looks like there's a fundamental connectivity problem.

Plus no one wants to spend $3k/mo on a virtual machine for a salt master that manages ~500 minions. ;)

pholbrook commented 2 months ago

I completed my production upgrade of about 1,000 VMs from salt-3000 to 3007 without incident. As noted, however, I'm not running states on these VMs - just getting salt-minions to connect so I can run commands against them.

I did use the strategy of using the older saltmaster to do all the jobs in roughly three steps:

I ran these in larger batches of about 250-300 minions, and then I ran the cmd.run_bg commands in batches of 20-25 with `--batch-wait=15` to space them out.

Load average on my 3006.8 master never went above 2.5 on my 8 cpu VM.

We'll see what the future holds when I start upgrading my other salt environments where I actually depend on salt. Right now I'm going to stick to the 3006.8 master, but I'll be closely watching 3007 release notes to see if they claim to have fixed anything that sounds like this issue.

darkpixel commented 2 months ago

I think I'm seeing a slightly higher-than-average failure rate for minions (Windows laptops) that are connected via wireless. Not knowing much about the ZMQ transport, maybe this has something to do with latent minions or minions with occasional packet loss. It's difficult to know for sure, because if a minion doesn't respond it could be this bug, or it could be that the user turned off their laptop and went to bed. ;)

dwoz commented 1 month ago

> I think I'm seeing a slightly higher-than-average failure rate for minions (Windows laptops) that are connected via wireless. Not knowing much about the ZMQ transport, maybe this has something to do with latent minions or minions with occasional packet loss. It's difficult to know for sure, because if a minion doesn't respond it could be this bug, or it could be that the user turned off their laptop and went to bed. ;)

Do you have master_alive_interval set on those minions?

pholbrook commented 1 month ago

@darkpixel can respond with whether he is using master_alive_interval, but I thought I'd give an update on where I am with the update I gave on June 25th.

I'm still seeing minions get disconnected from my 3006.8 master. As noted, I'm using 3007.1 minions. I might go from 1,000+ minions connected down to a few hundred connected, and then later they come back.

A wrinkle in our environment of 1,000-plus minions each on two different masters is that we have a cron job that does a salt-run manage.down removekeys=True every hour. We have that job because in our environment VMs are destroyed and replaced instead of being patched. However, I think I've read that removing old keys will result in some kind of renegotiation between master and minion, so that might be pushing on this bug.
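As an aside, a hedged sketch of how that hourly job could also be expressed with Salt's master-side scheduler instead of cron, assuming runner scheduling in the master config behaves as documented; the job name is made up for illustration:

```yaml
# /etc/salt/master -- sketch only; equivalent of the hourly cron job described above
schedule:
  prune_down_minions:          # hypothetical job name
    function: manage.down      # runner function
    kwargs:
      removekeys: True
    hours: 1
```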

Answering @dwoz 's question, we are using master_alive_interval=120.

darkpixel commented 1 month ago

> Do you have master_alive_interval set on those minions?

Hey @dwoz. I have master_alive_interval set to 300 on the minions.
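For context, master_alive_interval is a documented minion option, expressed in seconds. A minimal sketch with the value mentioned above:

```yaml
# /etc/salt/minion -- sketch; how often the minion schedules a check that its master is still alive
master_alive_interval: 300
```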

darkpixel commented 2 days ago

For even more testing fun, I bought a monster server. I threw 64 cores and 512 GB RAM at Salt. I think I've played around with nearly every reasonable combination of these settings:

    keep_jobs_seconds: 3600
    con_cache: True
    zmq_filtering: True
    gather_job_timeout: 30
    timeout: 30
    sock_pool_size: 25
    worker_threads: 10
    zmq_backlog: 2500
    pub_hwm: 2500
    salt_event_pub_hwm: 128000
    event_publisher_pub_hwm: 64000

...including trying @amalaguti 's settings above.

I still typically get all minions disconnected within ~5 minutes of restarting the salt master:

```
-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 456
# of minions returned: 0
# of minions that did not return: 456
# of minions with errors: 0
-------------------------------------------
```
gregoster commented 1 day ago

We're seeing the same issue with 3007.1 on both the master and the minions.

Watching with tcpdump on the master, we see a whole bunch of the salt minions attempting to talk to the master on port 4505. They send the SYN packet but never get a SYN-ACK back, even though port 4505 was open and other systems were connected to the port. At this point there were about 1000 file descriptors in use by the salt master, but many of the sockets were in CLOSE_WAIT state.

We have LimitNOFILE set to 100000 in the salt-master.service file which should allow 100000 open sockets -- but it smells like LimitNOFILE is not being honored by ZMQ.

Digging in the source, I find:

```
grep -r ZMQ_MAX_SOCKETS *
lib/python3.10/site-packages/zmq/backend/cython/constant_enums.pxi: enum: ZMQ_MAX_SOCKETS
lib/python3.10/site-packages/zmq/include/zmq.h:#define ZMQ_MAX_SOCKETS 2
lib/python3.10/site-packages/zmq/include/zmq.h:#define ZMQ_MAX_SOCKETS_DFLT 1023
```

Where does salt explicitly set the socket limit for ZMQ? I can't find if/where it does that -- i.e. my thought is it's unlikely that salt is setting anything, and is ending up with 1023 for the ZMQ socket limit!

Perhaps related to this: https://github.com/zeromq/libzmq/issues/4280 but, again, I can't find where ZMQ_MAX_SOCKETS is ever set in the salt code.

After restarting the salt-master service, we see about 100 connections to 4505 and the same to 4506, and things work normally. I expect things will break once the number of file descriptors goes above 1023 again.

pholbrook commented 1 day ago

Referencing my comment here about our upgrade project.

TL;DR: We’ve been stable upgrading to 3006.9 for both minions and masters.

However, our 3006.8 master / 3007.0 minion environment mentioned above has not been stable. For example, in one environment I expect to see about 1200 minions, but when I log in and run `salt-run manage.up | wc -l`, I might see 1200, but I also might see 900, or 600, or some other number. If I restart the salt master, they all come back and remain connected for a while (I'm not sure how long). We've been living with it because we only use Salt in these environments for occasional command orchestration, and we could tolerate restarting the salt master.

Our experience with that previous upgrade project, combined with the continued instability, gave me serious heartburn about our upcoming upgrade project, which will upgrade ~4k minions across five different environments by November 1st.

For this project, we’re upgrading from an older version of Salt — one prior to the Python 3 cutover. These environments rely on Salt for all configuration, from kickstart to setting up Kubernetes, so it has to work. We’re using Red Hat 7 for all these servers.

Because of the issues we encountered in the previous upgrades I did in June, I decided to go with a 3006.9 master and 3006.9 minions.

Our first upgrade was in an environment with only 115 minions, so that didn’t really test anything. The upgrades went fine. The biggest hassle was dealing with Git repos and branches. We have one repo for the Salt states, another for the pillar, and we have a bunch of branches for each. We had previously elected to fork our repos from the older version of Salt instead of adding more branches. The conversion to Python 3 forced a few changes, but the biggest hassle was making sure all the branches were synced and up to date with all the changes we’d made after forking Salt.

Last week, we hit our first real production environment — this one has about 930 minions, all RHEL 7 VMs running under VMware ESX. Our salt master VM is modestly configured, the same as our older master, with 4 CPUs and 6 GB of RAM. The master is running in a Docker container.

Amazingly, I saw no issues while upgrading. I started by upgrading 400 unused VMs. I did these in batches of 10 from the old salt master, with a script that immediately shut down the minion, copied in a new minion config and Yum repo information, did a yum upgrade salt-3006.9, started the minion, and then ran saltutil.sync_all.

This was all done in a shell script started under nohup by the old salt master. I think it took about 10-15 minutes to upgrade 400 minions. The load average on the master got as high as 5-6, but memory and CPU usage were always under control. I upgraded the rest without incident and got up to 906 minions. Because of the instability we’d seen previously, I set up an hourly cron job to run salt-run manage.up | wc -l.

The number remained stable for 5 days. We saw 13 minions drop last night — no errors on the master, just a lost connection error on the minion. We had not set the master_retries option for the minions, which means our minions would retry once and then shut down. My best guess is that we had a network glitch; the minions that dropped were on a different subnet. I'll have to set master_retries: -1 in the future to keep them retrying.

This all bodes well. We have about 1,000 minions in our next environment, and we’ve already done almost that many without issues. Our last environment will be a bit trickier; it currently has about 2,000 minions, but we want to split that in half. I expect some challenges around that, but hopefully, they’ll be related to the specifics of our environments and not with Salt itself.

gregorg commented 1 day ago

Hi, I've been following this issue from the beginning, stuck on version 3006.4. However, I got some issues with minions that lost connection with the master. After a lot of debugging I found the cause: one grain was too fat for the ZMQ transport: status.all_status. I disabled this grain and split it into status.* grains, and now it has been 4 weeks with no minion losing connection to the master.

I won't upgrade to 3007.x until this issue is fixed; hope this helps, though.

pholbrook commented 23 hours ago

> one grain was too fat for the ZMQ transport: status.all_status. I disabled this grain and split it into status.* grains, and now it has been 4 weeks with no minion losing connection to the master.

I've never used this. What do you mean by disabled? Do you mean you used to be pulling it, and then you stopped? Or did you do something in configs?

gregorg commented 22 hours ago

> I've never used this. What do you mean by disabled? Do you mean you used to be pulling it, and then you stopped? Or did you do something in configs?

Related to issue https://github.com/saltstack/salt/issues/66562 I stopped pulling it.