saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[BUG] Getting the error "An extra return was detected from minion" with Multimaster Syndic setup post upgrade to 3006.3 one dir #65516

anandarajan-vivekanandam-agilysys opened this issue 11 months ago (status: Open)

anandarajan-vivekanandam-agilysys commented 11 months ago

Description: We are using an M.O.M (Master of Masters) and multimaster syndic setup. We were able to scale up to 20,000 minions with the minion swarm without any issues when we were using Salt version 3004.2.

We recently upgraded the Salt master and syndics to the 3006.3 onedir version. Post upgrade, the M.O.M is getting spammed with the error message "salt-master[22011]: [ERROR ] An extra return was detected from minion XXXXXXX-XXXXXX-XXXXXX, please verify the minion, this could be a replay attack".

This is affecting performance on the master, and eventually the syndic starts erroring as well. Syndic log excerpt:

Nov 6 10:43:26 mi-syndic-master-01-perf salt-syndic: [ERROR ] Unable to call _return_pub_multi on cnc-perf-mi.hospitalityrevolution.com, trying another...
Nov 6 10:43:41 mi-syndic-master-01-perf salt-syndic: [WARNING ] The minion failed to return the job information for job 20231106102101948872. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Nov 6 10:43:41 mi-syndic-master-01-perf salt-syndic: [ERROR ] Unable to call _return_pub_multi on cnc-perf-mi.hospitalityrevolution.com, trying another...
Nov 6 10:44:04 mi-syndic-master-01-perf salt-syndic: [WARNING ] The minion failed to return the job information for job 20231106102101948872. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Nov 6 10:44:05 mi-syndic-master-01-perf salt-syndic: [ERROR ] Unable to call _return_pub_multi on cnc-perf-mi.hospitalityrevolution.com, trying another...

Moreover, even after shutting down, we are seeing the same message spamming the logs, and the minion returns are captured off the Salt event bus with event_bus.get_event(tag='salt/job/*/ret/*', match_type='fnmatch').
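For context, here is a minimal sketch of what such a watcher on the master event bus might look like. It assumes salt.utils.event with a default master config path; the config path and the printed fields are illustrative, not taken from the reporter's setup.

import salt.config
import salt.utils.event

# Load the master config (path is an assumption; adjust for your install).
opts = salt.config.client_config("/etc/salt/master")

# Attach to the master event bus.
event_bus = salt.utils.event.get_event(
    "master",
    sock_dir=opts["sock_dir"],
    opts=opts,
    listen=True,
)

# Print every job return that comes across the bus.
while True:
    evt = event_bus.get_event(tag="salt/job/*/ret/*", match_type="fnmatch", full=True)
    if evt:
        print(evt["tag"], evt["data"].get("id"))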

Setup: We have a single M.O.M and 4 syndic masters (all the syndic masters share the same pub key since they are set up in multimaster mode). We also have a minion swarm set up for scale-testing purposes.

The M.O.M and syndic masters use Salt version 3006.3 (onedir). The swarm minions use salt-minion version 3004.2. The minion swarm Python code is attached: MinionSwarm.zip


Steps to Reproduce the behavior: Use the minion swarm with the M.O.M and multimaster syndic setup.

Expected behavior: Minions should connect without issues.

Screenshots: (screenshot attached in the original issue; not reproduced here)

Versions Report (M.O.M):

Salt: 3006.3
Python: 3.10.13 (main, Sep 6 2023, 02:11:27) [GCC 11.2.0]

Dependency Versions: cffi: 1.14.6, cherrypy: unknown, dateutil: 2.8.1, docker-py: Not Installed, gitdb: Not Installed, gitpython: Not Installed, Jinja2: 3.1.2, libgit2: Not Installed, looseversion: 1.0.2, M2Crypto: Not Installed, Mako: Not Installed, msgpack: 1.0.2, msgpack-pure: Not Installed, mysql-python: Not Installed, packaging: 22.0, pycparser: 2.21, pycrypto: 3.16.0, pycryptodome: Not Installed, pygit2: Not Installed, python-gnupg: 0.4.8, PyYAML: 6.0.1, PyZMQ: 23.2.0, relenv: 0.13.10, smmap: Not Installed, timelib: 0.2.4, Tornado: 4.5.3, ZMQ: 4.3.4

System Versions: dist: centos 7.9.2009 Core, locale: utf-8, machine: x86_64, release: 3.10.0-1160.88.1.el7.x86_64, system: Linux, version: CentOS Linux 7.9.2009 Core

Versions Report (Syndic Master):

Salt: 3006.3
Python: 3.10.13 (main, Sep 6 2023, 02:11:27) [GCC 11.2.0]

Dependency Versions: cffi: 1.14.6, cherrypy: unknown, dateutil: 2.8.1, docker-py: Not Installed, gitdb: Not Installed, gitpython: Not Installed, Jinja2: 3.1.2, libgit2: Not Installed, looseversion: 1.0.2, M2Crypto: Not Installed, Mako: Not Installed, msgpack: 1.0.2, msgpack-pure: Not Installed, mysql-python: Not Installed, packaging: 22.0, pycparser: 2.21, pycrypto: 3.16.0, pycryptodome: Not Installed, pygit2: Not Installed, python-gnupg: 0.4.8, PyYAML: 6.0.1, PyZMQ: 23.2.0, relenv: 0.13.10, smmap: Not Installed, timelib: 0.2.4, Tornado: 4.5.3, ZMQ: 4.3.4

System Versions: dist: centos 7.9.2009 Core, locale: utf-8, machine: x86_64, release: 3.10.0-1160.95.1.el7.x86_64, system: Linux, version: CentOS Linux 7.9.2009 Core


OrangeDog commented 11 months ago

See also #65301

anandarajan-vivekanandam-agilysys commented 11 months ago

I have narrowed the problem down to the syndic. I tried to replicate the same issue with a direct master-to-minion approach; the issue did not occur, and I managed to scale up to 15,000 minions using the minion swarm. In the M.O.M and multimaster syndic approach, the issue occurs with both the multimaster (syndic master) and single-master setups.

anandarajan-vivekanandam-agilysys commented 11 months ago

The same issue happened with Salt version 3004.2, but only started occurring beyond 20,000 minions. With 3006.3, the error starts at 2,500 minions. Disabling the master job cache fixed the problem to some extent, but we are still not able to scale beyond 10,000 minions with 3006.3, and we started observing master-overloaded error messages in the syndic log. We did not observe this behavior with 3004.2, where we managed to scale up to 25,000 minions.

Guillaume-COEUGNET commented 11 months ago

Hello,

Using the same version, 3006.3, I can confirm this bug too. Launching state.highstate on several servers returns:

[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uydarhai02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02
[DEBUG ] jid 20231115143559173088 return from uzdact1d02

This is then logged as:

2023-11-15 15:36:24,079 [salt.loaded.int.returner.local_cache:148 ][ERROR ][2162] An extra return was detected from minion uzdact1d02, please verify the minion, this could be a replay attack

tomsolpa commented 11 months ago

Same issue for us since version 3006.3.

We see a lot of extra returns from the syndics to the master of masters when more than ~1200 jobs are running simultaneously.

lomeroe commented 10 months ago

Same issue here. I actually didn't see it manifest itself until the MoMs were updated to 3006 (our syndics were at 3006 for a few weeks before the MoMs were upgraded).

Once a return is "delayed", the syndic seems to continually resend the "delayed" data even after the master has received it (the data seen coming across the master event bus over and over is already present in the master job cache after the first time it is received).

lomeroe commented 10 months ago

Additionally, I added some extra logging in _return_pub_syndic in minion.py:

                elif future.exception():
                    log.error("LOMEROE - future exception  {}".format(future.exception()))
                    # Previous execution on this master returned an error
                    log.error(
                        "Unable to call %s on %s, trying another...", func, master
                    )

and the exception being thrown is "Nonce verification error", so perhaps this is related to #65114. Unfortunately, I can't upgrade to 3006.5 (due to other issues) to see whether it corrects this in my environment. I'm curious whether any of the others in this thread have tested against 3006.5.

Guillaume-COEUGNET commented 9 months ago

Hello,

@lomeroe We're still using 3006.3; upgrading our infrastructure is time-consuming. @dwoz Is there any update on this bug, please?

Regards.

lomeroe commented 9 months ago

I haven't seen the "Nonce verification error" exception since I posted that comment, so that may have been a red herring or a one-off event. That said, I went ahead and patched our MoMs and syndics with the fixes from #65247, which are mentioned as the fix in #65114, just to test it out, but it didn't change anything with regard to this problem...

Edit to add: the exception being thrown in _return_pub_syndic from my additional logging has just been "Message Timed Out"

Guillaume-COEUGNET commented 9 months ago

Hello,

We have several worker threads on the MoM to manage our 4000 minions. When a highstate is launched, each worker starts reporting its job returns to the job cache. It seems that several workers try to report the same job for the same minion:

openat(AT_FDCWD, "/var/cache/salt/master/jobs/d5/a50cb9e1b6709a1e6ead2cffcdcfaeb7c5585a6b85c8d9b6b6bfd09b5ddbdd/.minions.uzgcicpp99.p", O_RDONLY|O_CLOEXEC) = 152
fstat(152, {st_mode=S_IFREG|0644, st_size=1268, ...}) = 0
ioctl(152, TCGETS, 0x7ffc0e3d2f00)      = -1 ENOTTY (Inappropriate ioctl for device)
lseek(152, 0, SEEK_CUR)                 = 0
fcntl(152, F_GETFD)                     = 0x1 (flags FD_CLOEXEC)
fcntl(152, F_SETFD, FD_CLOEXEC)         = 0
lseek(152, 0, SEEK_CUR)                 = 0
fstat(152, {st_mode=S_IFREG|0644, st_size=1268, ...}) = 0
read(152, "\334\0s\252uzmupdid01\252uzcoimwp01\252uzcopm"..., 1269) = 1268
read(152, "", 1)                        = 0
close(152)                              = 0
stat("/var/cache/salt/master/jobs/d5/a50cb9e1b6709a1e6ead2cffcdcfaeb7c5585a6b85c8d9b6b6bfd09b5ddbdd/nocache", 0x7ffc0e3d3080) = -1 ENOENT (No such file or directory)
stat("/var/cache/salt/master/jobs/d5/a50cb9e1b6709a1e6ead2cffcdcfaeb7c5585a6b85c8d9b6b6bfd09b5ddbdd", {st_mode=S_IFDIR|0755, st_size=72540, ...}) = 0
mkdir("/var/cache/salt/master/jobs/d5/a50cb9e1b6709a1e6ead2cffcdcfaeb7c5585a6b85c8d9b6b6bfd09b5ddbdd/maoi41", 0777) = -1 EEXIST (File exists)
getpid()                                = 3041093
write(2, "[ERROR   ] An extra return was d"..., 116) = 116

Maybe this will help.
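For reference, here is a minimal sketch, not Salt's actual code, of the kind of duplicate-return check that produces this message when it relies on mkdir() atomicity, which is consistent with the mkdir ... EEXIST followed by the error write in the strace above. The function and argument names are illustrative.

import errno
import logging
import os

log = logging.getLogger(__name__)

def record_return(jid_dir: str, minion_id: str) -> bool:
    """Return True only for the first return from `minion_id` for this job."""
    hn_dir = os.path.join(jid_dir, minion_id)
    try:
        os.mkdir(hn_dir)  # atomic: only the first writer for this minion succeeds
        return True
    except OSError as err:
        if err.errno == errno.EEXIST:
            # A directory already exists for this minion's return, so any
            # further return for the same jid is flagged as "extra".
            log.error(
                "An extra return was detected from minion %s, "
                "please verify the minion, this could be a replay attack",
                minion_id,
            )
            return False
        raise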

lomeroe commented 7 months ago

We attempted raising return_retry_timer and return_retry_timer_max to high values (480 and 600 respectively in our environment), which mildly helped this issue but did not make it go away completely (looking back at our syndic config history, we had also raised these when moving from 2019.2 to 3003).

In the end, the problem seemed to be that the syndic reads every return off the event bus during its polling interval and attempts to send them all to the MoM in one large return. This seemed to be the crux of the issue with simultaneous jobs: a single, giant return (one return to the MoM containing hundreds of returns from the syndic) was flooding the MoM event bus and crowding out everything else. If the MoM could not handle it quickly enough, the syndic would simply resend it (causing the "extra return" log events), exacerbating the load on the MoM's event bus; subsequent returns from the syndic to the MoM were also blocked until the large return finally succeeded or the syndic was restarted. Perhaps throwing more CPU at the MoM, allowing more worker threads, would have helped as well, but we were determined not to just throw more hardware at the problem (edit to add: I also don't know that more worker threads would matter; being one large return, it seems likely it is confined to a single worker thread).

Modifying our syndics with the following patch, which sends individual returns up to the MoM instead of grouping them together, has yielded good results. Perhaps some middle ground could be found and returns could be sent in a configurable number of chunks (a rough sketch of that idea follows), but in our environment we have not seen errors or issues with returns in the time we have been running this patch. I'd be interested to see how it performs in the minion swarm setup used by @anandarajan-vivekanandam-agilysys.
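A rough sketch of that middle ground, sending delayed returns in bounded chunks rather than one giant payload; the names forward_returns, send_chunk, and chunk_size are hypothetical and not part of Salt.

from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def forward_returns(returns, send_chunk, chunk_size=100):
    """Send `returns` upstream in chunks; return whatever still needs retrying."""
    still_delayed = []
    for chunk in chunked(returns, chunk_size):
        # send_chunk would wrap something like _return_pub_syndic and
        # report success/failure for the whole chunk.
        if not send_chunk(chunk):
            still_delayed.extend(chunk)  # retry on the next polling interval
    return still_delayed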

Here's the patch, against minion.py from 3006.7:

--- minion.py.3006_7    2024-03-06 08:42:42.958529166 -0600
+++ minion.py.3006_7.updated    2024-03-06 08:44:43.441609867 -0600
@@ -3662,7 +3662,8 @@
                     self._mark_master_dead(master)
                     del self.pub_futures[master]
                     # Add not sent data to the delayed list and try the next master
-                    self.delayed.extend(data)
+                    if data not in self.delayed:
+                        self.delayed.append(data)
                     continue
             future = getattr(syndic_future.result(), func)(
                 values, "_syndic_return", timeout=self._return_retry_timer(), sync=False
@@ -3805,13 +3806,19 @@
                 },
             )
         if self.delayed:
-            res = self._return_pub_syndic(self.delayed)
-            if res:
-                self.delayed = []
+            nolonger_delayed = []
+            for c in range(0, len(self.delayed)):
+                res = self._return_pub_syndic(self.delayed[c])
+                if res:
+                    nolonger_delayed.append(c)
+            for c in sorted(nolonger_delayed, reverse=True):  # pop from the end so earlier indices stay valid
+                self.delayed.pop(c)
         for master in list(self.job_rets.keys()):
-            values = list(self.job_rets[master].values())
-            res = self._return_pub_syndic(values, master_id=master)
-            if res:
+            for return_key in list(self.job_rets[master].keys()):
+                res = self._return_pub_syndic(self.job_rets[master][return_key], master_id=master)
+                if res:
+                    del self.job_rets[master][return_key]
+            if not self.job_rets[master]:
                 del self.job_rets[master]

     def destroy(self):
anandarajan-vivekanandam-agilysys commented 7 months ago

@lomeroe We have upgraded the Salt versions on our masters (MoM and syndics) to 3006.5, and we are no longer seeing the "An extra return was detected" error in the master log. This may be mainly because we moved away from pinging every minion from the Master of Masters to gather live minion status. Instead, we configured a beacon on the minions and started capturing the beacon events with an engine on the syndic masters. We have also restricted async job execution from the masters to 500. This approach has eliminated the extra-return problem, and the infrastructure is now stable and can hold up to 80K minions at a time.
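As a rough illustration of that approach (not their actual code), a minimal custom engine for the syndic masters could watch beacon events on the event bus along these lines; the module name, the interval argument, and the data fields used here are assumptions.

import logging

import salt.utils.event

log = logging.getLogger(__name__)

def start(interval=5):
    """Entry point Salt calls when this engine is enabled on a (syndic) master.

    __opts__ is injected by Salt's loader when the module runs as an engine.
    """
    event_bus = salt.utils.event.get_event(
        "master",
        sock_dir=__opts__["sock_dir"],
        opts=__opts__,
        listen=True,
    )
    while True:
        evt = event_bus.get_event(
            wait=interval, tag="salt/beacon/*", match_type="fnmatch", full=True
        )
        if not evt:
            continue
        # Record/forward the heartbeat; the "id" field is an assumption about
        # the beacon payload and may differ per beacon.
        log.info("Beacon event from %s (%s)", evt["data"].get("id"), evt["tag"])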

We could not upgrade our minions beyond 3004.2 because we are seeing frequent timeouts with the latest salt-minion versions, so that is what I can replicate in our swarm for now.