saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0
14.18k stars 5.48k forks source link

[BUG] Uncaught exception in zmqstream callback, master failed to authenticate messages and minions not re-connecting to master, master getting duplicate minions ID #66323

Open amalaguti opened 7 months ago

amalaguti commented 7 months ago

Description I started to notice this recently on my lab environment,

after starting master service due it was left down for some time (a bunch of hours) (over the weekend, or over night due some maintenance, it's a lab server) The minions (Windows, the problem does not seem to affect the minion running on the master) don't seem to establish a proper connection to the server and when running test.ping from the master, it behaves erraticaly showing non consistent responses, and even weirder is that the same minion ID is shown multiple times in the output of the test.ping command The response may vary True or False

A following test.ping may get a different behavior. Running salt-call test.ping from the minions seem to be fine, but the issue does not go away till restarting the salt-minion on Windows so it re-establish a proper connection to the master again

[root@vesselsim ~]# salt-key
Accepted Keys:
vesselsim-master-minion
vesselsim-win-ems-1
vesselsim-win-ems-2

[root@vesselsim ~]# salt \* test.ping
vesselsim-master-minion:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-1:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240408125122291794
ERROR: Minions returned with non-zero exit code

[root@vesselsim salt]# salt \* test.ping
vesselsim-master-minion:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-1:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240408130829457932
ERROR: Minions returned with non-zero exit code

[root@vesselsim salt]# salt \* test.ping
vesselsim-master-minion:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-2:
    True
vesselsim-win-ems-1:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

Master:

[DEBUG   ] Sending event: tag = salt/auth; data = {'result': True, 'act': 'accept', 'id': 'vesselsim-win-ems-1', 'pub': '-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA3dgpz3COQw9TO39uto9w\nSVx8Bky568Gqngh4cXuuuokpaiUbXsfPtbym4uE3FDOwcvtwH/j0lwKhgiES2Doa\n0mkA4RT3Y/F0/itrZe9K7ulK9RV6IZeml8tWEeeAhOI/az/nYuHCedvE0+9UgnDh\nBnTSGGKR3B6s9EVHnCMLCsdQwY+LmuqcQvWYm7/HbM5zVAg/24iob2RbonC3dGGq\neRdhatlUgmm9UGJPs8Omru0jUdXjs1Ni2AKhtOqkej3AaRBNDS4FNjgjmxs3i8u9\nmtUc6kba0Gh8rF2mLnYdm9pF2aTxKC2UpMB1HryFGHieJKwHYCR62F+J2DjH2cAA\nnwIDAQAB\n-----END PUBLIC KEY-----', '_stamp': '2024-04-08T12:51:37.240024'}
[DEBUG   ] Gathering reactors for tag salt/auth
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] salt.crypt.sign_message: Signing message.
[DEBUG   ] Failed to authenticate message
[DEBUG   ] Minion failed to auth to master. Since the payload is encrypted, it is not known which minion failed to authenticate. It is likely that this is a transient failure due to the master rotating its public key.
[DEBUG   ] Failed to authenticate message
[DEBUG   ] Minion failed to auth to master. Since the payload is encrypted, it is not known which minion failed to authenticate. It is likely that this is a transient failure due to the master rotating its public key.

Minion:

2024-04-08 13:11:39,933 [tornado.general  :590 ][ERROR   ][4004] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 584, in _run_callback
    f = callback(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 628, in handle_reply
    future.set_result(data)
AttributeError: 'NoneType' object has no attribute 'set_result'
2024-04-08 13:11:39,933 [tornado.general  :645 ][ERROR   ][4004] Uncaught exception in zmqstream callback
Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 634, in _handle_events
    self._handle_recv()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 663, in _handle_recv
    self._run_callback(callback, msg)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 584, in _run_callback
    f = callback(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 628, in handle_reply
    future.set_result(data)
AttributeError: 'NoneType' object has no attribute 'set_result'
2024-04-08 13:11:39,933 [tornado.application:640 ][ERROR   ][4004] Exception in callback (<zmq.Socket(zmq.REQ) at 0x232fe19e7a0>, <function wrap.<locals>.null_wrapper at 0x00000232FE1B2E60>)
Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\ioloop.py", line 890, in start
    handler_func(fd_obj, events)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 634, in _handle_events
    self._handle_recv()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 663, in _handle_recv
    self._run_callback(callback, msg)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 584, in _run_callback
    f = callback(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 628, in handle_reply
    future.set_result(data)
AttributeError: 'NoneType' object has no attribute 'set_result'

Setup Please be as specific as possible and give set-up details. 1 master - salt 3006.4 (Sulfur) 1 minion running on master - 3006.4 2 Windows minion - 3006.4

Steps to Reproduce the behavior (Include debug logs if possible and relevant)

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ```yaml PASTE HERE ```

Additional context Add any other context about the problem here.

amalaguti commented 7 months ago

Update, Still not very deterministic, but I was able to reproduce it twice, by keeping the salt-master service down for about 30 minutes, then starting the master and twice I've seen one of the minions to fail the same way

I see this on the master which seems to be a "normal" debug message

[DEBUG   ] Failed to authenticate message
[DEBUG   ] Minion failed to auth to master. Since the payload is encrypted, it is not known which minion failed to authenticate. It is likely that this is a transient failure due to the master rotating its public key.
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] Sending event: tag = minion_ping; data = {'id': 'vesselsim-win-ems-1', 'cmd': '_minion_event', 'pretag': None, 'data': 'ping', 'tag': 'minion_ping', '_stamp': '2024-04-08T15:18:22.457113'}

But on the minion:

2024-04-08 15:14:40,442 [salt.crypt       :1486][DEBUG   ][4660] Failed to authenticate message
2024-04-08 15:14:40,457 [tornado.application:353 ][ERROR   ][4660] Future <salt.ext.tornado.concurrent.Future object at 0x00000187D74D7EE0> exception was never retrieved: Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 294, in _crypted_transfer
    ret = yield _do_transfer()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 284, in _do_transfer
    data = self.auth.crypticle.loads(data, raw, nonce=nonce)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\crypt.py", line 1522, in loads
    data = self.decrypt(data)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\crypt.py", line 1487, in decrypt
    raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\minion.py", line 2727, in handle_event
    yield _minion.req_channel.send(
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 341, in send
    ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 298, in _crypted_transfer
    ret = yield _do_transfer()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 284, in _do_transfer
    data = self.auth.crypticle.loads(data, raw, nonce=nonce)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\crypt.py", line 1522, in loads
    data = self.decrypt(data)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\crypt.py", line 1487, in decrypt
    raise AuthenticationError("message authentication failed")
salt.exceptions.AuthenticationError: message authentication failed
2024-04-08 15:14:40,457 [tornado.general  :590 ][ERROR   ][4660] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 584, in _run_callback
    f = callback(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 628, in handle_reply
    future.set_result(data)
AttributeError: 'NoneType' object has no attribute 'set_result'
2024-04-08 15:14:40,457 [tornado.general  :645 ][ERROR   ][4660] Uncaught exception in zmqstream callback
Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 634, in _handle_events
    self._handle_recv()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 663, in _handle_recv
    self._run_callback(callback, msg)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 584, in _run_callback
    f = callback(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 628, in handle_reply
    future.set_result(data)
AttributeError: 'NoneType' object has no attribute 'set_result'
2024-04-08 15:14:40,457 [tornado.application:640 ][ERROR   ][4660] Exception in callback (<zmq.Socket(zmq.REQ) at 0x187d5dee740>, <function wrap.<locals>.null_wrapper at 0x00000187D5E02EF0>)
Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\ioloop.py", line 890, in start
    handler_func(fd_obj, events)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 634, in _handle_events
    self._handle_recv()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 663, in _handle_recv
    self._run_callback(callback, msg)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\zmq\eventloop\zmqstream.py", line 584, in _run_callback
    f = callback(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 628, in handle_reply
    future.set_result(data)
AttributeError: 'NoneType' object has no attribute 'set_result'

The difference in this minion is the following settings, but I'm not sure if it is related

ping_interval: 5
auth_safemode: True
dwoz commented 7 months ago

@amalaguti Please provide the output of salt --versions-report.

amalaguti commented 7 months ago

@amalaguti Please provide the output of salt --versions-report.

(MASTER)

salt --versions-report

Salt Version: Salt: 3006.4

Python Version: Python: 3.10.13 (main, Oct 4 2023, 21:54:22) [GCC 11.2.0]

Dependency Versions: cffi: 1.14.6 cherrypy: unknown dateutil: 2.8.1 docker-py: Not Installed gitdb: Not Installed gitpython: Not Installed Jinja2: 3.1.2 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.9.8 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 6.0.1 PyZMQ: 23.2.0 relenv: 0.13.12 smmap: Not Installed timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4

System Versions: dist: rhel 9.1 Plow locale: utf-8 machine: x86_64 release: 5.14.0-162.23.1.el9_1.x86_64 system: Linux version: Red Hat Enterprise Linux 9.1 Plow

amalaguti commented 7 months ago

Windows minion (note: now it's 3006.6 due I was testing if it's fixed in newer versions, but should be 3006.4)

Salt Version: Salt: 3006.6

Python Version: Python: 3.10.13 (heads/main:5600dd9, Nov 15 2023, 04:42:06) [MSC v.1937 64 bit (AMD64)]

Dependency Versions: cffi: 1.14.6 cherrypy: 18.6.1 dateutil: 2.8.1 docker-py: Not Installed gitdb: 4.0.7 gitpython: Not Installed Jinja2: 3.1.3 libgit2: Not Installed looseversion: 1.0.2 M2Crypto: Not Installed Mako: Not Installed msgpack: 1.0.2 msgpack-pure: Not Installed mysql-python: Not Installed packaging: 22.0 pycparser: 2.21 pycrypto: Not Installed pycryptodome: 3.19.1 pygit2: Not Installed python-gnupg: 0.4.8 PyYAML: 6.0.1 PyZMQ: 25.0.2 relenv: 0.14.2 smmap: 4.0.0 timelib: 0.2.4 Tornado: 4.5.3 ZMQ: 4.3.4

System Versions: dist: locale: cp1252 machine: AMD64 release: 2019Server system: Windows version: 2019Server 10.0.17763 SP0 Multiprocessor Free

xorinzor commented 7 months ago

Looks very similar to my issue that I created 2 weeks ago #66288 @dwoz can you have a look at that issue too? Have yet to receive a response from anyone of the salt team.

amalaguti commented 7 months ago

Looks very similar to my issue that I created 2 weeks ago #66288 @dwoz can you have a look at that issue too? Have yet to receive a response from anyone of the salt team.

Agreed, looks similar, I just wasn't sure and found it after I've created it this one