saltstack / salt


[BUG] [3007.1] TCP Transport stuck in a loop connecting to dead master #66966

Open AppCrashExpress opened 1 month ago

AppCrashExpress commented 1 month ago

Description

Hello!

We have updated Saltstack to 3007.1 (hooray!).

Our masters are deployed in such a way that they are fairly ephemeral: they may go offline or have their DNS records changed at any time. However, if a minion picks a master that happens to be an outdated record, or one that is simply offline, it gets stuck in a loop and never tries to switch to another:

[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[TRACE   ] ReqChannel send clear load={'cmd': '_auth', 'id': 'minion_id', 'nonce': '0123456789abcdef0123456789abcdef', 'token': b'bytestring', 'pub': 'pubkey'}
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
...

That goes on and on.
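For contrast, here is roughly the failover shape we expect, as a minimal sketch in plain Python (illustrative only, not Salt's actual code; the master list and TRIES_PER_MASTER simply mirror the config in the Setup section below): a bounded number of attempts per master, then move on to the next one.

```python
# Minimal sketch of the expected failover (illustrative, not Salt's
# implementation): give each master a bounded number of attempts, then
# move on instead of retrying the same address forever.
import socket

MASTERS = ["127.0.0.1", "salt-master1.example.org", "salt-master2.example.org"]
RET_PORT = 4506  # default master ret port
TRIES_PER_MASTER = 2  # mirrors auth_tries in the config below


def pick_reachable_master():
    for master in MASTERS:
        for attempt in range(1, TRIES_PER_MASTER + 1):
            try:
                with socket.create_connection((master, RET_PORT), timeout=3):
                    return master
            except OSError:
                print(f"{master}: attempt {attempt} failed")
        print(f"Master {master} could not be reached, trying next master (if any)")
    raise RuntimeError("No master could be reached")
```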

This was tested at commit https://github.com/saltstack/salt/commit/2b266935e42780ebd86b3a144d6897f0ae174b2b (I think? I lost track).

Setup

This configuration lists localhost, but no master should actually be running on localhost. The point of the configuration is to emulate a dead master, so the minion should eventually skip 127.0.0.1.

master:
  - 127.0.0.1
  - salt-master1.example.org
  - salt-master2.example.org
master_type: str
random_master: False # Force to be in this order
master_alive_interval: 1
retry_dns: 0 # Failover should try masters on failure

transport: tcp
auth_timeout: 3
auth_tries: 2
acceptance_wait_time: 5
random_reauth_delay: 0
ping_interval: 5
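To keep the repro self-contained without touching DNS, one option (purely an assumption of mine, not part of Salt or of our real setup) is to run a throwaway listener on the ret port that accepts connections and immediately drops them; that yields the same StreamClosedError('Stream is closed') loop, while having nothing listening at all yields the connection-refused variant of the same loop.

```python
# Throwaway stand-in for a dead/broken master (an assumption for local
# testing, not part of Salt): accept TCP connections on the ret port and
# close them immediately, so the minion sees the stream drop right away.
import asyncio


async def slam_door(reader, writer):
    writer.close()  # hang up without speaking the Salt protocol
    await writer.wait_closed()


async def main():
    server = await asyncio.start_server(slam_door, "127.0.0.1", 4506)
    async with server:
        await server.serve_forever()


if __name__ == "__main__":
    asyncio.run(main())
```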

Steps to Reproduce the behavior

  1. Configure the TCP transport on both master and minion. The first entry in the minion's master list should point to a dead/nonexistent master, and master randomization should be disabled (for determinism);
  2. On the minion, run salt-call -l debug state.test;
  3. The minion will attempt to connect to the dead master first and will continue to do so forever:
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    [WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
    ...

Expected behavior

The minion should instead eventually give up and try the next master in the list:

[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out')
[DEBUG   ] Closing AsyncReqChannel instance
[INFO    ] Master 127.0.0.1 could not be reached, trying next master (if any)
[DEBUG   ] "salt-master1.example.org" Not an IP address? Assuming it is a hostname.
[WARNING ] Master ip address changed from 127.0.0.1 to 192.168.0.1
[DEBUG   ] Master URI: tcp://192.168.0.1:4506
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key

Screenshots

Most likely inapplicable (big wall of warnings).

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)

```yaml
Salt Version:
          Salt: 3007.1

Python Version:
        Python: 3.12.3 (main, Jun 3 2024, 10:10:13) [Clang 16.0.6 ]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: Not Installed
      dateutil: 2.9.0.post0
     docker-py: 7.1.0
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.4
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: 0.38.0
          Mako: 1.3.5
       msgpack: 1.0.8
  msgpack-pure: Not Installed
  mysql-python: 1.4.6
     packaging: 21.3
     pycparser: 2.22
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 5.4.1
         PyZMQ: 25.1.2
        relenv: Not Installed
         smmap: 5.0.1
       timelib: 0.3.0
       Tornado: 6.4.1
           ZMQ: 4.1.2

Salt Package Information:
  Package Type: Not Installed

System Versions:
          dist: ubuntu 22.04 jammy
        locale: utf-8
       machine: x86_64
       release: 6.8.0-40-generic
        system: Linux
       version: Ubuntu 22.04 jammy
```

Additional context

This issue is caused by the TCP transport attempting to connect before setting up the handlers for message timeouts: https://github.com/saltstack/salt/blob/2b266935e42780ebd86b3a144d6897f0ae174b2b/salt/transport/tcp.py#L1849-L1857
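Distilled into plain asyncio (a simplified illustration, not the real salt.transport.tcp code), the ordering problem looks like this: send() first awaits a connect loop that retries the same address forever, and the message future and timeout that would eventually raise SaltReqTimeoutError are only created after that await returns, so against a dead master they are never armed.

```python
# Simplified illustration of the ordering bug (plain asyncio, not the real
# salt.transport.tcp code): the timeout that would raise SaltReqTimeoutError
# is only armed *after* connect() returns -- and connect() never returns.
import asyncio


async def connect(host="127.0.0.1", port=4506):
    while True:  # stands in for the message client's reconnect loop
        try:
            return await asyncio.open_connection(host, port)
        except OSError:
            print("will reconnect in 1 seconds")  # matches the log above
            await asyncio.sleep(1)


async def send(load, timeout=60):
    await connect()  # blocks forever while the master is dead ...
    # ... so we never reach the point where the real send() builds the
    # message future and schedules its timeout handler.
```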

The solution seems to be as simple as removing these lines, since the TCP transport tries connecting again after the proper timeout handlers are set up:

     async def send(self, load, timeout=60):
-        await self.connect()
         if self._closing:
             raise ClosingError()
-        while not self._stream:
-            await asyncio.sleep(0.03)
         message_id = self._message_id()
         header = {"mid": message_id}
         future = tornado.concurrent.Future()

However, these lines are probably here for a reason? Or is it safe to remove them? They weren't present in the 3005.1 MessageClient, which is confusing :( https://github.com/saltstack/salt/blob/6226b9cb582daf675c7b49e59af16dc5d338d10d/salt/transport/tcp.py#L753-L757
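If removing them outright turns out to be unsafe, another option (just a sketch of an idea on my part, not a reviewed patch) would be to keep the eager connect but bound it by the caller's timeout, so a dead master still surfaces as a SaltReqTimeoutError and failover can kick in. This assumes asyncio, ClosingError, and tornado are in scope in salt/transport/tcp.py as in the snippets above, and that SaltReqTimeoutError is imported from salt.exceptions:

```python
# Sketch only (my assumption, not a reviewed patch): bound the eager
# connect by the caller's timeout instead of deleting it, so send() can
# fail over instead of spinning on a dead master forever.
async def send(self, load, timeout=60):
    if self._closing:
        raise ClosingError()
    try:
        # connect() retries internally; give it at most `timeout` seconds.
        await asyncio.wait_for(self.connect(), timeout=timeout)
    except asyncio.TimeoutError:
        raise SaltReqTimeoutError("Message timed out")
    message_id = self._message_id()
    header = {"mid": message_id}
    future = tornado.concurrent.Future()
    ...
```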

AppCrashExpress commented 1 month ago

I think I broke the triage workflow by renaming the issue before it was done running. Sorry!