Description
Hello!
We have updated Saltstack to 3007.1 (hooray!).
Our masters are deployed in a fairly ephemeral way, which means they may sometimes go offline or have their DNS records changed. However, if a minion picks a master whose record is outdated or which is simply offline, it gets stuck in a loop and never tries to switch to another master:
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
[DEBUG ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[TRACE ] ReqChannel send clear load={'cmd': '_auth', 'id': 'minion_id', 'nonce': '0123456789abcdef0123456789abcdef', 'token': b'bytestring', 'pub': 'pubkey'}
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 192.168.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
...
That goes on and on.
This issue was tested on commit https://github.com/saltstack/salt/commit/2b266935e42780ebd86b3a144d6897f0ae174b2b (I think? I lost track.)
Setup
This configuration contains localhost, but no master is actually running on localhost. The point of the configuration is to emulate a dead master, so the minion should eventually skip 127.0.0.1 and fail over to the next entry:
master:
- 127.0.0.1
- salt-master1.example.org
- salt-master2.example.org
master_type: str
random_master: False # Force to be in this order
master_alive_interval: 1
retry_dns: 0 # Failover should try masters on failure
transport: tcp
auth_timeout: 3
auth_tries: 2
acceptance_wait_time: 5
random_reauth_delay: 0
ping_interval: 5
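The minion config above is the only config included in the report; for a TCP-transport repro the master side needs nothing special beyond enabling the same transport. A minimal master config might look like the following (an assumption for illustration, not taken from the report):

```yaml
# Hypothetical minimal master config for the TCP-transport repro.
# Only `transport: tcp` is strictly required; the other values are the defaults.
transport: tcp
interface: 0.0.0.0   # listen on all interfaces
ret_port: 4506       # request/return port the minion connects to
```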
- [x] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [x] onedir packaging
- [ ] used bootstrap to install
Steps to Reproduce the behavior
1. Configure the TCP transport on both master and minion (see the master-side sketch in the Setup section above). The first IP in the minion's master list should point to a dead/nonexistent master, and master randomization should be disabled (for determinism).
2. Run `salt-call -l debug state.test` on the minion.
3. The minion attempts to connect to the dead master first and keeps doing so forever:
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
...
Expected behavior
The minion should instead eventually give up and try the next master in the list:
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[WARNING ] TCP Message Client encountered an exception while connecting to 127.0.0.1:4506: StreamClosedError('Stream is closed'), will reconnect in 1 seconds
[TRACE ] Failed to send msg SaltReqTimeoutError('Message timed out')
[DEBUG ] Closing AsyncReqChannel instance
[INFO ] Master 127.0.0.1 could not be reached, trying next master (if any)
[DEBUG ] "salt-master1.example.org" Not an IP address? Assuming it is a hostname.
[WARNING ] Master ip address changed from 127.0.0.1 to 192.168.0.1
[DEBUG ] Master URI: tcp://192.168.0.1:4506
[DEBUG ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'minion_id', 'tcp://192.168.0.1:4506')
[DEBUG ] salt.crypt.get_rsa_key: Loading private key
Screenshots
Most likely inapplicable (big wall of warnings).
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
```yaml
Salt Version:
Salt: 3007.1
Python Version:
Python: 3.12.3 (main, Jun 3 2024, 10:10:13) [Clang 16.0.6 ]
Dependency Versions:
cffi: 1.16.0
cherrypy: Not Installed
dateutil: 2.9.0.post0
docker-py: 7.1.0
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.4
libgit2: Not Installed
looseversion: 1.3.0
M2Crypto: 0.38.0
Mako: 1.3.5
msgpack: 1.0.8
msgpack-pure: Not Installed
mysql-python: 1.4.6
packaging: 21.3
pycparser: 2.22
pycrypto: Not Installed
pycryptodome: Not Installed
pygit2: Not Installed
python-gnupg: 0.5.2
PyYAML: 5.4.1
PyZMQ: 25.1.2
relenv: Not Installed
smmap: 5.0.1
timelib: 0.3.0
Tornado: 6.4.1
ZMQ: 4.1.2
Salt Package Information:
Package Type: Not Installed
System Versions:
dist: ubuntu 22.04 jammy
locale: utf-8
machine: x86_64
release: 6.8.0-40-generic
system: Linux
version: Ubuntu 22.04 jammy
```
Additional context
This issue is caused by the TCP transport attempting to connect before setting up the handlers for message timeouts: https://github.com/saltstack/salt/blob/2b266935e42780ebd86b3a144d6897f0ae174b2b/salt/transport/tcp.py#L1849-L1857
The solution seems to be as simple as removing those lines, since the TCP transport connects again after the proper timeout handlers have been set up.
However, these lines are probably there for a reason? Or is it safe to remove them? They weren't in the 3005.1 MessageClient, which is confusing :( https://github.com/saltstack/salt/blob/6226b9cb582daf675c7b49e59af16dc5d338d10d/salt/transport/tcp.py#L753-L757
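To make the ordering problem concrete, here is a minimal, self-contained asyncio sketch. This is not Salt's actual code; the names `MASTERS`, `connect_forever`, and `send_with_failover` are invented for illustration. It shows why a connect/retry loop that runs before any request timeout is armed can trap the caller on a dead host forever, whereas arming the timeout around the whole exchange lets the caller fail over:

```python
import asyncio

MASTERS = [("127.0.0.1", 4506), ("192.0.2.10", 4506)]  # first entry plays the dead master


async def connect_forever(host, port):
    """Unbounded retry loop, analogous to 'will reconnect in 1 seconds'."""
    while True:
        try:
            return await asyncio.open_connection(host, port)
        except OSError:
            await asyncio.sleep(1)


async def send_with_failover(masters, timeout=3):
    """Arm the timeout around the whole exchange, *including* connecting."""
    for host, port in masters:
        try:
            reader, writer = await asyncio.wait_for(connect_forever(host, port), timeout)
        except asyncio.TimeoutError:
            print(f"Master {host} could not be reached, trying next master (if any)")
            continue
        writer.close()
        await writer.wait_closed()
        return host
    raise RuntimeError("no master could be reached")


# Buggy ordering for comparison: connecting happens *before* the timeout is armed,
# so for a dead host `connect_forever` never returns, the timeout below is never
# reached, and the loop over masters never advances to the next entry.
#
# async def send_stuck(masters, timeout=3):
#     for host, port in masters:
#         reader, writer = await connect_forever(host, port)   # blocks forever
#         await asyncio.wait_for(do_request(reader, writer), timeout)


if __name__ == "__main__":
    try:
        print("connected to", asyncio.run(send_with_failover(MASTERS)))
    except RuntimeError as exc:
        print(exc)
```

Under this sketch, a dead first address costs only `timeout` seconds before the client moves on to the next master, which is the behaviour shown in the expected-behavior log above.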