saltstack / salt

[BUG] Master failback fails to work if none of the masters can be resolved #66534

Open Jepson2k opened 4 months ago

Description If a minion is set up in multi-master failover mode, each master is given as a domain name, and none of those names can be resolved, the minion keeps retrying only the last master in the list and never tries the first one again, even when the master_failback parameter is set.

Setup minion config:

master:
    - examplehostname
    - examplehostname.local
master_type: failover
master_failback: True
retry_dns: 0

Steps to Reproduce the behavior

  1. Don't set up a master on the network, to simulate the master being down or disconnected.
  2. Set up a minion with the configuration above and run salt-minion.

Expected behavior The minion fails back to trying the first master when it cannot resolve the last one, because the first master might be up again by then.

Versions Report

salt --versions-report

No difference in Salt versions between master and minion.

```yaml
Salt Version:
  Salt: 3007.0

Python Version:
  Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
  cffi: 1.16.0
  cherrypy: 18.8.0
  dateutil: 2.8.2
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.3
  libgit2: Not Installed
  looseversion: 1.3.0
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 23.1
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.19.1
  pygit2: Not Installed
  python-gnupg: 0.5.2
  PyYAML: 6.0.1
  PyZMQ: 25.1.2
  relenv: 0.15.1
  smmap: Not Installed
  timelib: 0.3.0
  Tornado: 6.3.3
  ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
  dist: ubuntu 22.04.4 jammy
  locale: utf-8
  machine: x86_64
  release: 6.5.0-28-generic
  system: Linux
  version: Ubuntu 22.04.4 jammy
```

Additional context I help manage 50 laptops that we use for various events. The setup has to be flexible and work on different networks, so we try to use multicast DNS names for resolution. Some networks don't support mDNS but do resolve the plain hostname. We've therefore had decent success listing both the salt-master's hostname and its hostname.local. Unfortunately, neither of these name resolution techniques is very reliable, so it would be useful for the salt minions to keep trying both rather than just the last one.

Potentially why this is occurring Without diving too deep into the code base, here is what I've observed:

  1. At line 687 in minion.py, opts["master"], which was originally a list, is set to just one of the masters: opts["master"] = master
  2. Since none of the master names can be resolved, the error on line 702 is raised: raise SaltClientError(msg)
  3. The coroutine waits according to the acceptance_wait_time parameter in the minion config.
  4. The loop repeats and eval_master is called again. Because opts["master"] is now a string, the conditional on line 600, elif isinstance(opts["master"], str) and ("master_list" not in opts):, is taken instead of the conditional on line 611, elif failed:, which would set opts["master"] back to the list.

Potential Solution I don't plan on opening a pull request, since I am not familiar enough with Salt to know whether this breaks anything else, but changing line 600 in minion.py to elif isinstance(opts["master"], str) and ("master_list" not in opts) and not failed: seemed to fix the issue.
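
To make the branch ordering easier to follow, here is a minimal toy model of the behavior described above. It is not Salt's actual code: the functions masters_to_try and simulate, the configured argument (a stand-in for wherever the original list is kept), and the patched flag are all hypothetical names made up for illustration. It also assumes that failed is true on every retry after the first resolution failure, which is a guess about how the retry loop calls eval_master.

```python
def masters_to_try(opts, configured, failed, patched=False):
    """Toy model of the branch order described in this issue (not Salt code).

    Returns the list of master names the next attempt would walk through.
    `configured` is a hypothetical stand-in for the originally configured list.
    """
    master = opts["master"]
    if isinstance(master, list):
        return list(master)

    # Models the conditional reported at line 600.
    take_str_branch = isinstance(master, str) and "master_list" not in opts
    if patched:
        # The change proposed above: skip this branch after a failure.
        take_str_branch = take_str_branch and not failed

    if take_str_branch:
        # A lone string is wrapped into a one-element list.
        return [master]
    if failed:
        # Models the `elif failed:` branch: go back to the full list.
        return list(configured)
    return [master]


def simulate(rounds, patched):
    """Run `rounds` connection attempts in which no name ever resolves."""
    configured = ["examplehostname", "examplehostname.local"]
    opts = {"master": list(configured)}
    failed = False  # assumption: becomes True once an attempt has failed
    history = []
    for _ in range(rounds):
        candidates = masters_to_try(opts, configured, failed, patched)
        history.append(candidates)
        for name in candidates:
            # Step 1 of the list above: the list is overwritten by one name.
            opts["master"] = name
        # Step 2: every name failed to resolve, so the attempt fails.
        failed = True
    return history


if __name__ == "__main__":
    print("unpatched:", simulate(3, patched=False))
    print("patched:  ", simulate(3, patched=True))
```

In this model the unpatched run tries both names once and then only examplehostname.local on every later round, while the patched run keeps trying both names, which matches what I saw on a real minion.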

Temporary Workaround Adding an IP address such as 127.0.0.1 to the list of masters fixes this issue.