saltstack / salt


[BUG] state.apply performance degradation with master_type: failover #60854

Open lomeroe opened 3 years ago

lomeroe commented 3 years ago

Description

When a minion is configured with a list of masters and master_type is set to failover, the minion connects to every master in the list each time a state item completes. This increases the overall state run time several fold. If any master in the list is down, the run time increases even further.

Setup

Minion config:

master_type: failover
master:
  - master1
  - master2
  - master3

Apply a state file that contains some number of items. The more items the state contains, the longer the run takes.


Each state item produces minion log entries like the following:

[INFO    ] Completed state [my_state] at time 17:20:59.541977 (duration_in_ms=17.195)
[DEBUG   ] Initializing new SAuth for ('/etc/salt/pki/minion', 'my_minion_id', 'tcp://10.10.10.2:4506')
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'my_minion_id', 'tcp://10.10.10.100:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'my_minion_id', 'tcp://10.10.10.100:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://10.10.10.100:4506
[DEBUG   ] Trying to connect to: tcp://10.10.10.100:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance

The connection lines then repeat for every master in the master list, for example:

[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'my_minion_id', 'tcp://<master2_ip>:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'my_minion_id', 'tcp://<master2_ip>:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://<master2_ip>:4506
[DEBUG   ] Trying to connect to: tcp://<master2_ip>:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/minion', 'my_minion_id', 'tcp://<master3_ip>:4506', 'aes')
[DEBUG   ] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'my_minion_id', 'tcp://<master3_ip>:4506')
[DEBUG   ] Connecting the Minion to the Master URI (for the return server): tcp://<master3_ip>:4506
[DEBUG   ] Trying to connect to: tcp://<master3_ip>:4506
[DEBUG   ] Closing AsyncZeroMQReqChannel instance

The reported "total run time" does not account for the delay incurred by connecting to each master.

Expected behavior

The minion only communicates with the initial publish master for returns (unless it needs to switch mid-run due to a master failure).

Versions Report

Tested with salt-minion 2018.3, 2019.2, 3001, and 3003, against salt-masters at 3001.

A salt-minion at version 2018.3 does not exhibit this behavior (it only returns to a single master).

Additional context

Anecdotally, a test state with 77 state items:

1) with "master_type" not explicitly set, runs in ~16 seconds of "real" time and reports a total run time of 2.2s-2.3s
2) with master_type set to failover, runs in ~1 minute 30 seconds of "real" time and reports a run time of 2.2s-2.3s
3) with master_type set to failover and one master in the list "offline", runs in ~8 minutes of "real" time and still reports a run time of 2.2s-2.3s
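For a rough sense of scale, the anecdotal timings above can be turned into an average overhead per state item. This is only back-of-the-envelope arithmetic on the approximate figures quoted in this report; the variable and function names are made up for illustration, and the even split across three masters is an assumption.

```python
# Approximate wall-clock figures quoted in the report (seconds).
STATE_ITEMS = 77
MASTERS = 3
BASELINE_S = 16.0             # master_type not explicitly set
FAILOVER_S = 90.0             # master_type: failover, all masters up
FAILOVER_ONE_DOWN_S = 480.0   # master_type: failover, one master offline


def extra_per_item(total_s, baseline_s=BASELINE_S, items=STATE_ITEMS):
    """Average extra wall-clock seconds per state item versus the baseline."""
    return (total_s - baseline_s) / items


per_item_up = extra_per_item(FAILOVER_S)
per_item_down = extra_per_item(FAILOVER_ONE_DOWN_S)
# Assumes the extra time is spread evenly over the three master connections.
per_connection_up = per_item_up / MASTERS

print(f"extra per item, all masters up:  {per_item_up:.2f} s")
print(f"extra per connection (assumed):  {per_connection_up:.2f} s")
print(f"extra per item, one master down: {per_item_down:.2f} s")
```

On these numbers, each state item costs roughly one extra second when all masters are reachable and roughly six extra seconds when one master is offline, none of which shows up in the reported run time.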

lomeroe commented 3 years ago

The master return connections are being caused by state_events: True being set on the master. This causes event.fire_master to be called, which then fires the event on every master in the list.
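The fan-out described here can be illustrated with a toy model that counts master connections per state run. It is not Salt's code: `event_connections` and its `fan_out` flag are hypothetical names, and the model assumes exactly one event per completed state item.

```python
def event_connections(state_items, masters, fan_out):
    """Count master connections opened for state events in one run.

    fan_out=True models the reported behavior (every master, every item);
    fan_out=False models the expected behavior (one master per item).
    """
    per_item = len(masters) if fan_out else 1
    return state_items * per_item


masters = ["master1", "master2", "master3"]  # names from the minion config above
reported = event_connections(77, masters, fan_out=True)
expected = event_connections(77, masters, fan_out=False)

print(f"reported behavior: {reported} connections")
print(f"expected behavior: {expected} connections")
```

Under these assumptions the 77-item test state opens 231 connections instead of 77, and each attempt against an offline master adds its connection timeout to the wall-clock time.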