saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14.14k stars 5.47k forks source link

[BUG] master_type failover connection lost to master after each master_alive_interval and sends a new /start event at every interval #66497

Open amalaguti opened 5 months ago

amalaguti commented 5 months ago

Description 3006.8 Windows minion configured in multimaster failover, when minion starts, after each master_alive_interval it shows "Connection to master (second) lost", even when this master is still available, and tries to connect to the following master in the list, and this loop continues after each master_alive_interval, switching connection from master to master in the masters list

It also generates multiple /start events (on each master) (the minion service is not restarted, so there should not been /start events after the initial connection)

IMPORTANT: Don't see the same behavior in Windows minion 3006.7

salt/minion/minion-win-1/start  {
    "_stamp": "2024-05-10T00:33:48.353903",
    "cmd": "_minion_event",
    "data": "Minion minion-win-1 started at Fri May 10 00:33:47 2024",
    "id": "minion-win-1",
    "pretag": null,
    "tag": "salt/minion/minion-win-1/start"
}
salt/minion/minion-win-1/start  {
    "_stamp": "2024-05-10T00:34:17.309567",
    "cmd": "_minion_event",
    "data": "Minion minion-win-1 started at Fri May 10 00:34:16 2024",
    "id": "minion-win-1",
    "pretag": null,
    "tag": "salt/minion/minion-win-1/start"
}
salt/minion/minion-win-1/start  {
    "_stamp": "2024-05-10T00:34:46.306666",
    "cmd": "_minion_event",
    "data": "Minion minion-win-1 started at Fri May 10 00:34:45 2024",
    "id": "minion-win-1",
    "pretag": null,
    "tag": "salt/minion/minion-win-1/start"
}
master: 
  - 172.21.0.10
  - 172.21.0.11
auth_tries: 3
auth_timeout: 5
master_tries: 1 # same with -1
enable_legacy_startup_events: False
master_type: failover
master_alive_interval: 20
log_level_logfile: info

Minion log (log messages have been changed from log.debug to log.info in salt/minion.py module to avoid using debug logging)

2024-05-10 00:57:17,730 [salt.cli.daemons                         :280 ][INFO    ][5072] Setting up the Salt Minion "minion-win-1"
2024-05-10 00:57:17,967 [salt.cli.daemons                         :82  ][INFO    ][5072] Starting up the Salt Minion
2024-05-10 00:57:17,967 [salt.utils.event                         :1126][INFO    ][5072] Starting pull socket on 4511
2024-05-10 00:57:18,592 [salt.minion                                                              :1305][INFO    ][5072] Creating minion process manager
2024-05-10 00:57:18,639 [salt.minion                                                              :554 ][INFO    ][5072] Got list of available master addresses: ['172.21.0.10', '172.21.0.11']
2024-05-10 00:57:18,639 [salt.minion                                                              :641 ][CRITICAL][5072] 'master_type' set to 'failover' but 'retry_dns' is not 0. Setting 'retry_dns' to 0 to failover to the next master on DNS errors.
2024-05-10 00:57:18,639 [salt.minion                                                              :232 ][INFO    ][5072] Master URI: tcp://172.21.0.10:4506
2024-05-10 00:57:18,655 [salt.minion                                                              :232 ][INFO    ][5072] Master URI: tcp://172.21.0.11:4506
2024-05-10 00:57:18,655 [salt.minion                                                              :712 ][INFO    ][5072] Connecting to master. Attempt 1 of 1
2024-05-10 00:57:18,655 [salt.minion                                                              :232 ][INFO    ][5072] Master URI: tcp://172.21.0.10:4506
2024-05-10 00:57:18,842 [salt.minion                                                              :1402][INFO    ][5072] Connecting minion's long-running req channel
2024-05-10 00:57:19,452 [salt.loader.['172.21.0.10', '172.21.0.11'].int.module.cmdmod             :438 ][INFO    ][5072] Executing command tzutil in directory 'C:\Windows\system32\config\systemprofile'
2024-05-10 00:57:19,686 [salt.utils.schedule                                                      :415 ][INFO    ][5072] Added new job __mine_interval to scheduler
2024-05-10 00:57:19,686 [salt.minion                                                              :1481][INFO    ][5072] Added mine.update to scheduler
2024-05-10 00:57:19,686 [salt.utils.schedule                                                      :415 ][INFO    ][5072] Added new job __master_alive_172.21.0.10 to scheduler
2024-05-10 00:57:19,702 [salt.minion                                                              :2694][INFO    ][5072] Minion is starting as user 'SYSTEM'
2024-05-10 00:57:19,702 [salt.minion                                                              :3119][INFO    ][5072] Minion 'minion-win-1' trying to tune in
2024-05-10 00:57:19,702 [salt.minion                                                              :3129][INFO    ][5072] Minion is ready to receive requests!
2024-05-10 00:57:19,702 [salt.minion                                                              :2744][INFO    ][5072] Minion of '172.21.0.10' is handling event tag '/salt/minion/minion_schedule_delete_complete'
2024-05-10 00:57:19,702 [salt.minion                                                              :2744][INFO    ][5072] Minion of '172.21.0.10' is handling event tag '/salt/minion/minion_schedule_add_complete'
2024-05-10 00:57:19,702 [salt.minion                                                              :2744][INFO    ][5072] Minion of '172.21.0.10' is handling event tag '/salt/minion/minion_schedule_add_complete'
2024-05-10 00:57:19,702 [salt.minion                                                              :2744][INFO    ][5072] Minion of '172.21.0.10' is handling event tag '/salt/minion/minion_schedule_delete_complete'
2024-05-10 00:57:19,702 [salt.minion                                                              :2744][INFO    ][5072] Minion of '172.21.0.10' is handling event tag '__master_req_channel_payload/172.21.0.10'
2024-05-10 00:57:19,702 [salt.minion                                                              :893 ][INFO    ][5072] Minion return retry timer set to 9 seconds (randomized)
2024-05-10 00:57:20,703 [salt.utils.schedule                                                      :1792][INFO    ][5072] Running scheduled job: __mine_interval with jid 20240510005720703948
2024-05-10 00:57:40,714 [salt.utils.schedule                                                            :1792][INFO    ][5072] Running scheduled job: __master_alive_172.21.0.10 with jid 20240510005740714178
2024-05-10 00:57:42,266 [salt.minion                                                                    :2744][INFO    ][5072] Minion of '172.21.0.10' is handling event tag '__master_disconnected'
2024-05-10 00:57:42,266 [salt.minion                                                                    :2828][INFO    ][5072] Connection to master 172.21.0.10 lost
2024-05-10 00:57:42,266 [salt.minion                                                                    :2832][INFO    ][5072] Trying to tune in to next master from master-list
2024-05-10 00:57:42,282 [salt.minion                                                                    :616 ][INFO    ][5072] Moving possibly failed master 172.21.0.10 to the end of the list of masters
2024-05-10 00:57:42,282 [salt.minion                                                                    :187 ][WARNING ][5072] Master ip address changed from 172.21.0.10 to 172.21.0.11
2024-05-10 00:57:42,282 [salt.minion                                                                    :232 ][INFO    ][5072] Master URI: tcp://172.21.0.11:4506
2024-05-10 00:57:42,282 [salt.minion                                                                    :232 ][INFO    ][5072] Master URI: tcp://172.21.0.10:4506
2024-05-10 00:57:42,282 [salt.minion                                                                    :712 ][INFO    ][5072] Connecting to master. Attempt 1 of 1
2024-05-10 00:57:42,282 [salt.minion                                                                    :187 ][WARNING ][5072] Master ip address changed from 172.21.0.10 to 172.21.0.11
2024-05-10 00:57:42,282 [salt.minion                                                                    :232 ][INFO    ][5072] Master URI: tcp://172.21.0.11:4506
2024-05-10 00:57:42,313 [salt.minion                                                                    :2744][INFO    ][5072] Minion of '172.21.0.11' is handling event tag '/salt/minion/minion_schedule_delete_complete'
2024-05-10 00:57:42,391 [salt.minion                                                                    :2859][INFO    ][5072] Re-initialising subsystems for new master 172.21.0.11
2024-05-10 00:57:42,985 [salt.minion                                                                    :2880][INFO    ][5072] Minion is ready to receive requests!
2024-05-10 00:57:43,000 [salt.minion                                                                    :2744][INFO    ][5072] Minion of '172.21.0.11' is handling event tag '__master_connected'
2024-05-10 00:57:43,000 [salt.minion                                                                    :2744][INFO    ][5072] Minion of '172.21.0.11' is handling event tag '__master_req_channel_payload/172.21.0.11'
2024-05-10 00:57:43,000 [salt.minion                                                                    :893 ][INFO    ][5072] Minion return retry timer set to 8 seconds (randomized)
2024-05-10 00:58:03,717 [salt.utils.schedule                                                            :1792][INFO    ][5072] Running scheduled job: __master_alive_172.21.0.11 with jid 20240510005803717225
2024-05-10 00:58:05,248 [salt.minion                                                                    :2744][INFO    ][5072] Minion of '172.21.0.11' is handling event tag '__master_disconnected'
2024-05-10 00:58:05,248 [salt.minion                                                                    :2828][INFO    ][5072] Connection to master 172.21.0.11 lost
2024-05-10 00:58:05,264 [salt.minion                                                                    :2832][INFO    ][5072] Trying to tune in to next master from master-list
2024-05-10 00:58:05,264 [salt.minion                                                                    :616 ][INFO    ][5072] Moving possibly failed master 172.21.0.11 to the end of the list of masters
2024-05-10 00:58:05,264 [salt.minion                                                                    :187 ][WARNING ][5072] Master ip address changed from 172.21.0.11 to 172.21.0.10

Setup 3006.8 master and Windows minion

Steps to Reproduce the behavior Configure minion as shown in multimaster failover, start minion and check minion log and event bus after each master_alive_interval

Expected behavior The minion should stay connected to the second master when the master_alive_interval runs, no additional /start event should be seen after the initial minion start

Versions Report 3006.8

butch12 commented 2 weeks ago

I have also experienced this issue on Windows minions (v3006.9). The master status checks always fail and cause the minions to re-highstate every master_alive_interval.

It looks like problem is caused by this code incorrectly using local address and port rather than remote address and port for the connection check. https://github.com/saltstack/salt/blob/master/salt/modules/win_status.py#L500-L501

As a quick test I patched my local copy of win_status.py and that seems to fix the problem.

@@ -497,8 +497,8 @@

     for conn in conns:
         if conn.status == psutil.CONN_ESTABLISHED:
-            if conn.laddr.port == port:
-                connected_ips.add(conn.laddr.ip)
+            if conn.raddr.port == port:
+                connected_ips.add(conn.raddr.ip)

     return connected_ips

Before patching win_status.py:

PS > salt-call status.master master=saltha-1.foo.com
[INFO    ] Got list of available master addresses: ['saltha-1.foo.com', 'saltha-2.foo.com', 'saltha-3.foo.com']
    False
PS > salt-call status.master master=saltha-2.foo.com
[INFO    ] Got list of available master addresses: ['saltha-1.foo.com', 'saltha-2.foo.com', 'saltha-3.foo.com']
local:
    False
PS > salt-call status.master master=saltha-3.foo.com
[INFO    ] Got list of available master addresses: ['saltha-1.foo.com', 'saltha-2.foo.com', 'saltha-3.foo.com']
local:
    False

After patching win_status.py:

PS > salt-call status.master master=saltha-1.foo.com
[INFO    ] Got list of available master addresses: ['saltha-1.foo.com', 'saltha-2.foo.com', 'saltha-3.foo.com']
    True
PS > salt-call status.master master=saltha-2.foo.com
[INFO    ] Got list of available master addresses: ['saltha-1.foo.com', 'saltha-2.foo.com', 'saltha-3.foo.com']
local:
    False
PS > salt-call status.master master=saltha-3.foo.com
[INFO    ] Got list of available master addresses: ['saltha-1.foo.com', 'saltha-2.foo.com', 'saltha-3.foo.com']
local:
    False