saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0
14.1k stars 5.47k forks source link

[BUG] Salt scheduler for state.sls fails erraticaly on multimaster setup when connected master is not available, does not connect to the other master #66375

Open amalaguti opened 5 months ago

amalaguti commented 5 months ago

Description Windows minion 3006.1 (and 3006.7, possibly all 3006.x), configured in multimaster, having connected to the first master in list, a scheduled job is triggered when this connected master is offline, the minion fails to run the job due it seems can't connect to the other available master. Looks like it remains stick to the first master connected, and does not flip the connection to the other master.

In many occassions the minion ssems to stay in not a healthy state, requiring the minion service to be restarted to re-establish healthy connection.

The failure is erratic and it's not consistent, some times it connects and runs fine in all minions, some times a limited set of the minions succeed while other group fail.

Tried multiple times, with different settings like ping_interval, master_tries, auth_safemode, and return_job: False no luck to make it consistent. No setting seem to be able to definitely fix it.

Clearly the problem is due the schedule job runnning state.sls which requires to establish connection to the master. If the job is tweaked to run something else that does not require master connection, it works fine

master: 
  - 172.21.0.10
  - 172.21.0.11

master_tries: -1
# Seems to be irrelevant minions fail with/without these settings
#ping_interval: 1
#auth_safemode: True
PS C:\Users\adrian> salt-call --local schedule.list
local:
    schedule:
      schedule_new_task:
        args:
        - utils.reboot_system_module
        enabled: true
        function: state.sls
        kwargs:
          queue: true
          saltenv: base
        maxrunning: 1
        name: schedule_new_task
        return_job: false
        saved: true
        splay: 10
        when:
        - '2024-04-16 02:43:58'

Even an exception can be seen while the minion tries to connect to the master(s), it's not always enough to determine the issue, despite the exception, sometimes it works fine

salt.exceptions.SaltReqTimeoutError: Message timed out
2024-04-16 02:30:57,374 [salt.utils.schedule                                                      :32  ][TRACE   ][5692] ==== evaluating schedule now None =====
2024-04-16 02:30:57,390 [salt.channel.client                                                      :32  ][TRACE   ][5692] Failed to send msg SaltReqTimeoutError('Message timed out')
2024-04-16 02:30:57,390 [tornado.application                                                      :353 ][ERROR   ][5692] Future <salt.ext.tornado.concurrent.Future object at 0x000002389E8F7EB0> exception was never retrieved: Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\minion.py", line 2700, in handle_event
    yield _minion.req_channel.send(
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 295, in send
    ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 252, in _crypted_transfer
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\channel\client.py", line 233, in _do_transfer
    data = yield self.transport.send(
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 916, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\transport\zeromq.py", line 626, in send
    recv = yield future
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

Setup Setup a multimaster minion Run a state to save a schedule job to run a few minutes later that calls state.sls Put the active master offline, usually the first one in the minion master list Wait for the schedule job to run (may or may not work) It's convenient to try with multiple minions, some may work, some may not.

Please be as specific as possible and give set-up details.

Steps to Reproduce the behavior (Include debug logs if possible and relevant)

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.) ```yaml PASTE HERE ```

Additional context Add any other context about the problem here.

amalaguti commented 5 months ago

Including 3007.0, same behavior,

Fails the same way, during the scheduler execution looks like it fails to contact the other master, and so the job fails

salt.exceptions.SaltReqTimeoutError: Request client send timedout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\schedule.py", line 880, in handle_func
    ret["return"] = self.functions[func](*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\loader\lazy.py", line 160, in __call__
    ret = self.loader.run(run_func, *args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\loader\lazy.py", line 1233, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\loader\lazy.py", line 1248, in _run_as
    return _func_or_method(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\modules\state.py", line 1425, in sls
    st_ = salt.state.HighState(
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\state.py", line 5007, in __init__
    BaseHighState.__init__(self, opts)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\state.py", line 3764, in __init__
    self.opts = self.__gen_opts(opts)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\state.py", line 3787, in __gen_opts
    mopts = self.client.master_opts()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\fileclient.py", line 1477, in master_opts
    return self._channel_send(
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\fileclient.py", line 1147, in _channel_send
    return self.channel.send(
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\asynchronous.py", line 139, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\asynchronous.py", line 147, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\tornado\ioloop.py", line 527, in run_sync
    return future_cell[0].result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\tornado\gen.py", line 780, in run
    yielded = self.gen.throw(exc)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\channel\client.py", line 340, in send
    ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\tornado\gen.py", line 767, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\tornado\gen.py", line 780, in run
    yielded = self.gen.throw(exc)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\channel\client.py", line 291, in _crypted_transfer
    yield self.auth.authenticate()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\tornado\gen.py", line 767, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\crypt.py", line 832, in _authenticate
    creds = yield self.sign_in(channel=channel)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\tornado\gen.py", line 767, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\tornado\gen.py", line 780, in run
    yielded = self.gen.throw(exc)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\crypt.py", line 951, in sign_in
    raise SaltClientError(
salt.exceptions.SaltClientError: Attempt to authenticate with the salt master failed with timeout error
2024-04-16 14:28:15,021 [salt.utils.schedule                        :963 ][DEBUG   ][3568] schedule.handle_func: Removing C:\ProgramData\Salt Project\Salt\var\cache\salt\minion\proc\20240416142628380999
2024-04-16 14:28:15,881 [salt.utils.process                                                       :1157][DEBUG   ][1168] Subprocess Schedule(name=schedule_new_task, jid=20240416142628380999) cleaned up
amalaguti commented 4 months ago

Issue also valid for 3006.8

2024-05-10 01:56:55,036 [salt.utils.schedule                                                      :411 ][INFO    ][584] Updating job settings for scheduled job: schedule_new_task
2024-05-10 01:56:55,052 [salt.minion                                                              :2744][INFO    ][584] Minion of '172.21.0.10' is handling event tag '/salt/minion/minion_schedule_add_complete'
2024-05-10 01:56:55,052 [salt.minion                                                              :2744][INFO    ][584] Minion of '172.21.0.11' is handling event tag '/salt/minion/minion_schedule_add_complete'
2024-05-10 01:56:55,052 [salt.minion                                                              :2744][INFO    ][584] Minion of '172.21.0.10' is handling event tag '/salt/minion/minion_schedule_add_complete'
2024-05-10 01:56:55,052 [salt.minion                                                              :2744][INFO    ][584] Minion of '172.21.0.11' is handling event tag '/salt/minion/minion_schedule_add_complete'
2024-05-10 01:56:55,052 [salt.state                                                               :319 ][INFO    ][7648] {'schedule_new_task': 'added'}
2024-05-10 01:56:55,052 [salt.state                                                               :2492][INFO    ][7648] Completed state [schedule_new_task] at time 01:56:55.052369 (duration_in_ms=171.872)
2024-05-10 01:56:55,083 [salt.loaded.int.module.cmdmod                                            :438 ][INFO    ][7648] Executing command attrib in directory 'C:\Users\adrian'
2024-05-10 02:01:59,665 [salt.utils.schedule                                                      :1792][INFO    ][584] Running scheduled job: schedule_new_task (runtime missed by 0.665372 seconds) with jid 20240510020159665372
2024-05-10 02:02:46,320 [salt.utils.schedule                      :913 ][ERROR   ][8128] Unhandled exception running state.sls
Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\crypt.py", line 768, in sign_in
    payload = yield channel.send(sign_in_payload, tries=tries, timeout=timeout)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\channel\client.py", line 337, in send
    ret = yield self._uncrypted_transfer(load, timeout=timeout)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\channel\client.py", line 308, in _uncrypted_transfer
    ret = yield self.transport.send(
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\zeromq.py", line 916, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\zeromq.py", line 595, in send
    recv = yield future
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\schedule.py", line 880, in handle_func
    ret["return"] = self.functions[func](*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\loader\lazy.py", line 159, in __call__
    ret = self.loader.run(run_func, *args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\loader\lazy.py", line 1245, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\loader\lazy.py", line 1260, in _run_as
    return _func_or_method(*args, **kwargs)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\modules\state.py", line 1425, in sls
    st_ = salt.state.HighState(
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\state.py", line 4954, in __init__
    BaseHighState.__init__(self, opts)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\state.py", line 3754, in __init__
    self.opts = self.__gen_opts(opts)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\state.py", line 3777, in __gen_opts
    mopts = self.client.master_opts()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\fileclient.py", line 1481, in master_opts
    return self._channel_send(
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\fileclient.py", line 1151, in _channel_send
    return self.channel.send(
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\asynchronous.py", line 124, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\asynchronous.py", line 130, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\channel\client.py", line 340, in send
    ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\channel\client.py", line 290, in _crypted_transfer
    yield self.auth.authenticate()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\crypt.py", line 657, in _authenticate
    creds = yield self.sign_in(channel=channel)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1056, in run
    value = future.result()
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\ext\tornado\gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\crypt.py", line 776, in sign_in
    raise SaltClientError(
salt.exceptions.SaltClientError: Attempt to authenticate with the salt master failed with timeout error