saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0
14.2k stars 5.48k forks source link

[BUG] 3007.X scheduler_before_connect throws TCP Publish Client encountered an exception, no start event is seen #66932

Open amalaguti opened 1 month ago

amalaguti commented 1 month ago

Description 3007.1 Windows minions with option scheduler_before_connect: True set throws exception TCP Publish Client encountered an exception while connecting to 127.0.0.1:4510: TimeoutError() on start, and does not generate /start event on master event bus

[TRACE   ] IPCServer: Handling connection to address: ('127.0.0.1', 52006)
[DEBUG   ] The functions from module 'mine' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded mine.update
[DEBUG   ] schedule.handle_func: adding this job to the jobcache with data {'id': 'minion-win-1', 'fun': 'mine.update', 'fun_args': [], 'schedule': '__mine_interval', 'jid': '20240927221656953118', 'pid': 420}
[TRACE   ] ==== evaluating schedule now None =====
[TRACE   ] Client disconnected from IPC 127.0.0.1:4511
[WARNING ] TCP Publish Client encountered an exception while connecting to 127.0.0.1:4510: TimeoutError(), will reconnect in 1 seconds -   File "C:\Program Files\Salt Project\Salt\Lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,

  File "C:\Program Files\Salt Project\Salt\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)

  File "C:\Program Files\Salt Project\Salt\salt-minion.exe\__main__.py", line 7, in <module>
    sys.exit(salt_minion())

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\scripts.py", line 185, in salt_minion
    minion.start()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\cli\daemons.py", line 344, in start
    self._real_start()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\cli\daemons.py", line 356, in _real_start
    self.minion.tune_in()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\minion.py", line 1193, in tune_in
    self._bind()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\minion.py", line 1064, in _bind
    self.event = salt.utils.event.get_event(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\event.py", line 135, in get_event
    return SaltEvent(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\event.py", line 265, in __init__
    self.connect_pub()

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\utils\event.py", line 348, in connect_pub
    self.subscriber = salt.transport.ipc_publish_client(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\base.py", line 210, in ipc_publish_client
    return publish_client(opts, io_loop, **kwargs)

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\base.py", line 152, in publish_client
    return salt.transport.tcp.PublishClient(

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\tcp.py", line 220, in __init__
    super().__init__(opts, io_loop, **kwargs)

  File "C:\Program Files\Salt Project\Salt\Lib\site-packages\salt\transport\base.py", line 398, in __init__
    super().__init__()

[TRACE   ] TCP PubServer sending payload: topic_list=None '/salt/minion/minion_schedule_delete_complete\n\n��completeèschedule��__mine_interval��enabledèfunction�mine.update�jid_includeêmaxrunning\x02�minutes<�name�__mine_interval�return_job£runìrun_on_startåsplay��_stamp�2024-09-27T22:16:56.937494'
[TRACE   ] TCP PubServer finished publishing payload
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: C:\ProgramData\Salt Project\Salt\conf\pki\minion\minion.pem

No /start event is shown, only auth

salt/auth   {
    "_stamp": "2024-09-27T22:22:51.680278",
    "act": "accept",
    "id": "minion-win-1",
    "pub": "-----BEGIN PUBLIC KEY-----REDACTED-----END PUBLIC KEY-----",
    "result": true
}
minion/refresh/minion-win-1 {
    "Minion data cache refresh": "minion-win-1",
    "_stamp": "2024-09-27T22:22:51.961685"
}

The minion seems responsive after start, but not having the /start event seems to be a big issue.

The scheduler_before_connect option is required in this use case and works fine in 3006.x versions.

Setup 3007.x Windos minion with scheduler_before_connect: True

Expected behavior Minion should start without the exception, the scheduler_before_connect should work as expected and /start event should be seeing on the event bus

amalaguti commented 1 month ago

I discovered a little difference in behavior, if the minion is configured in multimaster (active/active), when the minion is started as service, net start salt-minion, there is no /start event.

But when the minion is started in foreground, salt-minion -l debug, I can randomly get a /start event in one or the other master, but not in both.

With the scheduler_before_connect option removed, both masters get the /start event

amalaguti commented 2 weeks ago

The problem seems to be async/await functions not being called properly, something we've seen recently in another minion connection related issues.

@dwoz @twangboy I would like to ask your review on this core changes due the complexities of this part of the code and general minion connection process, I'm not sure if this are how you want thigs to work for the minion connection setup.

The problem seems to be in the # Main Minion Tune In def tune_in(self, start=True): HERE

Inside this function, the function self.setup_scheduler() is called which triggers the exception.

I simply changed the setup_scheduler() function to be async async def setup_scheduler(self, before_connect=False):. Just with this change the exception is not longer thrown. I also changed the call to this function to use await: await self.setup_scheduler().

In addition, (I will mention this in a new issue), I've observed that this section of the code inside the #Main minion tune in def tune_in(self, start=True): HERE which is suppossed to setup_beacons and setup_scheduler when the options beacons_before_connect or scheduler_before_connect are true, does not seem to be ever triggered during the minion start up.

I added some logging messages and never seem to be called with start: True. There's another function async def _connect_minion(self, minion) from which minion.tune_in() is called with start=False.

if start:
            if self.opts.get("beacons_before_connect", False):
                self.setup_beacons(before_connect=True)
            if self.opts.get("scheduler_before_connect", False):
                self.setup_scheduler(before_connect=True)

I tried this with multimaster setup, two masters, the tune_in() function seems to be executed once per master connected.

amalaguti commented 2 weeks ago

@barneysowood fyi