saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

[BUG] [salt.client :1903][ERROR ][763] Message timed out #63979

Open ecstuchi opened 1 year ago

ecstuchi commented 1 year ago

Description

I have a large Salt infrastructure in which individual Salt Masters manage between 16 and 3000 minions each and also act as Syndics, with a Master of Masters on top of them. All of them hit the same problem intermittently: the message bus apparently runs into trouble and the Salt Master becomes a "zombie" that no longer responds. Restarting the service fixes the issue, but Salt is unreliable this way.

Steps to Reproduce the behavior

I cannot reproduce it on demand, but it happens intermittently on almost all masters.

From master's logs:
2023-03-27 20:59:02,060 [salt.utils.event :912 ][ERROR   ][80275] Event iteration failed with exception: 'list' object has no attribute 'items'
2023-03-27 20:59:02,773 [salt.utils.event :912 ][ERROR   ][80275] Event iteration failed with exception: 'list' object has no attribute 'items'
2023-03-27 20:59:43,016 [salt.utils.event :912 ][ERROR   ][80275] Event iteration failed with exception: 'user'
2023-03-27 21:01:11,943 [salt.utils.event :912 ][ERROR   ][80275] Event iteration failed with exception: 'user'
2023-03-27 21:39:38,738 [salt.utils.event :912 ][ERROR   ][80275] Event iteration failed with exception: 'user'
2023-03-28 14:58:11,940 [salt.payload     :117 ][CRITICAL][80275] Could not deserialize msgpack message. This often happens when trying to read a file not in binary mode. To see message payload, enable debug logging and retry. Exception: unpack(b) received extra data.
2023-03-28 14:58:11,943 [tornado.application:356 ][ERROR   ][80275] Future <salt.ext.tornado.concurrent.Future object at 0x7f8753f2a198> exception was never retrieved: Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/payload.py", line 101, in loads
    ret = salt.utils.msgpack.unpackb(msg, **loads_kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/msgpack.py", line 157, in unpackb
    return msgpack.unpackb(packed, **_sanitize_msgpack_unpack_kwargs(kwargs))
  File "msgpack/_unpacker.pyx", line 209, in msgpack._cmsgpack.unpackb
msgpack.exceptions.ExtraData: unpack(b) received extra data.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
    yielded = next(result)
  File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 424, in handle_message
    payload = self.decode_payload(payload)
  File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 448, in decode_payload
    payload = salt.payload.loads(payload[0])
  File "/usr/lib/python3.6/site-packages/salt/payload.py", line 121, in loads
    raise SaltDeserializationError(exc_msg) from exc
salt.exceptions.SaltDeserializationError: Could not deserialize msgpack message. See log for more info.
2023-03-28 15:07:36,258 [salt.client      :1903][ERROR   ][120722] Message timed out
2023-03-28 15:08:37,068 [salt.client      :1903][ERROR   ][120851] Message timed out
2023-03-28 15:09:37,853 [salt.client      :1903][ERROR   ][120975] Message timed out
2023-03-28 15:18:14,675 [salt.client      :1903][ERROR   ][122295] Message timed out
2023-03-28 15:19:15,470 [salt.client      :1903][ERROR   ][122423] Message timed out
2023-03-28 15:20:16,250 [salt.client      :1903][ERROR   ][122553] Message timed out
2023-03-28 15:31:28,527 [salt.client      :1903][ERROR   ][124481] Message timed out
2023-03-28 15:32:29,296 [salt.client      :1903][ERROR   ][124611] Message timed out
2023-03-28 15:33:30,068 [salt.client      :1903][ERROR   ][124737] Message timed out
2023-03-28 15:42:43,653 [salt.client      :1903][ERROR   ][126567] Message timed out
2023-03-28 15:43:44,456 [salt.client      :1903][ERROR   ][126735] Message timed out
2023-03-28 15:44:45,228 [salt.client      :1903][ERROR   ][126862] Message timed out
2023-03-28 15:53:23,838 [salt.client      :1903][ERROR   ][128267] Message timed out
2023-03-28 15:54:24,612 [salt.client      :1903][ERROR   ][128394] Message timed out
2023-03-28 15:55:25,404 [salt.client      :1903][ERROR   ][128521] Message timed out
2023-03-28 16:04:35,969 [salt.client      :1903][ERROR   ][129941] Message timed out
2023-03-28 16:05:36,767 [salt.client      :1903][ERROR   ][130066] Message timed out
2023-03-28 16:06:37,539 [salt.client      :1903][ERROR   ][130402] Message timed out
2023-03-28 16:08:10,343 [salt.client      :1903][ERROR   ][130929] Message timed out
2023-03-28 16:09:22,135 [salt.client      :1903][ERROR   ][358] Message timed out
2023-03-28 16:10:13,072 [salt.client      :1903][ERROR   ][541] Message timed out
2023-03-28 16:11:19,805 [salt.client      :1903][ERROR   ][763] Message timed out
[root@saltmaster ~]# salt '*' test.ping 
[ERROR   ] Message timed out
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

[root@saltmaster ~]# systemctl status salt-master.service -l
● salt-master.service - The Salt Master Server
   Loaded: loaded (/usr/lib/systemd/system/salt-master.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-12-06 16:17:47 IST; 3 months 20 days ago
     Docs: man:salt-master(1)
           file:///usr/share/doc/salt/html/contents.html
           https://docs.saltproject.io/en/latest/contents.html
 Main PID: 79896 (salt-master)
   CGroup: /system.slice/salt-master.service
           ├─ 1421 /usr/bin/python3 /usr/bin/salt-master
           ├─79896 /usr/bin/python3 /usr/bin/salt-master
           ├─80263 /usr/bin/python3 /usr/bin/salt-master
           ├─80265 /usr/bin/python3 /usr/bin/salt-master
           ├─80269 /usr/bin/python3 /usr/bin/salt-master
           ├─80270 /usr/bin/python3 /usr/bin/salt-master
           ├─80273 /usr/bin/python3 /usr/bin/salt-master
           ├─80274 /usr/bin/python3 /usr/bin/salt-master
           ├─80275 /usr/bin/python3 /usr/bin/salt-master
           ├─80276 /usr/bin/python3 /usr/bin/salt-master
           ├─80278 /usr/bin/python3 /usr/bin/salt-master
           ├─80279 /usr/bin/python3 /usr/bin/salt-master
           ├─80282 /usr/bin/python3 /usr/bin/salt-master
           ├─80283 /usr/bin/python3 /usr/bin/salt-master
           ├─80284 /usr/bin/python3 /usr/bin/salt-master
           ├─80298 /usr/bin/python3 /usr/bin/salt-master
           └─80300 /usr/bin/python3 /usr/bin/salt-master

Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: Traceback (most recent call last):
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: yielded = next(result)
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 424, in handle_message
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: payload = self.decode_payload(payload)
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 448, in decode_payload
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: payload = salt.payload.loads(payload[0])
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: File "/usr/lib/python3.6/site-packages/salt/payload.py", line 121, in loads
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: raise SaltDeserializationError(exc_msg) from exc
Mar 28 14:58:11 saltmaster.domain.com salt-master[79896]: salt.exceptions.SaltDeserializationError: Could not deserialize msgpack message. See log for more info.
[root@saltmaster ~]#

[root@saltmaster ~]# salt -l all '*' test.ping 
[TRACE   ] Setting up log file logging: {'log_path': '/var/log/salt/master', 'log_level': 'warning', 'log_format': '%(asctime)s,%(msecs)03d [%(name)-17s:%(lineno)-4d][%(levelname)-8s][%(process)d] %(message)s', 'date_format': '%Y-%m-%d %H:%M:%S', 'max_bytes': 0, 'backup_count': 0, 'user': 'root'}
[TRACE   ] The required configuration section, 'fluent_handler', was not found the in the configuration. Not loading the fluent logging handlers module.
[TRACE   ] None of the required configuration sections, 'logstash_udp_handler' and 'logstash_zmq_handler', were found in the configuration. Not loading the Logstash logging handlers module.
[TRACE   ] Error loading log_handlers.sentry_mod: Cannot find 'raven' python library, 'sentry_handler' config is empty or not defined
[TRACE   ] Processing <bound method SaltfileMixIn.process_saltfile of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <bound method ConfigDirMixIn.process_config_dir of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: saltmaster.domain.com
[DEBUG   ] Missing configuration file: /root/.saltrc
[TRACE   ] Processing <bound method ExtendedTargetOptionsMixIn.process_pillar_target of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <bound method OutputOptionsMixIn.process_output of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <bound method OutputOptionsMixIn.process_output_file of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <bound method OutputOptionsMixIn.process_state_verbose of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <bound method NoParseMixin.process_no_parse of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5ce835bea0>, <CustomOption at 0x2b5cfb207eb8: -H/--hosts>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cf72c7488>, <CustomOption at 0x2b5cfb207e80: -E/--pcre>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430ae8>, <CustomOption at 0x2b5cfb207e48: -L/--list>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430b70>, <CustomOption at 0x2b5cfb207f28: -G/--grain>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430bf8>, <CustomOption at 0x2b5cfb207f98: -P/--grain-pcre>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430c80>, <CustomOption at 0x2b5cfb433048: -N/--nodegroup>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430d08>, <CustomOption at 0x2b5cfb433080: -R/--range>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430e18>, <CustomOption at 0x2b5cfb433160: -C/--compound>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430ea0>, <CustomOption at 0x2b5cfb4332b0: -J/--pillar-pcre>)
[TRACE   ] Processing functools.partial(<function TargetOptionsMixIn._create_process_functions.<locals>.process at 0x2b5cfb430d90>, <CustomOption at 0x2b5cfb4332e8: -S/--ipcidr>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x2b5cfb439048>, <CustomOption at 0x2b5cfb433748: --out-indent/--output-indent>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x2b5cfb4390d0>, <CustomOption at 0x2b5cfb433828: --out-file-append/--output-file-append>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x2b5cfb430f28>, <CustomOption at 0x2b5cfb433860: --no-color/--no-colour>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x2b5cfb439158>, <CustomOption at 0x2b5cfb433898: --force-color/--force-colour>)
[TRACE   ] Processing functools.partial(<function OutputOptionsMixIn._mixin_setup.<locals>.process at 0x2b5cfb4391e0>, <CustomOption at 0x2b5cfb4338d0: --state-output/--state_output>)
[TRACE   ] Processing <bound method LogLevelMixIn.process_log_level of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <bound method LogLevelMixIn.process_log_file of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <bound method LogLevelMixIn.process_log_level_logfile of <salt.cli.salt.SaltCMD object at 0x2b5ce8457b38>>
[TRACE   ] Processing <function TargetOptionsMixIn._mixin_after_parsed at 0x2b5cfb21e268>
[TRACE   ] Processing <function OutputOptionsMixIn._mixin_after_parsed at 0x2b5cfb21e8c8>
[TRACE   ] Processing <function SaltCMDOptionParser._mixin_after_parsed at 0x2b5cfb428378>
[TRACE   ] Processing <function LogLevelMixIn.__setup_logging_routines at 0x2b5cfb21d378>
[TRACE   ] Processing <function MergeConfigMixIn.__merge_config_with_cli at 0x2b5cfb21bd90>
[TRACE   ] Processing <function LogLevelMixIn.__setup_console_logger_config at 0x2b5cfb21d620>
[TRACE   ] Processing <function LogLevelMixIn.__setup_logfile_logger_config at 0x2b5cfb21d6a8>
[TRACE   ] Processing <function LogLevelMixIn.__setup_logging_config at 0x2b5cfb21d730>
[TRACE   ] Processing <function LogLevelMixIn.__verify_logging at 0x2b5cfb21d840>
[WARNING ] Insecure logging configuration detected! Sensitive data may be logged.
[TRACE   ] Processing <function LogLevelMixIn.__setup_logging at 0x2b5cfb21d7b8>
[TRACE   ] Setting up console logging: {'log_level': 'all', 'log_format': '[%(levelname)-8s] %(message)s', 'date_format': '%H:%M:%S'}
[TRACE   ] Setting up log file logging: {'log_path': '/var/log/salt/master', 'log_level': 'warning', 'log_format': '%(asctime)s,%(msecs)03d [%(name)-17s:%(lineno)-4d][%(levelname)-8s][%(process)d] %(message)s', 'date_format': '%Y-%m-%d %H:%M:%S', 'max_bytes': 0, 'backup_count': 0, 'user': 'root'}
[TRACE   ] The required configuration section, 'fluent_handler', was not found the in the configuration. Not loading the fluent logging handlers module.
[TRACE   ] None of the required configuration sections, 'logstash_udp_handler' and 'logstash_zmq_handler', were found in the configuration. Not loading the Logstash logging handlers module.
[TRACE   ] Error loading log_handlers.sentry_mod: Cannot find 'raven' python library, 'sentry_handler' config is empty or not defined
[DEBUG   ] Configuration file path: /etc/salt/master
[DEBUG   ] Reading configuration from /etc/salt/master
[DEBUG   ] Using cached minion ID from /etc/salt/minion_id: saltmaster.domain.com
[DEBUG   ] Missing configuration file: /root/.saltrc
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[TRACE   ] IPCClient: Connecting to socket: /var/run/salt/master/master_event_pub.ipc
[TRACE   ] ReqChannel send clear load={'cmd': 'publish', 'tgt': '*', 'fun': 'test.ping', 'arg': [], 'key': 'Wg/F7hQ5e7v0bJ+xte5KDko5TjywgJS5LAgP+Um5Ua6E1S4T11rY1G5hxT0x8yTXq4vKqRNOlJo=', 'tgt_type': 'glob', 'ret': '', 'jid': '', 'kwargs': {'show_timeout': True, 'show_jid': False, 'delimiter': ':'}, 'user': 'root'}
[TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out',)
[TRACE   ] ReqChannel send clear load={'cmd': 'publish', 'tgt': '*', 'fun': 'test.ping', 'arg': [], 'key': 'Wg/F7hQ5e7v0bJ+xte5KDko5TjywgJS5LAgP+Um5Ua6E1S4T11rY1G5hxT0x8yTXq4vKqRNOlJo=', 'tgt_type': 'glob', 'ret': '', 'jid': '', 'kwargs': {'show_timeout': True, 'show_jid': False, 'delimiter': ':'}, 'user': 'root'}
[TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out',)
[TRACE   ] ReqChannel send clear load={'cmd': 'publish', 'tgt': '*', 'fun': 'test.ping', 'arg': [], 'key': 'Wg/F7hQ5e7v0bJ+xte5KDko5TjywgJS5LAgP+Um5Ua6E1S4T11rY1G5hxT0x8yTXq4vKqRNOlJo=', 'tgt_type': 'glob', 'ret': '', 'jid': '', 'kwargs': {'show_timeout': True, 'show_jid': False, 'delimiter': ':'}, 'user': 'root'}
[TRACE   ] Failed to send msg SaltReqTimeoutError('Message timed out',)
[ERROR   ] Message timed out
[DEBUG   ] Closing AsyncReqChannel instance
[DEBUG   ] Closing IPCMessageSubscriber instance
[DEBUG   ] The functions from module 'nested' are being loaded by dir() on the loaded module
[DEBUG   ] LazyLoaded nested.output
[TRACE   ] data = Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
[root@saltmaster ~]# 

Expected behavior

The Salt Master should be reliable and not have these issues anymore.

Versions Report

```yaml
[root@saltmaster ~]# salt --versions-report
Salt Version:
          Salt: 3005.1

Dependency Versions:
          cffi: Not Installed
      cherrypy: Not Installed
      dateutil: Not Installed
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: Not Installed
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Aug 13 2020, 07:46:32)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 18.0.1
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4

System Versions:
          dist: rhel 7.9 Maipo
        locale: UTF-8
       machine: x86_64
       release: 3.10.0-1160.36.2.el7.x86_64
        system: Linux
       version: Red Hat Enterprise Linux Server 7.9 Maipo
```

Additional context

[root@saltmaster ~]# 

[root@saltmaster ~]# rpm -qa | grep salt
salt-minion-3005.1-1.el7.noarch
salt-syndic-3005.1-1.el7.noarch
salt-3005.1-1.el7.noarch
salt-master-3005.1-1.el7.noarch
[root@saltmaster ~]# rpm -qa | grep zero
zeromq-4.1.4-7.el7.x86_64
[root@saltmaster ~]# rpm -qa | grep msg
python36-msgpack-0.6.2-2.el7.x86_64
[root@saltmaster ~]# 

[root@saltmaster ~]# salt-key | wc -l
16
[root@saltmaster ~]# 

# cat /etc/salt/master | egrep -v "#|^$" 
keep_jobs: 8
worker_threads: 10
batch_safe_limit: 200
batch_safe_size: 100
syndic_master: salt-master.domain.com
syndic_wait: 20

ecstuchi commented 1 year ago

This is still happening with 3006.

$ salt -V
Salt Version:
          Salt: 3006.0

Python Version:
        Python: 3.10.11 (main, Apr 14 2023, 05:57:16) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 5.4.1
         PyZMQ: 23.2.0
        relenv: 0.11.2
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: rhel 7.9 Maipo
        locale: utf-8
       machine: x86_64
       release: 3.10.0-1160.88.1.el7.x86_64
        system: Linux
       version: Red Hat Enterprise Linux Server 7.9 Maipo

Tailing the log:

    yielded = next(result)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 431, in handle_message
    payload = self.decode_payload(payload)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 455, in decode_payload
    payload = salt.payload.loads(payload[0])
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/payload.py", line 121, in loads
    raise SaltDeserializationError(exc_msg) from exc
salt.exceptions.SaltDeserializationError: Could not deserialize msgpack message. See log for more info.
2023-05-02 12:53:42,369 [salt.client      :1906][ERROR   ][19750] Message timed out
2023-05-02 12:54:42,907 [salt.client      :1906][ERROR   ][20667] Message timed out

Do you need me to collect more data for this?
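If a condensed view of the master log would help, something like the following could summarize it. This is only a minimal sketch: it assumes the log file uses the `log_format` shown in the trace output above (`%(asctime)s,%(msecs)03d [%(name)-17s:%(lineno)-4d][%(levelname)-8s][%(process)d] %(message)s`), and the names `LINE_RE` and `summarize` are made up for illustration, not part of Salt.

```python
import re
from collections import Counter

# Matches lines written with the log_format quoted above, e.g.
# 2023-03-28 15:07:36,258 [salt.client      :1903][ERROR   ][120722] Message timed out
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"\[(?P<name>[^:\]]+?)\s*:(?P<lineno>\d+)\s*\]"
    r"\[(?P<level>\w+)\s*\]\[(?P<pid>\d+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Count log records per (logger name, level).

    Lines that do not match the format (traceback continuation
    lines, blank lines) are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            counts[(match["name"], match["level"])] += 1
    return counts
```

It could be run against the log with `summarize(open("/var/log/salt/master", errors="replace"))`, which would show, for example, how many `salt.client` timeouts follow each `salt.payload` deserialization failure.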

lapfrank12 commented 1 year ago

Same for us with 3006.1. It crashes every few days and breaks a ton of our automation all the time. I'm gonna have to schedule a cron job that looks at the /var/log/salt/master file to automatically restart the service when this happens.

ecstuchi commented 1 year ago

> Same for us with 3006.1. It crashes every few days and breaks a ton of our automation all the time. I'm gonna have to schedule a cron job that looks at the /var/log/salt/master file to automatically restart the service when this happens.

I had to do that. I've set up a cron job that runs as a watchdog every 5 minutes and restarts the service when the issue happens. It also emails my team each time. It's crazy how often it happens, and the number of minions on the master doesn't seem to matter.
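For anyone who needs the same stopgap, a watchdog along those lines could look roughly like this. It is a minimal sketch, not the script actually used here: it assumes the `Message timed out` line in `/var/log/salt/master` is the signal to act on, the state-file path and function names are hypothetical, and the email notification is left out.

```python
import subprocess
from pathlib import Path

LOG_PATH = Path("/var/log/salt/master")
# Hypothetical location for remembering how far the log was already read.
STATE_PATH = Path("/var/tmp/salt-master-watchdog.offset")
MARKER = "Message timed out"
SERVICE = "salt-master.service"


def new_log_text(log_path, state_path):
    """Return the text appended to the log since the previous run."""
    size = log_path.stat().st_size
    try:
        offset = int(state_path.read_text())
    except (FileNotFoundError, ValueError):
        offset = 0
    if offset > size:
        # The log was rotated or truncated, so start from the beginning.
        offset = 0
    with log_path.open("rb") as handle:
        handle.seek(offset)
        data = handle.read()
    state_path.write_text(str(offset + len(data)))
    return data.decode("utf-8", errors="replace")


def needs_restart(text, marker=MARKER):
    """True if any newly written log line contains the marker."""
    return any(marker in line for line in text.splitlines())


def main():
    if not LOG_PATH.exists():
        return
    if needs_restart(new_log_text(LOG_PATH, STATE_PATH)):
        subprocess.run(["systemctl", "restart", SERVICE], check=True)


if __name__ == "__main__":
    main()
```

Tracking the read offset keeps old timeout lines from triggering a restart on every run. This only masks the symptom; it does nothing about the underlying deserialization error.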

meaksh commented 1 year ago

This sounds related to https://github.com/saltstack/salt/issues/64061

keslerm commented 1 year ago

I've been seeing this same issue with some of our masters.

We are using the postgres jsonb returner, and I found that if I disable it, everything suddenly starts working fine.