Closed ntt-raraujo closed 9 months ago
Having the same issue with salt-proxies (napalm to NX-OS switches) since upgrading to 3006.x; previously our environment was on 3004.x.
I just updated to version 3006.4. It improved a little (a few seconds), but I still hit the timeout issue for a proxy located in Asia (180 ms latency from master to proxy). The time it takes to run a highstate on 3006.4 is still twice that of 3004.2.
@ntt-raraujo @ITJamie
Is it possible for either of you to come up with an example state in which I can test against both 3004 and 3006 to see the difference? That may help me identify the cause of the behavior you are seeing.
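For anyone wanting to compare the two versions, a hypothetical benchmark state (not the reporters' actual state files) that generates many trivial state runs would exaggerate any per-state overhead:

```yaml
# benchmark.sls -- illustrative only: 50 trivial states in one highstate/apply
{% for i in range(50) %}
benchmark-step-{{ i }}:
  test.succeed_without_changes:
    - name: step {{ i }}
{% endfor %}
```

Applying this with `salt '<proxy_id>' state.apply benchmark` on both a 3004 and a 3006 master should make the wall-clock difference obvious if the slowdown scales with the number of state runs.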
@ntt-raraujo We fixed #65450 in 3006.5, can you test to see if that resolves this issue?
@dwoz 3006.5 fixed the issue. Thanks!
@dwoz Would you mind saying what the problem was? I couldn't find the issue in the release notes. I've been discussing this with my team for so long, it would be nice to have some closure. Thanks
It would also be great to know whether it was a master-side or minion-side change.
@ntt-raraujo @ITJamie Sorry for the late reply on this. I've been tied up working on other issues and just got back to this one. This was caused by a regression where the file client got re-created on each state run. The overhead of creating a new connection to the master multiple times during a highstate caused a substantial slowdown. The issue and fix were on the minion side.
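To illustrate why that matters (a hypothetical sketch, not the actual Salt patch — all names here are made up): the cost difference comes from paying the connection setup once per highstate versus once per state, which adds up quickly over a high-latency link.

```python
# Illustrative sketch only: compares re-creating a "file client" per state run
# versus creating it once and reusing it. The 0.18 s delay stands in for the
# ~180 ms master<->proxy latency mentioned above.
import time


class FakeFileClient:
    """Stand-in for a file client; constructing one simulates a WAN connection."""

    CONNECT_DELAY = 0.18

    def __init__(self):
        time.sleep(self.CONNECT_DELAY)  # pay the connection-setup cost

    def get_file(self, path):
        return f"contents of {path}"


def highstate_recreating_client(states):
    # Pre-fix behaviour: a new client (new connection) for every state run.
    for path in states:
        client = FakeFileClient()
        client.get_file(path)


def highstate_reusing_client(states):
    # Post-fix behaviour: one client created up front and reused.
    client = FakeFileClient()
    for path in states:
        client.get_file(path)


if __name__ == "__main__":
    states = [f"salt://state_{i}.sls" for i in range(20)]
    for fn in (highstate_recreating_client, highstate_reusing_client):
        start = time.monotonic()
        fn(states)
        print(f"{fn.__name__}: {time.monotonic() - start:.2f}s")
```

With 20 states and 180 ms of simulated latency, the re-creating variant spends roughly 3.4 s more on connection setup alone, which is in line with the "twice as long" highstates reported here.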
Fixed in 3006.5
Description
After deploying Salt 3006, the timeout had to be changed from 30 seconds to 60 seconds; otherwise, the error 'Minion did not respond' would occur when applying highstates to proxy minions (see the config sketch below).
We currently have a 3004 deployment that works fine with a 30-second timeout.
I'm using the same files on the 3006 and 3004 servers (same pillars, master file, state files, custom modules, and so on). The only differences are the Salt version and the OS.
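An illustrative excerpt of the change, assuming the value being raised is the master's CLI job timeout (not our full config):

```yaml
# /etc/salt/master -- illustrative excerpt
#timeout: 30   # sufficient on the 3004 deployment
timeout: 60    # needed on 3006.x to avoid "Minion did not respond"
```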
Setup
Both the 3004 and 3006 servers/proxies are on the same subnets and have the same virtual resources/VM settings.
Steps to Reproduce the behavior
Run a highstate against a proxy minion twice (commands sketched below): the first run with our current default timeout of 30 seconds, and the second run with a 60-second timeout.
Salt 3004 version (same subnets and same files)
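A rough way to compare the two runs (hypothetical commands; the proxy target is a placeholder):

```
# On the 3006 master: fails with "Minion did not respond" at our default 30 s
time salt -t 30 '<proxy_id>' state.highstate

# Same run with the raised timeout completes, but takes roughly twice as long
# as the equivalent run against the 3004 master
time salt -t 60 '<proxy_id>' state.highstate
```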
Expected behavior
A smaller gap in required timeouts between our 3004 and 3006 deployments.
Versions Report
salt --versions-report
```yaml
[root@SaltMaster ~]$ salt --versions
Salt Version:
          Salt: 3006.2

Python Version:
        Python: 3.10.12 (main, Aug 3 2023, 21:47:10) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.13.3
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: oracle 8.8
        locale: utf-8
       machine: x86_64
       release: 5.4.17-2136.321.4.1.el8uek.x86_64
        system: Linux
       version: Oracle Linux Server 8.8

[root@SaltProxy ~]# salt-proxy --versions
Salt Version:
          Salt: 3006.2

Python Version:
        Python: 3.10.12 (main, Aug 3 2023, 21:47:10) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.13.3
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: oracle 8.8
        locale: utf-8
       machine: x86_64
       release: 5.4.17-2136.321.4.1.el8uek.x86_64
        system: Linux
       version: Oracle Linux Server 8.8
```

Additional context
Is there any other way to debug this issue? Or is there a way to debug ZeroMQ itself to check for transport problems?
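One general way to get more visibility (standard Salt options, though whether they expose transport timing in enough detail is a separate question) is to run both ends in the foreground with debug logging while reproducing the slow highstate:

```
# On the proxy host, run the proxy in the foreground with debug logging
salt-proxy --proxyid=<proxy_id> -l debug

# On the master, publish the job with debug logging and the raised timeout
salt -l debug -t 60 '<proxy_id>' state.highstate
```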