Closed ntt-raraujo closed 9 months ago
Having the same issue with salt-proxies (napalm to NX-OS switches) since upgrading to 3006.x; previously our environment was on 3004.x.
I just updated to version 3006.4. It improved a little (a few seconds), but I still hit the timeout issue for a proxy located in Asia (180 ms latency from master to proxy). The time it takes to run a highstate on 3006.4 is still twice that of 3004.2.
@ntt-raraujo @ITJamie
Is it possible for either of you to come up with an example state in which I can test against both 3004 and 3006 to see the difference? That may help me identify the cause of the behavior you are seeing.
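For anyone wanting to compare the two versions, a hypothetical benchmark state (not the reporters' actual state files) that generates many trivial state runs would exaggerate any per-state overhead:

```yaml
# benchmark.sls -- illustrative only: 50 trivial states in one highstate/apply
{% for i in range(50) %}
benchmark-step-{{ i }}:
  test.succeed_without_changes:
    - name: step {{ i }}
{% endfor %}
```

Applying this with `salt '<proxy_id>' state.apply benchmark` on both a 3004 and a 3006 master should make the wall-clock difference obvious if the slowdown scales with the number of state runs.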
@ntt-raraujo We fixed #65450 in 3006.5, can you test to see if that resolves this issue?
@dwoz 3006.5 fixed the issue. Thanks!
@dwoz Would you mind saying what the problem was? I couldn't find the issue in the release notes. I've been discussing this with my team for so long, it would be nice to have some closure. Thanks
It would also be great to know whether it was a master-side or minion-side change.
@ntt-raraujo @ITJamie Sorry for the late reply on this. I've been tied up working on other issues and just got back to this one. This was caused by a regression where the file client got re-created on each state run. The overhead of creating a new connection to the master multiple times during a highstate caused a substantial slowdown. The issue and fix were on the minion side.
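To illustrate why that matters (a hypothetical sketch, not the actual Salt patch — all names here are made up): the cost difference comes from paying the connection setup once per highstate versus once per state, which adds up quickly over a high-latency link.

```python
# Illustrative sketch only: compares re-creating a "file client" per state run
# versus creating it once and reusing it. The 0.18 s delay stands in for the
# ~180 ms master<->proxy latency mentioned above.
import time


class FakeFileClient:
    """Stand-in for a file client; constructing one simulates a WAN connection."""

    CONNECT_DELAY = 0.18

    def __init__(self):
        time.sleep(self.CONNECT_DELAY)  # pay the connection-setup cost

    def get_file(self, path):
        return f"contents of {path}"


def highstate_recreating_client(states):
    # Pre-fix behaviour: a new client (new connection) for every state run.
    for path in states:
        client = FakeFileClient()
        client.get_file(path)


def highstate_reusing_client(states):
    # Post-fix behaviour: one client created up front and reused.
    client = FakeFileClient()
    for path in states:
        client.get_file(path)


if __name__ == "__main__":
    states = [f"salt://state_{i}.sls" for i in range(20)]
    for fn in (highstate_recreating_client, highstate_reusing_client):
        start = time.monotonic()
        fn(states)
        print(f"{fn.__name__}: {time.monotonic() - start:.2f}s")
```

With 20 states and 180 ms of simulated latency, the re-creating variant spends roughly 3.4 s more on connection setup alone, which is in line with the "twice as long" highstates reported here.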
Fixed in 3006.5
Description
After deploying Salt 3006, the timeout had to be changed from 30 seconds to 60 seconds; otherwise, the error 'Minion did not respond' would occur when applying highstates to proxy minions (see the config sketch below).
We currently have a 3004 deployment that works fine with a 30-second timeout.
I'm using the same files on the 3006 and 3004 servers (same pillars, master file, state files, custom modules, and so on). The only differences are the Salt version and the OS.
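An illustrative excerpt of the change, assuming the value being raised is the master's CLI job timeout (not our full config):

```yaml
# /etc/salt/master -- illustrative excerpt
#timeout: 30   # sufficient on the 3004 deployment
timeout: 60    # needed on 3006.x to avoid "Minion did not respond"
```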
Setup
Both the 3004 and 3006 servers/proxies are on the same subnets and have the same virtual resources/VM settings.
Steps to Reproduce the behavior
Run a highstate against a proxy minion twice (commands sketched below): the first run with our current default timeout of 30 seconds, and the second run with a 60-second timeout.
Salt 3004 version (same subnets and same files)
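A rough way to compare the two runs (hypothetical commands; the proxy target is a placeholder):

```
# On the 3006 master: fails with "Minion did not respond" at our default 30 s
time salt -t 30 '<proxy_id>' state.highstate

# Same run with the raised timeout completes, but takes roughly twice as long
# as the equivalent run against the 3004 master
time salt -t 60 '<proxy_id>' state.highstate
```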
Expected behavior
A smaller gap in required timeouts between our 3004 and 3006 deployments.
Versions Report
salt --versions-report
```yaml
[root@SaltMaster ~]$ salt --versions
Salt Version:
          Salt: 3006.2

Python Version:
        Python: 3.10.12 (main, Aug 3 2023, 21:47:10) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.13.3
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: oracle 8.8
        locale: utf-8
       machine: x86_64
       release: 5.4.17-2136.321.4.1.el8uek.x86_64
        system: Linux
       version: Oracle Linux Server 8.8

[root@SaltProxy ~]# salt-proxy --versions
Salt Version:
          Salt: 3006.2

Python Version:
        Python: 3.10.12 (main, Aug 3 2023, 21:47:10) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.13.3
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: oracle 8.8
        locale: utf-8
       machine: x86_64
       release: 5.4.17-2136.321.4.1.el8uek.x86_64
        system: Linux
       version: Oracle Linux Server 8.8
```

Additional context
Is there any other way to debug this issue? Or is there a way to debug ZeroMQ itself to check for transport problems?
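One general way to get more visibility (standard Salt options, though whether they expose transport timing in enough detail is a separate question) is to run both ends in the foreground with debug logging while reproducing the slow highstate:

```
# On the proxy host, run the proxy in the foreground with debug logging
salt-proxy --proxyid=<proxy_id> -l debug

# On the master, publish the job with debug logging and the raised timeout
salt -l debug -t 60 '<proxy_id>' state.highstate
```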