I can see this behavior with other functions too, e.g. when I call from the master
salt '*' grains.get id
The command successfully returns minion IDs for a bunch of minions but then gets stuck in a loop, waiting for the same minions as above:
Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5
Execution is still running on M3-xxxxxBxxxCUS01-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-and-GovCloud-0010-XXXXda97
Execution is still running on IMC-xxxxxBxxx01-xxxxxxxx-ind-epak1
Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-whitehat-test
Execution is still running on M3-xxxxxBxxxST02-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-approva-testing-01-XXXX85e2
Execution is still running on M3Fashion-xxxxxBxxxBE01-xxxxxxxx-fashion-b02-XXXXb407
Execution is still running on CAD-xxxxxBxxx01-and-mingle-pentest-001
But when I then run the command directly against one of those minions, e.g.
salt CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5 grains.get id
CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5:
CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5
it succeeds again.
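(To check whether the returns for the hung job actually reached the master, the job can be looked up with the jobs runner; the JID below is only a placeholder, substitute the real one:)
salt-run jobs.list_jobs
salt-run jobs.lookup_jid 20141107101010101010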
Thanks for reporting this. We have had similar issues in 2014.1. Can you upgrade to 2014.7?
Currently, we are running testing/production deployments on 2014.1.7. There's no plan to upgrade to 2014.7 as there is no stable release yet. Is the 'similar' issue you mention reported somewhere as well?
This might be related to #13753. In particular, see this comment.
2014.7.0 was stabilized several days ago and distro packages are available. We're waiting to resolve some minor deb package issues before announcing it, so hopefully you'll be able to try it within the next few days. Thanks for working on this.
@dsumsky I'm guessing you're on RHEL/CentOS with those versions of zmq/python-zmq? We've seen similar behaviour in our deployment. We're upgrading to 2014.7.0 on Monday; it's in EPEL and should be available for you to use.
My colleague upgraded the affected Salt master to the latest stable SaltStack release, 2014.7.0, and the issue is still not resolved.
Hey @dsumsky. An update from our side: we've upgraded all the minions and the master to python-zmq 14.x and zeromq 4.x from the COPR saltstack repo. All working like a charm now; we still plan on moving to 2014.7.0 later this week.
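(Roughly, the upgrade on RHEL/CentOS can look like the sketch below - this assumes the COPR repo file is already in /etc/yum.repos.d/ and that the packages are named zeromq and python-zmq, both of which may differ on your systems:)
yum clean expire-cache
yum upgrade zeromq python-zmq
service salt-minion restart    # and service salt-master restart on the master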
My colleague upgraded the affected Salt master to the latest stable SaltStack release, 2014.7.0, and the issue is still not resolved. Additionally, I have noticed that minions are not closing 'return' connections to the master's port 4506. There are approx. 400 'return' connections. Any idea what may cause such "stale" connections? If I'm not mistaken, a minion should return a job status over that connection, but it shouldn't keep it established "forever". It should rather connect to the port on demand when it has something to report.
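(A quick way to count those connections on the master, assuming netstat is available there:)
netstat -tn | grep ':4506' | grep ESTABLISHED | wc -l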
@dsumsky We have definitely seen errors like this resolved by upgrading the version of ZMQ to 4.x and python-zmq to 14.x, as @Rucknar stated. Are you in a position to give that a try to see if that helps alleviate your issue?
I assume that the whole infrastructure - masters and minions - should ideally be updated to 0mq version 4, right? Otherwise the best version will be selected; that is, if a minion has 0mq 3 and the master has 4, then version 4 will be used.
@dsumsky Yep, you need to upgrade the zmq version on both masters and minions.
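(One way to verify what each side is actually running, assuming the minions are responsive, is the built-in versions report:)
salt --versions-report            # master-side versions, including ZeroMQ and python-zmq
salt '*' test.versions_report     # the same report collected from each minion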
Hello, I'm going to close the issue, as we found out that its primary cause is the design of some states we use in the affected environment. I will reopen it once we get past that. Thanks.
@dsumsky Great! I am glad you figured out what the problem was and thank you for following up here.
I noticed weird behavior with the command salt -v '*' test.ping
run from a Salt master that has approximately 300 accepted minions. The Salt master has worker_threads set to 100. The Salt master's versions report:
Output of the command:
This block of responses keeps repeating and the command never returns. When I try to run test.ping directly against any of those minions (from the block above), it returns successfully:
The minions are running version <= 2014.1.7. They are a mixture of Windows and Linux minions. I can't estimate how many minions are down at the time of the test, but IMHO this shouldn't affect the behavior of test.ping.
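(If it helps, one way to get that estimate, assuming the manage runner works in this environment:)
salt-run manage.status    # lists minions as up / down
salt-run manage.down      # only the unresponsive ones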
I performed one more test.ping test where I loop through all the accepted minions, one by one (roughly as sketched at the end of this post). It finally completes on all minions - each one either responds or shows 'did not return', but there is no infinite loop:
Can anybody explain to me what may be going on under the hood?
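(A minimal sketch of how such a per-minion loop could look, assuming salt-key's default output with an 'Accepted Keys:' header line that tail skips:)
for m in $(salt-key -l acc | tail -n +2); do
    salt -t 30 "$m" test.ping
done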