saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

salt '*' test.ping command does not return #18050

Closed: dsumsky closed this issue 9 years ago

dsumsky commented 9 years ago

I noticed weird behavior of the command

salt -v '*' test.ping

when run from a Salt master with approximately 300 accepted minions. The master's worker_threads is set to 100. The master's versions report:

salt --versions-report
           Salt: 2014.1.7
         Python: 2.6.6 (r266:84292, Sep  4 2013, 07:46:00)
         Jinja2: 2.2.1
       M2Crypto: 0.20.2
 msgpack-python: 0.1.13
   msgpack-pure: Not Installed
       pycrypto: 2.0.1
         PyYAML: 3.10
          PyZMQ: 2.2.0.1
            ZMQ: 3.2.4
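
For reference, a minimal way to double-check the effective worker count on that master (a sketch, assuming the default config path /etc/salt/master):

grep -E '^worker_threads' /etc/salt/master    # expect: worker_threads: 100
ps -ef | grep -c '[s]alt-master'              # rough count of salt-master processes (the workers plus a few helpers)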

Output of the command:

...
...
GOVCAD-xxxxxBxxx01-TRU-test-XXXXa410:
    True
GOVCAD-xxxxxBxxx01-and-GovCloud-0010-XXXXa2b7:
    True
CAD-xxxxxBxxx01-and-GovCloud-0010-XXXXda97:
    True
GOVCAD-xxxxxBxxx01-TRU-test-XXXXa410:
    True
GOVCAD-xxxxxBxxx01-and-GovCloud-0010-XXXXa2b7:
    True
GOVCAD-xxxxxBxxx01-TRU-test-XXXXa410:
    True
Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5
Execution is still running on M3-xxxxxBxxxCUS01-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-and-GovCloud-0010-XXXXda97
Execution is still running on IMC-xxxxxBxxx01-xxxxxxxx-ind-epak1
Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-whitehat-test
Execution is still running on M3-xxxxxBxxxST02-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-approva-testing-01-XXXX85e2
Execution is still running on M3Fashion-xxxxxBxxxBE01-xxxxxxxx-fashion-b02-XXXXb407
Execution is still running on CAD-xxxxxBxxx01-and-mingle-pentest-001

Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5
Execution is still running on M3-xxxxxBxxxCUS01-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-and-GovCloud-0010-XXXXda97
Execution is still running on IMC-xxxxxBxxx01-xxxxxxxx-ind-epak1
Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-whitehat-test
Execution is still running on M3-xxxxxBxxxST02-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-approva-testing-01-XXXX85e2
Execution is still running on M3Fashion-xxxxxBxxxBE01-xxxxxxxx-fashion-b02-XXXXb407
Execution is still running on CAD-xxxxxBxxx01-and-mingle-pentest-001

Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5
Execution is still running on M3-xxxxxBxxxCUS01-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-and-GovCloud-0010-XXXXda97
Execution is still running on IMC-xxxxxBxxx01-xxxxxxxx-ind-epak1
Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-whitehat-test
Execution is still running on M3-xxxxxBxxxST02-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-approva-testing-01-XXXX85e2
Execution is still running on M3Fashion-xxxxxBxxxBE01-xxxxxxxx-fashion-b02-XXXXb407
Execution is still running on CAD-xxxxxBxxx01-and-mingle-pentest-001
...
...

This block of responses keeps repeating and the command never returns. When I run test.ping directly against any of those minions (from the block above), it returns successfully:

salt CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5 test.ping
CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5:
   True
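
For what it's worth, one way to check whether the "still running" job ever actually finished is to look it up in the master's job cache (a sketch, assuming the jobs runner is available in this release; <jid> is a placeholder for the jid printed by the hanging command):

salt-run jobs.active            # jobs the master still considers running
salt-run jobs.lookup_jid <jid>  # returns collected so far for that job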

The minions are running versions <= 2014.1.7. They are a mixture of Windows and Linux minions. I can't estimate how many minions were down at the moment of the test, but IMHO that shouldn't affect the behavior of test.ping.

I performed one more test.ping test where I looped through all the accepted minions one by one. It finally finished on all minions (each minion either responded or reported 'Minion did not return', but there was no infinite loop):

for id in $(salt-key -l acc | cut -d' ' -f6 ); do salt -v $id test.ping; done

Executing job with jid 20141113060209837047
CAD-xxxxxBxxx01-and-GovCloud-0010-XXXXda97:
    True
Executing job with jid 20141113060210841938
CAD-xxxxxBxxx01-and-mingle-pentest-001:
    True
Executing job with jid 20141113060211191969
CAD-xxxxxBxxx01-ANDTEST01:
    Minion did not return
Executing job with jid 20141113060220084957
CAD-xxxxxBxxx01-approva-testing-01-XXXX85e2:
    True
Executing job with jid 20141113060220675951
CAD-xxxxxBxxx01-xxxxxxxx-and-epak3:
    Minion did not return
...
...
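
A couple of related checks that might help narrow this down (a sketch, assuming the manage runner and CLI batching behave as documented for this release):

salt-run manage.down         # accepted minions that do not respond to a ping
salt-run manage.up           # accepted minions that do respond
salt -b 25 '*' test.ping     # ping in batches of 25 instead of all ~300 at once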

Can anybody explain to me what may be going on under the hood?

dsumsky commented 9 years ago

I can see the same behavior with other functions, e.g. when I call the following from the master:

salt '*' grains.get id

The command returns minion IDs for a bunch of minions successfully, but then gets stuck in a loop waiting for the same minions as above:

Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5
Execution is still running on M3-xxxxxBxxxCUS01-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-and-GovCloud-0010-XXXXda97
Execution is still running on IMC-xxxxxBxxx01-xxxxxxxx-ind-epak1
Execution is still running on CAD-xxxxxBxxx01-xxxxxxxx-and-whitehat-test
Execution is still running on M3-xxxxxBxxxST02-xxxxxxxx-m3-b02-vpc2-XXXX16e5
Execution is still running on CAD-xxxxxBxxx01-approva-testing-01-XXXX85e2
Execution is still running on M3Fashion-xxxxxBxxxBE01-xxxxxxxx-fashion-b02-XXXXb407
Execution is still running on CAD-xxxxxBxxx01-and-mingle-pentest-001

But when I try it again directly against a single minion, e.g.

salt CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5 grains.get id
CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5:
    CAD-xxxxxBxxx01-xxxxxxxx-and-golden-01-XXXXf7d5

it again succeeds.
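
A possible stopgap is to tighten the client and master timeouts so the CLI gives up on unresponsive minions sooner (a sketch; whether gather_job_timeout is honoured by this particular master version is an assumption):

salt -v -t 30 '*' test.ping                       # -t/--timeout: seconds to wait for returns
grep -E '^gather_job_timeout' /etc/salt/master    # timeout for the "is the job still running?" checks behind those messages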

jfindlay commented 9 years ago

Thanks for reporting this. We have had similar issues in 2014.1. Can you upgrade to 2014.7?

dsumsky commented 9 years ago

Currently, we are running testing/production deployments on 2014.1.7. There's no plan to upgrade to 2014.7 as there is no stable release yet. Is the 'similar' issue you mentioned tracked in a reported issue as well?

jfindlay commented 9 years ago

This might be related to #13753. In particular, see this comment.

2014.7.0 was stabilized several days ago and distro packages are available. We're waiting to resolve some minor deb package issues before announcing it, so hopefully you'll be able to try it within the next few days. Thanks for working on this.

Rucknar commented 9 years ago

@dsumsky I'm guessing you're on RHEL/CentOS with those versions of zmq/python-zmq? We've seen similar behaviour in our deployment. We're upgrading to 2014.7.0 on Monday; it's in EPEL and should be available for you to use.
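
On RHEL/CentOS the installed transport versions can be checked directly (a quick sketch; exact package names vary between repos):

rpm -qa | grep -Ei 'zeromq|zmq'        # whichever zeromq / python-zmq packages are installed
salt --versions-report | grep -i zmq   # the versions Salt itself is using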

dsumsky commented 9 years ago

My colleague upgraded the affected Salt master to the latest stable SaltStack release, 2014.7.0, and the issue is still not resolved.

Rucknar commented 9 years ago

Hey @dsumsky. An update from our side: we've upgraded all the minions and the master to python-zmq 14.x and zeromq 4.x from the COPR saltstack repo. Everything is working like a charm now; we still plan on moving to 2014.7.0 later this week.

dsumsky commented 9 years ago

My colleague upgraded the affected Salt master to the latest stable SaltStack release, 2014.7.0, and the issue is still not resolved. Additionally, I have noticed that minions are not closing their 'return' connections to the master's port 4506; there are approximately 400 such connections. Any idea what may cause these "stale" connections? If I'm not mistaken, a minion should return job status over this port, but it shouldn't keep the connection established forever; it should rather connect to the port on demand when it has something to report.
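
For reference, a quick way to count those established return-port connections on the master (a sketch; use whichever of netstat or ss is installed):

netstat -tn | grep ':4506 ' | grep -c ESTABLISHED        # established connections to the ret port
ss -tn state established '( sport = :4506 )' | wc -l     # same with iproute2 (subtract one for the header line)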

rallytime commented 9 years ago

@dsumsky We have definitely seen errors like this resolved by upgrading the version of ZMQ to 4.x and python-zmq to 14.x, as @Rucknar stated. Are you in a position to give that a try to see if that helps alleviate your issue?

https://copr.fedoraproject.org/coprs/saltstack/
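
A rough outline of that upgrade path on RHEL/CentOS (a sketch; the package names and service commands are assumptions, check what the repo actually provides):

# after saving the .repo file from the COPR project above into /etc/yum.repos.d/
yum clean expire-cache
yum -y upgrade zeromq python-zmq      # verify names first with: yum list installed | grep -i zmq
service salt-master restart           # on the master
service salt-minion restart           # on each Linux minion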

dsumsky commented 9 years ago

I assume that the whole infrastructure, masters and minions, should ideally be updated to 0mq version 4, right? Otherwise the best version will be selected, meaning that if a minion has 0mq 3 and the master has 0mq 4, then version 4 will be used.

rallytime commented 9 years ago

@dsumsky Yep, you need to upgrade the zmq version on both masters and minions.

dsumsky commented 9 years ago

Hello, I'm going to close the issue, as we found out that its primary cause lies in the design of some of the states we use in the affected environment. I will reopen it if the problem persists once we have worked through that. Thanks.

rallytime commented 9 years ago

@dsumsky Great! I am glad you figured out what the problem was and thank you for following up here.