saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Install Salt from the Salt package repositories here:
https://docs.saltproject.io/salt/install-guide/en/latest/
Apache License 2.0

exhausting all the worker threads on the master with publish #27421

Closed mimianddaniel closed 8 years ago

mimianddaniel commented 9 years ago

Running 2015.5.5, with a master and syndics in the topology. Based on my testing, if I submit more events than the number of MWorkers, I get the following error. I have already upped my worker count and tested at 128 threads. It looks like Salt is able to handle ~128 events published at once, but beyond that it hits the error below.

Submitting ~300 publish events at once caused the following:

2015-09-25 21:51:34,553 [salt.payload                             ][INFO    ][16240] SaltReqTimeoutError: after 15 seconds. (Try 1 of 3)
2015-09-25 21:51:35,136 [salt.payload                             ][INFO    ][15983] SaltReqTimeoutError: after 15 seconds. (Try 1 of 3)
<snip>
Traceback (most recent call last):
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/master.py", line 1460, in run_func
    ret = getattr(self, func)(load)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/master.py", line 1391, in minion_pub
    return self.masterapi.minion_pub(clear_load)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/daemons/masterapi.py", line 874, in minion_pub
    ret['jid'] = self.local.cmd_async(**pub_load)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/client/__init__.py", line 333, in cmd_async
    **kwargs)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/client/__init__.py", line 299, in run_job
    raise SaltClientError(general_exception)
SaltClientError: Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased
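
The only tuning I have in place so far is the worker_threads bump mentioned above; roughly this in the master config (128 is just what I tested with, not a recommendation):

# /etc/salt/master
# Number of MWorker processes servicing requests on the ReqServer
# (default is 5). Tested at 128 and still hit the timeouts once more
# than ~128 publishes arrive at once.
worker_threads: 128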
thatch45 commented 8 years ago

Well, it is a problem with at-scale async code; corner cases have a lot of places to hide! We have been working on the multi-master update with customers well into the tens of thousands of minions, which we have been monitoring closely. Your call of course; I am still surprised at what you are seeing, but 2016.3 should be better.

DaveQB commented 8 years ago

@thatch45 I am assuming we'd have to install from pip or similar for 2016.3, as there wouldn't be any deb packages yet.

If you are surprised, I am happy to keep digging at this to see what else we can find that could be the issue.

PS: How much is paid support? PPS: The event bus _stamp values are around 12:50 am. They seem to be moving in real time, so it will never catch up, if it's meant to. Oh, maybe it is UTC; we are +10 here and these stamps are on a 10-hour offset. 99% of these messages are "accept" public key messages.

Thanks again for your time, Thomas.

thatch45 commented 8 years ago

Yes, that sounds like it is cycling multi-master connections; we fixed one instance of that here: https://github.com/saltstack/salt/pull/33142 Send me your email and I will get sales in touch with you. You have been a great member of the community and I am sorry that a bug has caused you so much trouble!

DaveQB commented 8 years ago

Yes it does. I captured the event messages and then had a look. The same minions are authing over and over. Here's an example: http://paste.ubuntu.com/16354664 It looks like it happens about every 50 seconds.

Thanks a lot @thatch45 Appreciate it. We have had a major issue each time we have upgraded (always confirming my nervousness around upgrading). But we've come through in the end; it keeps me busy :)

Will do.

DaveQB commented 8 years ago

@thatch45 Some minions are staying connected with master01 and have multiple masters listed in their config. They are also not in the salt events log I captured (i.e. they are not re-authing continuously).
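
For reference, the multi-master part of the minion config on those minions looks roughly like this (hostnames are placeholders, not our real ones):

# /etc/salt/minion (placeholder hostnames)
master:
  - master01.example.com
  - master02.example.com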

Any recommendations on the best things to compare between "good" and "bad" minions?

thatch45 commented 8 years ago

That is odd; let me ask @DmitryKuzmenko, he is the one who has done most of the work on multi-master. @DmitryKuzmenko, does any of this sound familiar? Minions switching back and forth between masters?

DaveQB commented 8 years ago

Thanks @thatch45 and @DmitryKuzmenko

cachedout commented 8 years ago

@DmitryKuzmenko can confirm, but asymmetry between multiple masters was an issue that was definitely present in earlier versions of Salt and I'd be reasonably confident in recommending 2016.3 as a remedy for that behavior.

DaveQB commented 8 years ago

@cachedout Yes, I hit that in previous versions. There was a patch for 2014.7.5 that worked a treat. I've been talking with my boss and his boss, and understandably they aren't keen to run RC code in production and are more keen to roll back. In the interim, I am going to loop over all the minions with ssh and change them to a single master.
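
Roughly what I'll push to each minion (a sketch only; the hostname is a placeholder and our real configs have more in them):

# /etc/salt/minion (sketch, placeholder hostname)
# Replace the multi-master list with a single master entry.
master: master01.example.com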

Just to recap, my original issue was exhausting all master worker_threads. I have since turned off our highstate cron jobs and also applied the patch in https://github.com/saltstack/salt/pull/33021

Then the multi-master/re-auth issue arose, like I have experienced before. I am unsure of the relationship (if any) between these symptoms.

cachedout commented 8 years ago

@DaveQB Understood. I was just commenting on that multi-master issue specifically. The concern about running RC code is, of course, completely valid. We're very close on a final release though, so hopefully that's at least good news. :]

DaveQB commented 8 years ago

@cachedout That would be good. Will this release allow a 2016.3 master to talk to a 2015.8.8 minion? Some previous releases worked like that, which allowed us a nice, gradual upgrade across the environment.

Thanks.

cachedout commented 8 years ago

@DaveQB Yes, that will be the case.

DaveQB commented 8 years ago

@cachedout WOOHOO! Err hmmm sorry for my burst of excitement.

DaveQB commented 8 years ago

Just a little update on this. I started the process of upgrading to 2016.3 last week, but too late to save us from an outage over the weekend. I have posted here: https://groups.google.com/forum/#!topic/salt-users/dyb0qY-2Efw

In summary: our single salt master hit a load average of 111, yes you read that right, 111. The salt-master service finally collapsed. I am not sure why this suddenly happened, but turning off our scheduled state.apply brought the load right down.

I would have thought 200 minions shouldn't be a problem for a c3.xlarge salt-master https://aws.amazon.com/ec2/instance-types/

I am quickly proceeding with the upgrade.

thatch45 commented 8 years ago

@DaveQB That does surprise me! But from your comments it looks like pillar is a likely culprit. While we did add quite a few enhancements to 2016.3 to improve worker performance, we also added a pillar cache option, which makes the master cache the pillar for a little while instead of regenerating it every time; the PR for it is here: https://github.com/saltstack/salt/pull/30686
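
If you want to try it once you are on 2016.3, it is just a few master config options; as I recall they look something like this (values are only examples, tune to taste):

# /etc/salt/master (example values only)
pillar_cache: True          # serve a cached pillar instead of re-rendering it for every request
pillar_cache_ttl: 3600      # how long, in seconds, to keep the cached pillar
pillar_cache_backend: disk  # 'disk' or 'memory'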

Please forgive me if this has already been covered; unfortunately I have only had a moment to review your comments. If you need more help here, please let us know!

DaveQB commented 8 years ago

Thanks for the fast response @thatch45

It was the scheduled state.highstate (set via pillar) that I removed that relieved the system resource pressure. I don't think it was specifically a pillar issue, as we had a similar problem when I was running highstate through a cron job, and not until I removed that did the salt masters recover.

Don't worry about this for now @thatch45, this is an older version now. I'll get the upgrade to 2016.3 done, then re-enable scheduled highstates, and if I have a problem I'll be sure to yell :)
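
When I do turn the scheduled highstates back on, I'm planning to spread them out with a splay so all ~200 minions don't hit the master at the same moment; something roughly like this in pillar (interval and splay values are placeholders I have not tested):

# pillar sketch (placeholder values, untested)
schedule:
  highstate:
    function: state.apply
    minutes: 60   # run a highstate roughly once an hour
    splay: 600    # add up to 10 minutes of random per-minion delay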

Thanks.

DaveQB commented 8 years ago

I'm trying to set up a 2016.3.2 master for testing and it can't connect to a 2016.3.2 external minion, yet the minion can reach the master with a salt-call. Should I start a new issue on GitHub or a thread on Google Groups? I have salt event info.

Thanks. PS: I have also had the "SaltClientError: Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased" error on this new master. It has 1.6GB of RAM and 4 minions trying to attach.

thatch45 commented 8 years ago

Yes, let's start a new issue. Have you tried adding the pillar cache options to the master?