Well, that is the problem with async code at scale: corner cases have a lot of places to hide! We have been running the multi-master update with customers well into the tens of thousands of minions, which we have been monitoring closely. Your call of course; I am still surprised at what you are seeing, but 2016.3 should be better.
@thatch45 I am assuming we'd have to install 2016.3 from pip or similar, as there wouldn't be any deb packages yet.
If you are surprised, I am happy to keep digging at this to see what else we can find that could be the issue.
PS: How much is paid support?
PPS: The event bus _stamp values are around 12:50 am. They seem to be moving in real time, so it will never catch up, if it is meant to. Oh, maybe it is UTC; we are +10 here and these timestamps are on a 10-hour offset. 99% of these messages are "accept" public key messages.
Thanks again for your time, Thomas.
Yes, that sounds like it is cycling multi-master connections; we fixed one of them here: https://github.com/saltstack/salt/pull/33142. Send me over your email and I will get sales in touch with you. You have been a great member of the community and I am sorry that a bug has caused you so much trouble!
Yes it does. I captured the event messages and then had a look. The same minions are authing over and over. Here's an example: http://paste.ubuntu.com/16354664 It looks like about every 50 seconds.
Thanks a lot @thatch45 Appreciate it. We have had a major issue each time we have upgraded (always confirming my nervousness around upgrading). But we've come through in the end; it keeps me busy :)
Will do.
@thatch45 Some minions are staying connected to master01 and have multiple masters listed in their config. They are also not in the Salt event log I captured (i.e. they are not re-authing continuously).
Any recommendations on the best things to compare between "good" and "bad" minions?
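For context, the multi-master part of the minion config here is just the standard list form; a rough sketch (hostnames are illustrative):

```yaml
# /etc/salt/minion (sketch; hostnames are illustrative)
master:
  - master01.example.com
  - master02.example.com
# master_type is left at its default ('str'), so the minion holds
# connections to all listed masters rather than failing over.
```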
That is odd, let me ask @DmitryKuzmenko, he is the one who has done most of the work on multi-master. @DmitryKuzmenko does any of this sound familiar? Minions switching back and forth between masters?
Thanks @thatch45 and @DmitryKuzmenko
@DmitryKuzmenko can confirm, but asymmetry between multiple masters was an issue that was definitely present in earlier versions of Salt and I'd be reasonably confident in recommending 2016.3 as a remedy for that behavior.
@cachedout Yes, I hit that in previous versions. There was a patch for 2014.7.5 that worked a treat. I have been talking with my boss and his boss, and understandably they aren't keen to run RC code in production and would rather roll back. In the interim, I am going to ssh loop through all minions and change them to single master.
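That change is just collapsing the master list down to a single entry, something like (hostname illustrative):

```yaml
# /etc/salt/minion after the change (sketch)
master: master01.example.com
```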
Just to recap, my original issue was exhausting all of the master's worker_threads. I have since turned off our highstate cron jobs and also applied the patch in https://github.com/saltstack/salt/pull/33021
Then the multi-master/re-auth issue arose, like I have experienced before. I am unsure of the relationship (if any) between these symptoms.
@DaveQB Understood. I was just commenting on that multi-master issue specifically. The concern about running RC code is, of course, completely valid. We're very close on a final release though, so hopefully that's at least good news. :]
@cachedout That would be good. Will this release allow a 2016.3 server talking to a 2015.8.8 minion? Some previous releases worked like that which allowed us a nice, gradual upgrade across the environment.
Thanks.
@DaveQB Yes, that will be the case.
@cachedout WOOHOO! Err hmmm sorry for my burst of excitement.
Just a little update on this. I started the process of updating to 2016.8 last week, but too late to save us from an outage on the weekend. I have posted here: https://groups.google.com/forum/#!topic/salt-users/dyb0qY-2Efw
To summarize: our single salt-master hit a load average of 111 (yes, you read that right, 111). The salt-master service finally collapsed. I am not sure why this suddenly happened, but turning off our scheduled state.apply brought the load right down.
I would have thought 200 minions shouldn't be a problem for a c3.xlarge salt-master https://aws.amazon.com/ec2/instance-types/
I am quickly proceeding with the upgrade.
@DaveQB That does surprise me! But from your comments it looks like pillar is a likely culprit. While we did add quite a few enhancements in 2016.3 to improve worker performance, we also added a pillar cache option, which makes the master not regenerate the pillar every time but cache it for a little while; the PR for it is here: https://github.com/saltstack/salt/pull/30686
Please forgive me if this has already been covered; unfortunately I have only had a moment to review your comments. If you need more help here, please let us know!
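If you want to try it, the cache is enabled with a couple of master config options; a rough sketch (the TTL value is just an example):

```yaml
# /etc/salt/master (sketch; the TTL value is just an example)
pillar_cache: True
pillar_cache_ttl: 3600      # seconds before the cached pillar is regenerated
pillar_cache_backend: disk  # 'disk' or 'memory'
```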
Thanks for the fast response @thatch45
It was the scheduled state.highstate pillar I removed that relieved the system resource pressure. I don't think it was specifically a pillar issue, as we had a similar problem when I was running highstate through a cron job, and not until I removed that did the salt masters recover.
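For reference, that scheduled highstate was set via pillar, roughly along these lines (the interval shown is just illustrative, not our exact value):

```yaml
# pillar sketch; the interval is an example
schedule:
  highstate:
    function: state.highstate
    minutes: 60
```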
Don't worry about this for now @thatch45 This is an older version now. I'll get the upgrade to 2016.3 done and then enable scheduled highstates and if I have a problem I'll be sure to yell :)
Thanks.
I am trying to set up a 2016.3.2 master for testing and it can't connect to a 2016.3.2 external minion, yet the minion can connect to the master with a salt-call. Should I start a new issue on GitHub or a thread on Google Groups? I have salt event info.
Thanks.
PS Oh, I have had the following error on this new master:
SaltClientError: Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased
The master has 1.6 GB of RAM and 4 minions trying to attach.
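For what it's worth, worker_threads is just a master config setting; a minimal sketch (the count shown is illustrative, not a recommendation):

```yaml
# /etc/salt/master (sketch; the count is illustrative)
worker_threads: 16
```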
Yes, let's start a new issue. Have you tried adding the pillar cache options to the master?
Running 2015.5.5, with a master and syndics in the topology. Based on my testing, if I submit more events than the number of MWorkers, I get the following error. I have already upped my MWorker count and tested with 128 threads. It looks like Salt is able to handle ~128 events published at once, but beyond that it gets the error below.
Submitting over ~300 publish events at once caused the following: