saltstack / salt

Software to automate the management and configuration of any infrastructure or application at scale. Get access to the Salt software package repository here:
https://repo.saltproject.io/
Apache License 2.0

exhausting all the worker threads on the master with publish #27421

Closed mimianddaniel closed 8 years ago

mimianddaniel commented 9 years ago

Running 2015.5.5, with a master and syndics in the topology. Based on my testing, if I submit more events than the number of MWorkers, I get the following error. I have already upped my MWorker count and tested at 128 threads. It looks like Salt is able to handle ~128 events published at once, but beyond that it hits the error below.

Submitting over ~300 publish events at once caused the following:

2015-09-25 21:51:34,553 [salt.payload                             ][INFO    ][16240] SaltReqTimeoutError: after 15 seconds. (Try 1 of 3)
2015-09-25 21:51:35,136 [salt.payload                             ][INFO    ][15983] SaltReqTimeoutError: after 15 seconds. (Try 1 of 3)
<snip>
Traceback (most recent call last):
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/master.py", line 1460, in run_func
    ret = getattr(self, func)(load)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/master.py", line 1391, in minion_pub
    return self.masterapi.minion_pub(clear_load)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/daemons/masterapi.py", line 874, in minion_pub
    ret['jid'] = self.local.cmd_async(**pub_load)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/client/__init__.py", line 333, in cmd_async
    **kwargs)
  File "/opt/blue-python/2.7/lib/python2.7/site-packages/salt/client/__init__.py", line 299, in run_job
    raise SaltClientError(general_exception)
SaltClientError: Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased
SEJeff commented 9 years ago

@grichmond-salt looks exactly like what we were seeing.

jfindlay commented 9 years ago

@mimianddaniel, @SEJeff, thanks for the report.

cachedout commented 9 years ago

@mimianddaniel or @SEJeff Could you please clarify what you mean by "at once"? Are you sending jobs via the client in a quickly-iterating loop or do you mean that you've targeted >= 300 minions in a single job? A simplified test case might help clarify.

mimianddaniel commented 9 years ago

Multiple clients (around 40), each calling publish multiple times using fork, so the number of calls is around 200 or more at times.

It's not a matter of target size AFAICT, but more to do with a lack of workers handling events on the ret pub. I am wondering whether the publish method is entirely blocking rather than non-blocking.

cachedout commented 9 years ago

I wonder if it's the module blacklisting and auth routines that are blocking.

cachedout commented 9 years ago

Here's a simple way to replicate this (mostly for my benefit).

import salt.config
import salt.client
import threading

WORKERS = 1000

def publish():
    client = salt.client.LocalClient()
    ret = client.pub('*', 'test.ping')
    print(ret)

for worker in range(1, WORKERS):
    proc = threading.Thread(target=publish)
    proc.start()

With a master config that has worker_threads: 500.

Then the script above will produce errors such as:

Exception in thread Thread-702:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "./many_pubs.py", line 10, in publish
    ret = client.pub('silver', 'test.ping')
  File "/home/mp/devel/salt/salt/client/__init__.py", line 1443, in pub
    'Salt request timed out. The master is not responding. '
SaltReqTimeoutError:

@mimianddaniel and @SEJeff does that jive with what you're seeing?

mimianddaniel commented 9 years ago

yep - if the number of events is not that far off from the number of threads configured, there is high latency from the master to the targeted minions. BTW, how long did it take for the master to start up with 500 workers?

cachedout commented 9 years ago

I should preface this by saying I've only confirmed the following on 2015.8, though I expect the general principle to apply in 2015.x.

What I think you're seeing here is a safety mechanism in the transport layer kicking in and really has nothing to do with the number of workers at all. (We should probably change that error message.)

I believe you're seeing these errors because the ZeroMQ socket buffer is congested and backing up the ZeroMQ sender socket. The timeout flag is used as a safety mechanism to ensure that in the event of full buffers, we'll return an exception back to the client informing it that resources were not available to send the message.

In my testing, setting very high timeout values illustrates this theory perfectly and eliminates those errors. However, doing so may present another (potentially more serious) problem. A master or client may quickly run out of memory as tens of thousands of buffered messages are enqueued in user-space. There's no free lunch. :]

So, we have a few options here.

First, we could implement HWMs [High Water Marks] on the REQ sockets (at least) in addition to raising the timeouts. This should cause the sockets to block when too many messages are sent at once. This isn't great for a number of reasons, but mostly because it feels (at least to me) like having to re-implement the whole queue management layer ourselves, although it would provide a memory exhaustion safeguard.
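
For reference, a minimal pyzmq sketch of what an HWM on the request path could look like; this is not Salt's actual transport code, and the socket type, endpoint, and the value of 500 are just assumptions for illustration:

import zmq

ctx = zmq.Context()

# ROUTER socket standing in for the master's request server.
router = ctx.socket(zmq.ROUTER)
# Bounded send/receive queues: once roughly this many messages are buffered,
# further sends block or drop (depending on socket type) instead of letting
# the user-space queue grow without limit.
router.setsockopt(zmq.SNDHWM, 500)
router.setsockopt(zmq.RCVHWM, 500)
router.bind('tcp://127.0.0.1:4506')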

Alternatively, you could slow down your sender, either across the board or in response to exceptions being returned. It's quite possible that some kernel tuning could influence things here, though in my testing, I could only really delay the buffer exhaustion and never eliminate it. YMMV.

I suppose we could look into putting some of this retry logic into LocalClient itself, maybe with a back-off. I'm not really convinced that's the right place for it, but there might be a good argument.

In reality, the proper approach is probably a little bit from each column. If there is a solid use case for raising the client timeout high enough that memory pressure becomes a concern, we can talk about slapping a HWM on the socket. Though, frankly, I think the easiest solution is just to slow down your client a little bit and retry failed jobs yourself.
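
As a rough sketch of that last suggestion (slow down and retry on the client side), something along these lines could wrap the pub() calls from the replication script above; pub_with_backoff is a hypothetical helper, and it assumes SaltClientError and SaltReqTimeoutError are importable from salt.exceptions:

import time

import salt.client
from salt.exceptions import SaltClientError, SaltReqTimeoutError

def pub_with_backoff(client, tgt, fun, retries=5, delay=1.0):
    # Retry a publish with exponential back-off when the master's REQ
    # channel is congested. Note the caveat below from @jacksontj: a
    # timeout does not always mean the master dropped the job, so a
    # retry can publish the same job twice.
    for attempt in range(retries):
        try:
            return client.pub(tgt, fun)
        except (SaltClientError, SaltReqTimeoutError):
            time.sleep(delay * (2 ** attempt))
    raise SaltClientError('publish still failing after {0} tries'.format(retries))

client = salt.client.LocalClient()
print(pub_with_backoff(client, '*', 'test.ping'))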

OK, I think that's it for now. Thoughts?

cc: @SEJeff @mimianddaniel @jacksontj @thatch45 @basepi @terminalmage

jacksontj commented 9 years ago

In my local testing it doesn't seem to really be ZeroMQ's fault (as much as I wish it was). The default HWM for sockets is 1k. If you go in and set them all to 1, there is a significant increase in errors, but that's expected as the queue is really small. If you set it to 0 (infinite) it doesn't solve all of your problems, but it makes things better. There are basically two issues that end up with the same exception.

(1) HWM is hit. If the REQ server on the master has no space in ZeroMQ, it will error on the client, which (for ZeroMQ reasons) shows up as a timeout. In this case, increasing the HWM may help, assuming it's just a spike in traffic and not that the master is unable to keep up with load. This one is relatively easy to fix: we simply need to make the HWM configurable (I'll make a PR, but I'm still ironing out exactly what the impact of each queue size is; ZeroMQ is "fun").

(2) The pub() message sits in the queue too long. If the REQ server has a very large queue and is inundated with requests, it is entirely possible that a message will sit in the queue for a while, which will exceed the timeout the client is willing to wait for, and the client will error out. One problem with this situation is that the client disconnects but the job stays in the queue for the master to process, so timeouts don't stop the job from executing (same issue we have over in auth land; it's just a ZeroMQ-ism, nothing to do about it :( ).

That covers the client side of the issue, but the larger question is "why is my simple task taking so long?". Well, in isolation the pub() jobs (locally) are relatively hard to make time out unless I increase the number of worker_threads to significantly more than the number of processors I have. This is because the worker_threads config option is a bit misleading: they are actually processes. When you create something like 500 processes on a machine with (presumably) fewer than a couple hundred cores, you end up context switching a lot, which isn't great.

That being said, there are other reasons that people have increased the number of threads so high, and why takes some explaining. The REQ side of the world is basically a process pool (think of the Apache httpd process pool). There is a shared queue of tasks (in this example, "publish this out"). Each worker will pull a single task, do it, and then move to the next one. This model works fine for CPU-bound tasks, since the process can't do anything else anyway. The problem occurs when a task is IO-bound on something relatively slow (disk, etc.); the most common offender is Salt's own fileserver (another large one is the job_cache). The good news is that this should be going away once we finish coroutine-ing up the master: we can enable the MWorkers to handle multiple requests in parallel within the same process to avoid this problem and better utilize the CPU (since we can work on CPU-bound work while waiting on IO).
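
To make the process-pool point concrete, here is a toy sketch (not Salt's MWorker code; the numbers are made up) of a shared task queue drained by a handful of worker processes, where one slow IO-bound task ties up a whole worker:

import multiprocessing
import time

def mworker(task_queue):
    # Each worker pulls one task at a time; while it waits on a slow,
    # IO-bound task (fileserver read, job_cache write), it can do nothing
    # else, so other queued publishes simply sit and wait.
    while True:
        task = task_queue.get()
        if task is None:
            break
        time.sleep(task)  # stand-in for a slow IO-bound request

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=mworker, args=(queue,))
               for _ in range(4)]  # "worker_threads: 4", really processes
    for proc in workers:
        proc.start()
    # 20 five-second tasks against 4 workers: the last ones wait in the
    # queue longer than the 15-second client timeout seen in the logs above.
    for _ in range(20):
        queue.put(5)
    for _ in workers:
        queue.put(None)
    for proc in workers:
        proc.join()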

@cachedout if we want to add back-off or anything to the LocalClient, keep in mind that not all timeouts mean the master won't execute the task.

As @cachedout mentioned, neither solution is complete. No matter how large the HWM is, if the MWorker doesn't pick the message up within the client's timeout, it doesn't matter. And conversely, if we pick them up as fast as possible but have too small an HWM for bursts, we'll also drop things.

So, for the original reporter of the issue-- @mimianddaniel I imagine you found this error on a box somewhere doing real work? If so, how many cores does your box have? And do you serve any large files?

BTW, for some more fun testing, I added an event to get the clients to fire closer to the same time:

import salt.config
import salt.client
import threading

WORKERS = 500
event = threading.Event()

def publish():
    client = salt.client.LocalClient()
    event.wait()
    ret = client.pub('*', 'test.ping')
    # don't print if it worked correctly...
    if len(ret.get('minions', [])) > 0:
        return
    print(ret)

for worker in range(0, WORKERS):
    proc = threading.Thread(target=publish)
    proc.start()

event.set()
plastikos commented 9 years ago

I've been working on deployments that have ~8k minions and have seen some things that are tangentially related to this - but maybe not identical. I have a patch in my queue that I haven't submitted that adds a configuration for ZMQ pub in SaltEvent as well as EventPublisher (may need to add one in AsyncEventPublisher - I haven't used the newer stuff). I have instrumented my libzmq so that I can monitor internal diagnostic information: Contexts, Sockets, Pipes, depth of pipes, how full pipes are, memory usage of messages in pipes, when HWM is triggered, etc.

I have some patches that I'm working on, but am not convinced I've completely characterized the problems I'm investigating:

Maybe I should start another ticket/discussion. FWIW.

plastikos commented 9 years ago

See #27606

SEJeff commented 9 years ago

@cachedout Yes, this looks exceptionally similar to what we were seeing in production in syndic clusters with ~1k - 1500 minions.

plastikos commented 9 years ago

It appears that ZMQ may not have the best HWM/LWM algorithm for high-inrush, high-volume sockets. This is a socket that may receive large batches of messages in a short period of time necessitating that many of the messages be queued in memory before they are serviced. As long as the influx isn't constant then the messages can be serviced and the socket emptied prior to the next inrush. For such a socket, we have the following:

The intention for the HWM/LWM gap is that it allows the pipe to empty (by reading) and provide some "head room" before the writer is re-enabled - we'll call this lagging between the thresholds hysteresis.

That is all quite appropriate. However there are more details:

Re-read that last bullet point and muse on it. It turns out that msgs_written in one pipe and msgs_read in the peer are simply counters - neither is a direct representation of how full the pipe is. The "fullness" can only be calculated correctly when the periodic notifications from the peer arrive, and this is only accurate for the HWM calculation - the LWM calculation is really just a representation of how granular the peer notifications are and has nothing to do with actually dropping below the intended LWM relative to the HWM from the compute_lwm() calculation.

Imagine the following scenario:

I hope I'm mistaken about something in those steps. If I'm correct then I don't think this behavior is what was intended by the ZMQ authors.

What that analysis means is that the larger the HWM gets, the more infrequent the updates used to calculate the actual level are, and the more inaccurate the level calculation is. Unfortunately this problem is due to the peers_msgs_read update being tied to the LWM calculation - these should be separated, and the LWM should be calculated directly rather than just being a chunk size dependent on how many messages have been read in the peer pipe.

I have some ZMQ patches for this. I'm doing tests, but some other issues are getting in the way of my testing. I'm rambling.

jacksontj commented 9 years ago

@plastikos I just looked over the code in libzmq, and it seems to do exactly what you described. From reading their comments, it seems that it's somewhat intended:

    //  Given the 3. it would be good to keep HWM and LWM as far apart as
    //  possible to reduce the thread switching overhead to almost zero,
    //  say HWM-LWM should be max_wm_delta.

So, I'm not sure I necessarily disagree with their calculation of the LWM (although I'd like an option to set it...), but their calculation of when writes can be re-enabled:

    if (lwm > 0 && msgs_read % lwm == 0)
        send_activate_write (peer, msgs_read);

Seems like a terrible idea, because, as you stated, this means writes won't be re-enabled until the consumer of the queue has read LWM messages - and for queues with large HWMs that's a LOT of messages :/

With all of that, it seems the only way to work around that behavior is to set the HWM to 0, as the LWM will also be set to 0 (0.5 -> int should be 0), which should stop the whole HWM/LWM mechanism - at the cost of unbounded memory consumption - so it seems that ZeroMQ has left us a lose-lose situation here :/
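
To put numbers on that, here is a small illustration of the behavior being described; lwm_estimate just takes "roughly half of the HWM" as described above, not libzmq's exact compute_lwm() formula, and the re-enable check mirrors the snippet quoted earlier:

def lwm_estimate(hwm):
    # Assumed approximation per the discussion above (an HWM of 0 giving
    # an LWM of 0); the real libzmq calculation may differ in detail.
    return hwm // 2

def writes_reenabled(msgs_read, lwm):
    # Mirrors the quoted check: activate_write is only sent back to the
    # writer when the reader has consumed a multiple of LWM messages.
    return lwm > 0 and msgs_read % lwm == 0

hwm = 10000
lwm = lwm_estimate(hwm)              # 5000
print(writes_reenabled(4999, lwm))   # False: writer still blocked
print(writes_reenabled(5000, lwm))   # True: only after ~5000 reads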

Well, regardless I think we should probably expose the HWM settings to users, although I'm still not quite sure which ones are helpful. I was thinking the REQ ones (router and dealer) should just have the same values set-- since they are basically the same thing. And a second option for the MWorker incoming socket HWM-- seem reasonable?

plastikos commented 9 years ago

I disagree with your interpretation of their comment:

//  Given the 3. it would be good to keep HWM and LWM as far apart as
//  possible to reduce the thread switching overhead to almost zero,
//  say HWM-LWM should be max_wm_delta.

I also disagree with what ". . . as far apart as possible . . ." means when the HWM is significantly bigger than the default 1,000.

Let me do some clean-up on my ZMQ patches. When ZMQ doesn't work the way it should it is just as deserving of a patch as Salt is. ;-)

mimianddaniel commented 8 years ago

@jacksontj: 24 threads, 6 cores x 2 CPUs. No large files. Speaking of IO, I moved to SSD and I did see some improvement in the amount of time it takes for events to reach the minions while doing regression testing. I would be interested in upping the HWM instead of MWorkers; I assume this needs to be done from source code ATM? cachedout mentioned this as well, and I do have concerns about a massive number of messages queuing up with a higher HWM and not returning a timeout, which may eventually flood the minions and kill (OOM) the servers (depending on what modules are being called and how much memory is required to run a given module on the client). I think 0 (infinity) is a bad idea for the HWM, as is a really large number.

@plastikos ETA for the patches? :) or in devel already? especially interested in zeromq stat monitoring!!!

SEJeff commented 8 years ago

@mimianddaniel look at #27606

mimianddaniel commented 8 years ago

@SEJeff my bad.. the earlier post by plastikos escaped me because I was looking at all the goodies posted :)

plastikos commented 8 years ago

I'm hoping to lob something over the fence this evening. I've got several irons in the fire. I'd like to test the ZMQ stuff in a few places before pushing them upstream to ZMQ.

There is already a "monitor" in ZMQ, but it is only lightly used in Salt. My diagnostic port stuff can send information from all of the various processes across all of the different hosts to a single collection point and graph according to memory usage or queue depth (fullness). Unfortunately some parts are hack-ish and some don't quite work so they need clean-up.

I'd really like to set ZMQ queue depths according to memory usage rather than message count.

Part of this is that I need to separate some of my R&D into distinct patches. I've been poking around and tweaking various things trying to understand what is really afoot. Some of the distinct patches should be quite clean and could be applied without waiting on any of my clean-up.

mimianddaniel commented 8 years ago

@plastikos zmq stat would be awesome for troubleshooting. Looking forward to testing it. LMK once it has been pushed

plastikos commented 8 years ago

See my libzmq:improvement-hwm_lwm_updates(*) branch for more correct ZMQ testing of the HWM and LWM. Unfortunately I have done very little testing since I extracted this from my ZMQ diagnostics patch. It's possible this ZMQ change and #27606 salt patch are the only ones you need to fix large minion number scaling.

(*) While my changes are against ZMQ master, they are trivial to back-port to ZMQ 4.x. If you need some assistance with this let me know.

plastikos commented 8 years ago

See #11865. My master with ~9k minions has memory leaks too. I'm currently testing patches to libzmq.

jacksontj commented 8 years ago

@plastikos for that leak (assuming it's the publisher one) I filed an upstream ticket (https://github.com/zeromq/libzmq/issues/954) where I think we figured out where the leak was. Patches will of course be appreciated - we've been dealing with that leak for a while.

plastikos commented 8 years ago

@jacksontj, from what I can tell about zeromq/libzmq#954 there was never a fix. Am I correct?

jacksontj commented 8 years ago

Correct.

cachedout commented 8 years ago

@plastikos What's the latest on your instrumented ZMQ? Would be great to play with this if you have something ready to go.

plastikos commented 8 years ago

My stack got pushed and I haven't quite popped back to this point. I'm working on a fix for incorrect aggregate return codes. I need to write a unit test for one PR and then a separate PR for the "fix". Hopefully I'll be back to this before the end of the week.

cachedout commented 8 years ago

@plastikos Super! Thanks.

cachedout commented 8 years ago

@mimianddaniel @SEJeff and @plastikos

I wanted to circle back around on this one. I have marked #27606 as a backport for the upcoming release of 2015.8 but I wanted to know if anybody has had a chance to try these changes in production?

mimianddaniel commented 8 years ago

I have not, mostly due to the performance issues I encountered while testing 2015.8, where a large number of minions didn't return, so I'm still running 2015.5.4.

But really looking forward to this!

DaveQB commented 8 years ago

I just wanted to add that since upgrading to 2015.8.8.2 I am seeing this more.

My error message is:

SaltClientError: Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased

This is using the Python salt.client to run client.cmd, mostly with a test.ping command.

thatch45 commented 8 years ago

@DaveQB we found what I think is the regression and this has solved it for us:

#33021

Can you apply that patch and let me know if it fixes it? I know the PR says syndic, but that's just how we found it; this should help masters a great deal too.

DaveQB commented 8 years ago

Hi @thatch45 Great! Sounds awesome.

I just applied that patch to my main master (does the minion.py patch need to be applied to all minions?)

My first test after applying the patch resulted in this:

salt 'prd-*' test.ping
Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased.

DaveQB commented 8 years ago

Additionally, we have noticed that since upgrading to 2015.8.8.2, the '/var/cache/salt/master/jobs/' tree uses up all 976,000 inodes available to it on its disk (we gave it a dedicated disk to try and solve this) within 80 minutes. So we have a cron job that deletes everything in that tree older than 40 minutes to keep the server up. This sounds intertwined with the issue we are commenting on.

Thanks.

thatch45 commented 8 years ago

@DaveQB change the master config setting keep_jobs to 1; Salt will then keep the job cache trimmed down to one hour. The patch only needs to be applied to the master, and the master restarted. The minion.py part of the patch is actually there to improve the performance of the syndic, so it does not seem to apply here.

DaveQB commented 8 years ago

Thanks @thatch45. I applied the patch to the master and, if anything, it seemed worse :(

I knew there was a setting for jobs, I just hadn't looked it up yet :) Either way, should Salt produce that many files? We didn't have this "running out of inodes" issue with 2014.7.5; it only started when we upgraded to 2015.8.8.2. I haven't reported it as a bug yet, focusing on getting Salt working (this issue) first.

Thanks again.

thatch45 commented 8 years ago

This makes me think that something in your setup is firing commands like crazy. Do you have a reactor running? If that patch made things worse, then the master is trying to work so hard that the change allowed it to get even more work done, and to max out the CPU even further.

DaveQB commented 8 years ago

Hi @thatch45 We run highstates on subsets of minions on an hourly basis using cronjobs. We've had it this way since the start (2012). I've been meaning to change them to the internal schedule system of salt.

I went ahead and applied the salt/minion.py part of the patch to all of my minions, but I have lost control of some of them it looks like. I am furiously investigating today. Thanks Thomas.

PS You might be right about the CPU issue:

[screenshot: Master CPU usage]

thatch45 commented 8 years ago

Yes, this looks a lot like runaway job calls: the maxed-out workers, the degraded performance with a patch that is made to get more out of the CPU. The minion patch should not make a difference on the minions; you don't need to bother with it.

Can you take a look at the event bus? You can filter based on new jobs; let me know if you think there are too many new jobs firing. Some external system you have set up might be in a tight loop too.

thatch45 commented 8 years ago

The docs on using the event bus watcher runner are here: https://docs.saltstack.com/en/latest/ref/runners/all/salt.runners.state.html#module-salt.runners.state
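
If the runner isn't convenient, a rough Python sketch along these lines can also watch for new jobs on the master's event bus; it assumes salt.utils.event's MasterEvent with the default master config and sock_dir, run on the master itself:

import salt.config
import salt.utils.event

opts = salt.config.master_config('/etc/salt/master')
event_bus = salt.utils.event.MasterEvent(opts['sock_dir'])

# A runaway reactor, cron job, or external script shows up here as a
# flood of 'salt/job/<jid>/new' events.
for event in event_bus.iter_events(tag='salt/job/', full=True):
    if event['tag'].endswith('/new'):
        print(event['tag'], event['data'].get('fun'), event['data'].get('tgt'))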

thatch45 commented 8 years ago

If you see a crazy number of new job calls, then that would explain all of this.

DaveQB commented 8 years ago

Thanks @thatch45 You've given me good info to work with here. I'll get stuck into it and report back today. Thanks again.

DaveQB commented 8 years ago

Ignore the below, I have run out of inodes again.


Now I have run into this:

[CRITICAL] Could not deserialize msgpack message: Some exception handling minion payload
This often happens when trying to read a file not in binary mode.
Please open an issue and include the following error: unpack(b) recieved extra data.
[ERROR   ] Uncaught exception, closing connection.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 407, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 815, in mark_future
    future.set_result(self.serial.loads(msg[0]))
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 117, in loads
    return msgpack.loads(msg, use_list=True)
  File "/usr/lib/python2.7/dist-packages/msgpack/fallback.py", line 104, in unpackb
    raise ExtraData(ret, unpacker._fb_get_extradata())
ExtraData: unpack(b) recieved extra data.
[ERROR   ] Uncaught exception, closing connection.

I was testing to see which minions I still have etc. Not sure if this is related to this ticket or not.

thatch45 commented 8 years ago

Yes, this happens when ZeroMQ starts dropping messages (since it chunks the messages outside of the size sent using the API) - another indicator that you have runaway jobs being fired. Did you get a look at the event bus and the reactor config?

DaveQB commented 8 years ago

Hi @thatch45. I have managed to get all (I hope) of my minions connected back (after an issue where an SSH loop to patch them went awry). And I have turned off all highstate cron jobs, so things have settled down now.

But now the master is getting "Minion did not return. [Not connected]" or "Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased." for the vast majority of minions.

I have tried running a salt-call on these minions (limiting the master: line to just the main master in their minion config or using the --master switch) and that works fine though.

Watching the event bus I just see mostly public keys flying past from the minions. A lot of these messages have a _stamp of last night, some 10 hours ago....

Reactor config? I haven't played with the reactor; is there a default config?

DaveQB commented 8 years ago

OK, I wrote that about an hour ago. I have been digging away at this concurrently with writing that message. What I think I am seeing is that if I set the minions' config to just our master01 server, they connect and happy days. If I add master02 and master03 to their minion config, they drop off all of the masters and I lose them. I need to test this theory with other minions. And this certainly doesn't apply to all minions; some are stable and not affected at all.

Should I move this to another (new) issue?

DaveQB commented 8 years ago

So I have 43 of my 176 minions connected to master01 and 22 connected to master02. This is reminding me of a bug I hit last time I did an upgrade (about a year ago) that saw minions all drop off when in a multi-master setup, but be ok when in a single master setup. I'll go dig up that issue...

DaveQB commented 8 years ago

Ref https://github.com/saltstack/salt/issues/17032 and here https://github.com/saltstack/salt/issues/21744

PS the salt event bus is now up to 12:37am last night (it's now 10:38am) with the _stamps of all of the auth messages.

thatch45 commented 8 years ago

That makes sense. This could be an issue with the multi-master setup causing the minions to reconnect over and over again, which would also overwhelm the workers. We just put in a patch that may fix this. Also, we have revamped multi-master a lot in 2016.3; it looks like we will have what should be the last RC of that tagged tomorrow.

DaveQB commented 8 years ago

@thatch45 oh wow. Great news!! What do you suggest I do? Roll back to 2014.7.5 or move to 2016.3? Our environment is entirely Ubuntu server 12.04

Thank you. PS: I am not sure why this took a while to show up. I upgraded our company slowly over a 4-week period: first masters and playground servers, then dev servers, then test, and then production. That wrapped up over a week ago.