@grichmond-salt looks exactly like what we were seeing.
@mimianddaniel, @SEJeff, thanks for the report.
@mimianddaniel or @SEJeff Could you please clarify what you mean by "at once"? Are you sending jobs via the client in a quickly-iterating loop or do you mean that you've targeted >= 300 minions in a single job? A simplified test case might help clarify.
Multiple clients (around 40) calling publish multiple times via fork, so the number of calls is around 200 or more at times.
It's not a matter of target size afaict, but more to do with a lack of workers handling events in the ret pub. I am wondering whether the publish method is blocking rather than non-blocking.
I wonder if it's the module blacklisting and auth routines that are blocking.
Here's a simple way to replicate this (mostly for my benefit).
```python
import salt.config
import salt.client
import threading

WORKERS = 1000


def publish():
    client = salt.client.LocalClient()
    ret = client.pub('*', 'test.ping')
    print(ret)


for worker in range(1, WORKERS):
    proc = threading.Thread(target=publish)
    proc.start()
```
With a master config that has `worker_threads: 500`, the script above will produce errors such as:
```
Exception in thread Thread-702:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "./many_pubs.py", line 10, in publish
    ret = client.pub('silver', 'test.ping')
  File "/home/mp/devel/salt/salt/client/__init__.py", line 1443, in pub
    'Salt request timed out. The master is not responding. '
SaltReqTimeoutError:
```
@mimianddaniel and @SEJeff does that jive with what you're seeing?
Yep - if the number of events is not far off from the number of threads configured, there is high latency from the master to the targeted minions. Btw, how long did it take for the master to start up with 500 workers?
I should preface this by saying I've only confirmed the following on 2015.8, though I expect the general principle to apply in 2015.x.
What I think you're seeing here is a safety mechanism in the transport layer kicking in and really has nothing to do with the number of workers at all. (We should probably change that error message.)
I believe you're seeing these errors because the ZeroMQ socket buffer is congested and backing up the ZeroMQ sender socket. The timeout flag is used as a safety mechanism to ensure that in the event of full buffers, we'll return an exception back to the client informing it that resources were not available to send the message.
In my testing, setting a very high timeout value illustrates this theory perfectly and eliminates those errors. However, doing so may present another (potentially more serious) problem: a master or client may quickly run out of memory as tens of thousands of buffered messages are enqueued in user-space. There's no free lunch. :]
So, we have a few options here.
First, we could implement HWMs [High Water Marks] on the REQ sockets (at least) in addition to raising the timeouts. This should cause the sockets to block when too many messages are sent at once. This isn't great for a number of reasons, but mostly because it feels (at least to me) like having to re-implement the whole queue management layer ourselves, although it would provide a memory exhaustion safeguard.
Alternatively, you could slow down your sender, either across the board or in response to exceptions being returned. It's quite possible that some kernel tuning could influence things here, though in my testing, I could only really delay the buffer exhaustion and never eliminate it. YMMV.
I suppose we could look into putting some of this retry logic into LocalClient itself, maybe with a back-off. I'm not really convinced that's the right place for it, but there might be a good argument.
In reality, the proper approach is probably a little bit from each column. If there is a solid use case for raising the client timeout high enough that memory pressure becomes a concern, we can talk about slapping a HWM on the socket. Though, frankly, I think the easiest solution is just to slow down your client a little bit and retry failed jobs yourself.
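For anyone who wants to go the client-side route today, here is a minimal sketch of the "slow down and retry" approach, assuming the `SaltReqTimeoutError` shown in the traceback above; the retry count and delays are arbitrary:

```python
import time

import salt.client
from salt.exceptions import SaltReqTimeoutError


def pub_with_backoff(tgt, fun, retries=5, delay=0.5):
    '''
    Publish a job, backing off and retrying when the master's REQ
    channel is too congested to accept the request in time.
    '''
    client = salt.client.LocalClient()
    for attempt in range(retries):
        try:
            return client.pub(tgt, fun)
        except SaltReqTimeoutError:
            # Buffers are full (or the master is busy) -- wait and retry.
            time.sleep(delay)
            delay *= 2
    raise SaltReqTimeoutError('gave up after {0} attempts'.format(retries))


print(pub_with_backoff('*', 'test.ping'))
```

Keep in mind (as noted below) that a timeout does not necessarily mean the master dropped the job, so naive retries can publish the same job more than once.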
OK, I think that's it for now. Thoughts?
cc: @SEJeff @mimianddaniel @jacksontj @thatch45 @basepi @terminalmage
In my local testing it doesn't seem to really be ZeroMQ's fault (as much as I wish it was). The default HWM for sockets is 1k. If you go in and set them all to 1 there is a significant increase in errors -- but that's expected, since the queue is really small. If you set it to 0 (infinite) it doesn't solve all of your problems -- but it makes things better. There are basically two issues that end up with the same exception:
(1) HWM is hit. If the REQ server on the master has no space in zmq, it will error on the client, which (for reasons of zmq) shows up as a timeout. In this case, increasing the HWM may help -- assuming it's just a spike in traffic, not that the master is unable to keep up with load. This one is relatively easy to fix -- we simply need to make the HWM configurable (I'll make a PR -- but I'm still ironing out exactly what the impact of each queue size is -- zmq is "fun"). A rough pyzmq sketch of the relevant socket options follows after point (2) below.
(2) pub() message is in queue too long
If the req server has a very large queue and is inundated with requests, it is completely possible that a message will sit in the queue for a while -- long enough to exceed the timeout the client is willing to wait for, and the client will error out. One problem with this situation is that the client disconnects but the job stays in the queue for the master to process -- so timeouts don't stop the job from executing (same issue we have over in auth land -- it's just a zmq-ism, nothing to do about it :( ). That covers the client side of the issue, but the larger question is "why is my simple task taking so long!". Well, in isolation the pub() jobs (locally) are relatively hard to make time out unless I increase the number of worker_threads to significantly higher than the number of processors I have. This is because the config option `worker_threads` is a bit misleading: it's actually processes. And when you create something like 500 processes for a machine with (presumably) less than a couple hundred cores you end up context switching a lot -- which isn't great.

That being said, there are other reasons that people have increased the number of threads so high -- and why takes some explaining. The REQ side of the world is basically a process pool (think apache httpd's process pool). There is a shared queue of tasks (in this example, "publish this out"). Each worker will pull a single task, do it, and then move to the next one. This model works fine for CPU-bound tasks -- the process can't do anything else anyway. The problem occurs when a task is IO-bound on something relatively slow (disk, etc.) -- the most common of which is salt's own fileserver (another large offender is the job_cache). The good news is that this should be going away once we finish coroutine-ing up the master -- we can enable the MWorkers to handle multiple requests in parallel within the same process to avoid this problem and better utilize the CPU (since we can work on CPU-bound work while waiting on IO).
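Going back to point (1), here is roughly what "make HWM configurable" amounts to at the pyzmq level (illustrative socket roles, endpoint, and values only -- this is not the actual Salt transport code or the PR):

```python
import zmq

context = zmq.Context()

# Master side: a ROUTER-style socket such as the one the MWorkers sit behind.
router = context.socket(zmq.ROUTER)
router.setsockopt(zmq.RCVHWM, 5000)   # allow more queued inbound requests before dropping
router.setsockopt(zmq.SNDHWM, 5000)
router.bind('tcp://127.0.0.1:14506')  # hypothetical port, just for the sketch

# Client side: the REQ socket a publisher ultimately talks through.
req = context.socket(zmq.REQ)
req.setsockopt(zmq.SNDHWM, 5000)      # 0 would mean "unbounded" -- at the memory risk discussed above
req.connect('tcp://127.0.0.1:14506')
```

The HWM options only apply to connections made after they are set, which is why they are set before `bind()`/`connect()` here.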
@cachedout if we want to add backoff or anything to the localclient make sure to keep in mind that not all timeouts mean the master won't execute the task.
As @cachedout mentioned, neither solution is complete. No matter how large the HWM is, if the MWorker doesn't pick it up within the timeout of the client it doesn't matter. And conversely if we pick them up as fast as possible but have too small of a HWM for bursts we'll also drop things.
So, for the original reporter of the issue-- @mimianddaniel I imagine you found this error on a box somewhere doing real work? If so, how many cores does your box have? And do you serve any large files?
BTW, for some more fun testing -- I added an event to get the clients to fire closer to the same time:
```python
import salt.config
import salt.client
import threading

WORKERS = 500

event = threading.Event()


def publish():
    client = salt.client.LocalClient()
    event.wait()
    ret = client.pub('*', 'test.ping')
    # don't print if it worked correctly...
    if len(ret.get('minions', [])) > 0:
        return
    print(ret)


for worker in range(0, WORKERS):
    proc = threading.Thread(target=publish)
    proc.start()

event.set()
```
I've been working on deployments that have ~8k minions and have seen some things that are tangentially related to this - but maybe not identical. I have a patch in my queue that I haven't submitted that adds a configuration for ZMQ pub in SaltEvent as well as EventPublisher (may need to add one in AsyncEventPublisher - haven't used the newer stuff). I have instrumented my libzmq so that I can monitor internal diagnostic information: Contexts, Sockets, Pipes, depth of pipes, how full pipes are, memory usage of messages in pipes, when HWM is triggered, etc..
I have some patches that I'm working on, but I am not convinced I've completely characterized the problems I'm investigating.
Maybe I should start another ticket/discussion. FWIW.
See #27606
@cachedout Yes, this looks exceptionally similar to what we were seeing in production in syndic clusters with ~1k - 1500 minions.
It appears that ZMQ may not have the best `HWM`/`LWM` algorithm for high-inrush, high-volume sockets. This is a socket that may receive large batches of messages in a short period of time, necessitating that many of the messages be queued in memory before they are serviced. As long as the influx isn't constant, the messages can be serviced and the socket emptied prior to the next inrush. For such a socket, we have the following:

- The high water mark (`HWM`) is when the pipe is full and cannot queue more messages.
- Hitting the `HWM` causes message sending to either block or be dropped depending on pipe type (in Salt's case they are dropped).
- Hitting the `HWM` causes the writer into the queue to be disabled.
- The default `HWM` is 1,000.
- `msgs_written - peers_msgs_read` (the difference between how many messages have been written in and how many have been read out) is compared against the `HWM` to test if it is full.
- The low water mark (`LWM`) is:
  - `(HWM + 1)/2` for `HWM < 2 * max_wm_delta` (small queues - `compute_lwm()`)
  - `HWM - max_wm_delta` for anything else (large queues - `compute_lwm()`)
- `max_wm_delta` is hard-coded at compile time to 1024.
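To make those thresholds concrete, here is the LWM rule from the bullets above mirrored in Python (a sketch of the described logic, not the actual libzmq code):

```python
MAX_WM_DELTA = 1024  # hard-coded in libzmq at compile time


def compute_lwm(hwm):
    '''Mirror of the LWM calculation described above.'''
    if hwm < 2 * MAX_WM_DELTA:
        return (hwm + 1) // 2     # small queues
    return hwm - MAX_WM_DELTA     # large queues


print(compute_lwm(1000))    # default HWM -> LWM of 500
print(compute_lwm(20000))   # -> 18976, the gap used in the scenario below
print(compute_lwm(0))       # HWM of 0 ("unlimited") -> LWM of 0
```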
The intention for the HWM/LWM gap is that it allows the pipe to empty (by reading) and provide some "head room" before the writer is re-enabled - we'll call this lagging between the thresholds hysteresis.
That is all quite appropriate. However there are more details:
- `peers_msgs_read` for any given pipe is unknown because it depends on what `msgs_read` in the peer pipe is doing.
- `peers_msgs_read` is periodically sent to the pipe that needs it.
- `peers_msgs_read` is only sent when `msgs_read % LWM == 0`.

Re-read that last bullet point and muse on it. It turns out that `msgs_written` in one pipe and `msgs_read` in the peer are simply counters - neither is a direct representation of how full the pipe is. The "fullness" can only be calculated correctly when the periodic notifications from the peer arrive, and this is only accurate for the `HWM` calculation - the `LWM` calculation is really just a representation of how granular the peer notifications are and has nothing to do with actually dropping below the intended `LWM` (relative to the `HWM`) from the `compute_lwm()` calculation.
Imagine the following scenario:
`HWM == 20,000`
`LWM == 18,976`

1) An influx of 18,000 messages arrives and is queued in the in-bound pipe
2) `msgs_written = 18,000`; `peers_msgs_read = 0`
3) The peer pipe reads the messages
4) The peer pipe has `msgs_read = 18,000`
5) The in-bound pipe of #1 is effectively empty, but doesn't know it because the peer pipe has not reached `msgs_read % 18,976 == 0`, which would trigger the update notification.
6) Now another influx of 5,000 messages arrives in the in-bound pipe.
7) `msgs_written = 20,000`; `peers_msgs_read = 0`, the `HWM` is tripped, and 3,000 messages are dropped
8) Once the peer pipe reads 976 messages it will send an `activate_write`
9) The in-bound pipe receives the `activate_write`, `msgs_written = 20,000`; `peers_msgs_read = 18,976`, and new messages are no longer dropped.
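The arithmetic in those steps is easy to reproduce; the following sketch just tracks the counters from the scenario, assuming "fullness" is computed as `msgs_written - peers_msgs_read` as described above:

```python
HWM = 20000
LWM = 18976               # compute_lwm(20000) from the sketch above

msgs_written = 18000      # steps 1-2: first influx queued
peers_msgs_read = 0       # no notification from the peer yet
actual_read = 18000       # steps 3-4: the peer really has read everything

# Steps 6-7: a second influx of 5,000 messages arrives.
incoming = 5000
apparent_room = HWM - (msgs_written - peers_msgs_read)
accepted = min(incoming, apparent_room)
dropped = incoming - accepted
print(accepted, dropped)  # 2000 3000 -- dropped even though the pipe is really empty

# Step 8: the peer only notifies after msgs_read reaches a multiple of LWM.
reads_until_notify = LWM - (actual_read % LWM)
print(reads_until_notify)  # 976 more reads before activate_write is sent
```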
I hope I'm mistaken about something in those steps. If I'm correct then I don't think this behavior is what was intended by the ZMQ authors.
What that analysis means is that the larger `HWM` gets, the more infrequent the updates are for calculating the actual level, and the more inaccurate the level calculation is. Unfortunately this problem is due to the `peers_msgs_read` update being tied to the `LWM` calculation - these should be separated, and `LWM` should be calculated directly rather than just being a chunk size dependent on how many messages are read in the peer pipe.
I have some ZMQ patches for this. I'm doing tests, but some other problems are getting in the way of my testing. I'm rambling.
@plastikos I just looked over the code in libzmq, and it seems to do exactly what you described. From reading their comments, it seems that it's somewhat intended:
```
// Given the 3. it would be good to keep HWM and LWM as far apart as
// possible to reduce the thread switching overhead to almost zero,
// say HWM-LWM should be max_wm_delta.
```
So, I'm not sure I necessarily disagree with their calculation of LWM (although I'd like an option to set it...), but their calculation of when you can re-enable writes:
```
if (lwm > 0 && msgs_read % lwm == 0)
    send_activate_write (peer, msgs_read);
```
Seems like a terrible idea... because, as you stated, this means you won't re-enable writes until the consumer of the queue has read LWM messages -- and for queues with large HWMs that's a LOT of messages :/
With all of that, it seems the only way to work around that behavior is to set HWM to 0, as the LWM will also be set to 0 (0.5 -> int should be 0), which should stop the whole HWM/LWM thing -- at the cost of unbounded memory consumption -- so it seems that zmq has left us in a lose-lose situation here :/
Well, regardless I think we should probably expose the HWM settings to users, although I'm still not quite sure which ones are helpful. I was thinking the REQ ones (router and dealer) should just have the same values set-- since they are basically the same thing. And a second option for the MWorker incoming socket HWM-- seem reasonable?
I disagree with your interpretation of their comment:
```
// Given the 3. it would be good to keep HWM and LWM as far apart as
// possible to reduce the thread switching overhead to almost zero,
// say HWM-LWM should be max_wm_delta.
```
I also disagree with what ". . . as far apart as possible . . ." means when the HWM is significantly bigger than the default 1,000.
Let me do some clean-up on my ZMQ patches. When ZMQ doesn't work the way it should it is just as deserving of a patch as Salt is. ;-)
@jacksontj: 24 threads, 6 cores x 2 CPUs. No large files. Speaking of IO, I moved to SSD and I did see some improvement in the amount of time it takes for events to reach the minions while doing regression testing. I would be interested in upping the HWM instead of mworkers -- I assume this needs to be done from source code ATM? cachedout mentioned this as well, and I do have concerns about a massive number of messages being queued up within the HWM (instead of returning a timeout), which may eventually flood the minions and kill (OOM) the servers, depending on what modules are being called and how much memory is required to run a given module on the client. I think 0 (infinity) is a bad idea for the HWM, or even a really large number.
@plastikos ETA for the patches? :) or in devel already? especially interested in zeromq stat monitoring!!!
@mimianddaniel look at #27606
@SEJeff my bad... the earlier post by plastikos escaped me due to looking at all the goodies posted :)
I'm hoping to lob something over the fence this evening. I've got several irons in the fire. I'd like to test the ZMQ stuff in a few places before pushing them upstream to ZMQ.
There is already a "monitor" in ZMQ, but it is only lightly used in Salt. My diagnostic port stuff can send information from all of the various processes across all of the different hosts to a single collection point and graph according to memory usage or queue depth (fullness). Unfortunately some parts are hack-ish and some don't quite work so they need clean-up.
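For reference, that built-in monitor is reachable from stock pyzmq without any patches; a rough sketch (hypothetical endpoint, nothing to do with the diagnostic-port instrumentation):

```python
import zmq
from zmq.utils.monitor import recv_monitor_message

context = zmq.Context()
sock = context.socket(zmq.REQ)

# Ask libzmq to publish lifecycle events for this socket on an inproc PAIR socket.
monitor = sock.get_monitor_socket(zmq.EVENT_ALL)
sock.connect('tcp://127.0.0.1:4506')  # e.g. a master's ret port

while monitor.poll(1000):
    evt = recv_monitor_message(monitor)
    print(evt['event'], evt['endpoint'])
    if evt['event'] == zmq.EVENT_MONITOR_STOPPED:
        break
```

The catch is that it only reports connection lifecycle events (connects, disconnects, retries), not queue depth or memory usage, which is exactly the information the extra instrumentation is after.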
I'd really like to set ZMQ queue depths according to memory usage rather than message count.
Part of this is that I need to separate some of my R&D into distinct patches. I've been poking around and tweaking various things trying to understand what is really afoot. Some of the distinct patches should be quite clean and could be applied without waiting on any of my clean-up.
@plastikos zmq stat would be awesome for troubleshooting. Looking forward to testing it. LMK once it has been pushed
See my `libzmq:improvement-hwm_lwm_updates` (*) branch for more correct ZMQ handling of the HWM and LWM. Unfortunately I have done very little testing since I extracted this from my ZMQ diagnostics patch. It's possible this ZMQ change and the #27606 salt patch are the only ones you need to fix large minion number scaling.
(*) While my changes are against ZMQ master, they are trivial to back-port to ZMQ 4.x. If you need some assistance with this let me know.
See #11865. My master with ~9k minions has memory leaks too. I'm currently testing patches to libzmq.
@plastikos for that leak (assuming it's the publisher one) I filed an upstream ticket (https://github.com/zeromq/libzmq/issues/954) where I think we figured out where the leak was. Patches will of course be appreciated -- we've been dealing with that leak for a while.
@jacksontj, from what I can tell about zeromq/libzmq#954 there was never a fix. Am I correct?
Correct.
@plastikos What's the latest on your instrumented ZMQ? Would be great to play with this if you have something ready to go.
My stack got pushed and I haven't quite popped back to this point. I'm working on a fix for incorrect aggregate return codes. I need to write a unit test for one PR and then a separate PR for the "fix". Hopefully I'll be back to this before the end of the week.
@plastikos Super! Thanks.
@mimianddaniel @SEJeff and @plastikos
I wanted to circle back around on this one. I have marked #27606 as a backport for the upcoming release of 2015.8 but I wanted to know if anybody has had a chance to try these changes in production?
I have not, mostly due to the performance issues I encountered where a large number of minions didn't return while testing 2015.8, so I'm still running 2015.5.4.
But really looking forward to this!
I just wanted to add, since upgrading to 2015.8.8.2 I am seeing this more.
My error message is:
SaltClientError: Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased
This is using the Python salt.client to run client.cmd, mostly with a test.ping command.
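For reference, the call pattern in question is roughly the following; the longer timeout and exception handling are just a workaround sketch (the 60-second value is arbitrary):

```python
import salt.client
from salt.exceptions import SaltClientError

client = salt.client.LocalClient()
try:
    # Same call as before, just giving the congested master more time to answer.
    ret = client.cmd('prd-*', 'test.ping', timeout=60)
    print(ret)
except SaltClientError as exc:
    print('master did not accept the job: {0}'.format(exc))
```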
@DaveQB we found what I think is the regression and this has solved it for us:
Can you apply that patch and let me know if it fixes it? I know the PR says syndic, but that's just how we found it; this should help masters a great deal too.
Hi @thatch45 Great! Sounds awesome.
I just applied that patch to my main master (does the minion.py patch need to be applied to all minions?)
My first test after applying the patch resulted in this:
```
salt 'prd-*' test.ping
Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased.
```
Additionally, we have noticed that since upgrading to 2015.8.8.2, the '/var/cache/salt/master/jobs/' tree uses up all 976,000 inodes available to it on its disk (we gave it a dedicated disk to try to solve this) within 80 minutes. So we have a cron job to delete everything in that tree older than 40 minutes to keep the server up. This sounds intertwined with the issue we are commenting on.
Thanks.
@DaveQB change the master config setting `keep_jobs` to `1`, then salt will keep it clean down to one hour.
That patch only needs to be applied to the master, and the master restarted. The minion.py patch is actually a patch to improve the performance of the syndic, so it does not seem to apply here.
Thanks @thatch45. I applied the patch to the master and, if anything, it seemed worse :(
I knew there was a setting for jobs, I hadn't looked it up yet :) Either way, should salt produce that many files? We didn't have this "running out of inodes" issue with 2014.7.5; it only started when we upgraded to 2015.8.8.2. I haven't reported it as a bug yet, as I'm focusing on getting salt working (this issue) first.
Thanks again.
This makes me think that something in your setup is firing commands like crazy. Do you have a reactor running? If that patch made things worse, then the master is trying to work so hard that the change allowed it to get even more work done and max out the CPU even further.
Hi @thatch45 We run highstates on subsets of minions on an hourly basis using cronjobs. We've had it this way since the start (2012). I've been meaning to change them to the internal schedule system of salt.
I went ahead and applied the salt/minion.py part of the patch to all of my minions, but it looks like I have lost control of some of them. I am furiously investigating today. Thanks Thomas.
PS You might be right about the CPU issue:
Yes, this looks a lot like we have runaway job calls: the maxed-out workers, the degraded performance with a patch that is made to get more out of the CPU. The minion patch should not make a difference on the minions; you don't need to bother with it.
Can you take a look at the event bus? You can filter based on new jobs and let me know if you think that there are too many new jobs firing. Some external system you have set up might be on a tight loop too.
https://docs.saltstack.com/en/latest/ref/runners/all/salt.runners.state.html#module-salt.runners.state are the docs on using the event bus watcher runner
If you see a crazy number of new job calls, then that would explain all of this.
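If a quick script is easier than the runner, the same information can be pulled off the event bus from Python; a sketch assuming the `salt.utils.event` API of this era (new publishes arrive with tags like `salt/job/<jid>/new`):

```python
import salt.config
import salt.utils.event

opts = salt.config.client_config('/etc/salt/master')
event_bus = salt.utils.event.get_event(
    'master',
    sock_dir=opts['sock_dir'],
    transport=opts['transport'],
    opts=opts)

new_jobs = 0
while True:
    data = event_bus.get_event(full=True, wait=5)
    if data is None:
        continue
    # New publishes show up on the bus with tags like salt/job/<jid>/new
    if data['tag'].endswith('/new'):
        new_jobs += 1
        print(new_jobs, data['tag'], data['data'].get('fun'))
```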
Thanks @thatch45 You've given me good info to work with here. I'll get stuck into it and report back today. Thanks again.
Ignore the below, I have run out of inodes again.
Now I have run into this:
```
[CRITICAL] Could not deserialize msgpack message: Some exception handling minion payload
This often happens when trying to read a file not in binary mode.
Please open an issue and include the following error: unpack(b) recieved extra data.
[ERROR   ] Uncaught exception, closing connection.
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 407, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/transport/zeromq.py", line 815, in mark_future
    future.set_result(self.serial.loads(msg[0]))
  File "/usr/lib/python2.7/dist-packages/salt/payload.py", line 117, in loads
    return msgpack.loads(msg, use_list=True)
  File "/usr/lib/python2.7/dist-packages/msgpack/fallback.py", line 104, in unpackb
    raise ExtraData(ret, unpacker._fb_get_extradata())
ExtraData: unpack(b) recieved extra data.
[ERROR   ] Uncaught exception, closing connection.
```
I was testing to see which minions I still have etc. Not sure if this is related to this ticket or not.
Yes, this happens when ZeroMQ starts dropping messages (since it chunks the messages outside of the size sent using the API) -- another indicator that you have runaway jobs being fired. Did you get a look at the event bus and the reactor config?
Hi @thatch45 I have managed to get all (I hope) of my minions back connected (after an issue where an ssh loop to patch them went awry). And I have turned off all highstate cronjobs, so things have settled down now.
But now the master is getting "Minion did not return. [Not connected]" or "Salt request timed out. The master is not responding. If this error persists after verifying the master is up, worker_threads may need to be increased." for the vast majority of minions.
I have tried running a salt-call on these minions (limiting the master: line to just the main master in their minion config or using the --master switch) and that works fine though.
Watching the event bus I just see mostly public keys flying past from the minions. A lot of these messages have a _stamp of last night, some 10 hours ago....
Reactor config? I haven't played with reactor, is there a default config?
Ok, I wrote that about an hour ago; I have been digging away at this concurrently with writing that message. What I think I am seeing is that if I set the minions' config to just our master01 server, they connect and happy days. If I add master02 and master03 to their minion config, they drop off all of the masters and I lose them. I need to test this theory with other minions. And this certainly doesn't apply to all minions: some are stable and not affected at all.
Should I move this to another (new) issue?
So I have 43 of my 176 minions connected to master01 and 22 connected to master02. This is reminding me of a bug I hit last time I did an upgrade (about a year ago) that saw minions all drop off when in a multi-master setup, but be ok when in a single master setup. I'll go dig up that issue...
Ref https://github.com/saltstack/salt/issues/17032 and here https://github.com/saltstack/salt/issues/21744
PS the salt event bus is now up to 12:37am last night (it's now 10:38am) with the _stamps of all of the auth messages.
That makes sense; this could be an issue with the multi-master setup causing the minions to reconnect over and over again, which would also overwhelm the workers. We just put in a patch that may fix this. Also, we have revamped multi-master a lot in 2016.3; it looks like we will have what should be the last RC of that tagged tomorrow.
@thatch45 oh wow. Great news!! What do you suggest I do? Roll back to 2014.7.5 or move to 2016.3? Our environment is entirely Ubuntu server 12.04
Thank you. PS I am not sure why this took a while to show up. I upgraded our company slowly over a 4-week period: first masters and playground servers, then dev servers, then test, and then production. That wrapped up over a week ago.
Running 2015.5.5, with master and syndics in the topology. Based on my testing, if I submit more events than the number of mworkers, I get the following error. I have already upped my mworker numbers and tested at 128 threads. It looks like salt is able to handle ~128 events published at once, but beyond that it gets the error below.
Submitting over ~300 publish events at once caused the following: