viniarck closed this 5 months ago
@viniarck thanks for looking into this! One question please: why do we have limits on the buffers' qsize? Do you remember why we decided to change the maxsize from 0 to the default threadpool size?
Regarding the question on which NApp created those events, it was Topology, while loading the topology from the DB. For each interface created, at least one event is published. I'm not sure if removing that process from setup would be a good idea, because loading the topology is something critical and needed. What do you think?
> why do we have limits on the buffers' qsize? Do you remember why we decided to change the maxsize from 0 to the default threadpool size?
Yes. It boils down to having a responsive queuing system while not consuming too much memory (https://github.com/kytos-ng/kytos/pull/149; prior to it, the queues could consume as much memory as they wanted), even though we're pretty much doing queue -> fanout pubsub. In the worst case, when publishers don't or won't always rate limit on the publishing side, an unbounded queue can lead to either a) starvation or b) too much delay, where consumers can't keep up with the rate at which producers are sending events. Under normal circumstances on Kytos-ng, after all NApps have loaded, we shouldn't have too many queued messages for a long period of time. The starvation part, though, we've solved with `PriorityQueue`, but as you know we're only using it for southbound OF messages.
An unbounded queue will also consume as much memory as possible, and a timeout won't work with it, so a bounded queue at least provides the possibility of leveraging a timeout in certain circumstances, which we haven't implemented yet but will be addressed now. So, for instance, any NApp publishing events during `setup()` should be calling `put(timeout=X)`, and if it can't enqueue the event then we know that somewhere we have a misbehaving NApp or a threadpool and/or queue size that needs to be adjusted. That, coupled with the fact that if a `put(timeout=X)` isn't handled during a `setup()` it'll be categorized as an unhandled exception, which in turn will exit `kytosd` based on the latest changes, would clearly signal to network operators whether the threadpool or just the queue size needs to be increased.
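For illustration, here's a minimal sketch of that pattern, assuming stdlib `queue.Queue` semantics; the buffer and NApp names are hypothetical, not the actual kytos API (timeout support is exactly what's being added):

```python
import queue

# Minimal sketch (not the actual kytos API): the app buffer modeled as a
# bounded stdlib queue, sized to the thread pool as discussed above.
APP_BUFFER = queue.Queue(maxsize=512)


class SomeNApp:
    """Hypothetical NApp illustrating put(timeout=X) during setup()."""

    def setup(self):
        for event in self._initial_events():
            # Block for at most 5 seconds; if the buffer is still full,
            # queue.Full propagates as an unhandled exception, which (per the
            # latest changes discussed above) would make kytosd exit and
            # signal that the queue and/or thread pool size needs adjusting.
            APP_BUFFER.put(event, timeout=5)

    def _initial_events(self):
        return [f"kytos/some_napp.interface.created.{i}" for i in range(1000)]
```

The point is that a `queue.Full` left unhandled in `setup()` surfaces loudly instead of silently blocking `MainThread`.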
> I'm not sure if removing that process from setup would be a good idea, because loading the topology is something critical and needed. What do you think?
Definitely, the idea is to keep `setup()` with a similar behavior, except still being able to handle the case where it gets stuck trying to enqueue messages. Initially, I suggested running it in a `Future` backed by a `ThreadPool`, but still synchronously, to avoid conflicts with `MainThread` and the event loop that shares `MainThread`; but that on its own still needs to be coupled with a timeout when waiting for it. There's an opportunity to make it even simpler with just a timeout on `app.put(timeout=X)`, as mentioned in the prior paragraph. I've also confirmed that just a timeout on the future without `queue.put(timeout=x)` isn't sufficient either.
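Roughly, the `Future`-based idea looks like this (a sketch with hypothetical names, not kytos internals):

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Sketch of the idea above: run a NApp's setup() on a worker thread via a
# Future, but wait for it synchronously with a timeout, so MainThread (and
# the asyncio event loop sharing it) can't be blocked forever by a stuck
# enqueue.
setup_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="napp_setup")


def run_setup_with_timeout(napp, timeout=30):
    future = setup_executor.submit(napp.setup)
    try:
        # Waiting synchronously keeps the loading order; the timeout only
        # bounds how long we can get stuck here.
        future.result(timeout=timeout)
    except FutureTimeout:
        # As noted above, this alone isn't sufficient: the worker thread may
        # still be parked on queue.put(), so put(timeout=X) is also needed.
        raise RuntimeError(f"{napp!r} setup() exceeded {timeout}s, likely blocked on a full buffer")
    except queue.Full:
        # put(timeout=X) expired inside setup(): a misbehaving NApp or an
        # undersized queue and/or thread pool.
        raise
```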
Let me know if that answers your questions, @italovalcy, or if you have any other concerns or suggestions.
Side note: speaking of RAM usage, we still need to further profile the core and NApps; we mapped an issue (https://github.com/kytos-ng/kytos/issues/404) some time ago about the kytos container using a large amount of memory. In the future, let's try to prioritize that and, if possible, also check out some RAM usage charts with the container in prod.
Although, with a running task monitoring queue sizes based on how large the thread pools are, unbounded queues could theoretically work well too, and then, when needed, it would be a matter of adjusting the thread pool based on the current events workload (on a next `kytosd` restart). Back then, RAM usage could increase linearly because the problem was twofold: a) an unbounded queue and b) unbounded threads; an unbounded queue with a thread pool can still be decent (although a bounded queue with a thread pool is more conservative). I'll also try it out first and run some tests to confirm this hypothesis; if it works well with an unbounded queue and a thread pool, then there are even fewer variables to tweak, simplifying things for network operators too. Arguably, unbounded queues also work better with `PriorityQueue`. Either way, a misbehaving high-frequency publisher will require priorities to be set (in the future), but for the `app` queue we still have plenty of room with thread pool size alone, assuming a base case like what AmLight is using in prod, for instance.
@italovalcy, we can also assess whether it might be worth exposing queue/thread pool size usage data on a core endpoint. Although, I believe that with `log.warning` and the current Elasticsearch setup we already get great visibility, and as long as the task monitoring is reliable enough it should be OK. But with an endpoint it could get polled from Zabbix. It's something to keep discussing; let me know what you think and if it's a stat that you might need.
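As a rough idea of what such a monitoring task could look like (hypothetical names only; the real monitor is planned in https://github.com/kytos-ng/kytos/issues/439):

```python
import asyncio
import logging

log = logging.getLogger(__name__)


async def monitor_buffers(buffers, pool_sizes, interval=10, threshold=0.8):
    """Hypothetical periodic monitor: buffers maps name -> queue.Queue,
    pool_sizes maps name -> thread pool size for that buffer."""
    while True:
        for name, buf in buffers.items():
            qsize = buf.qsize()
            limit = pool_sizes.get(name, 0)
            if limit and qsize >= threshold * limit:
                # This log.warning is what would end up in Elasticsearch; the
                # same numbers could also be exposed on an endpoint for Zabbix.
                log.warning(
                    "buffer %s qsize %d is at %.0f%% of its thread pool size %d",
                    name, qsize, 100 * qsize / limit, limit,
                )
        await asyncio.sleep(interval)
```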
I've managed to stress test locally, publishing 1500 events per second into the `app` buffer queue. I explored the following cases to confirm in practice how the two variables, buffer queue size and thread pool size, performed in terms of CPU and RAM usage, as well as overall `kytosd` responsiveness and any other side effects:
a) bounded app queue size 512, app thread pool size 512:
b) unbounded app queue, app thread pool size 512:
c) bounded app queue size 2048, app thread pool size 2048:
d) unbounded app queue, app thread pool size 2048:
a) and b) both have the same `app` thread pool executor size, but with an `app` queue size of 512 and unbounded, respectively. Comparing the charts, the unbounded queue leads to a higher rate of memory usage over the first 20 secs, almost twice as much. Also, at the peak of resource usage, the unbounded queue resulted in a side effect impacting the event loop on `MainThread`, temporarily losing connectivity with switches, whereas with the bounded queue memory usage stayed lower and this side effect of losing connections didn't happen. The bounded queue contributed to more stability since it's a more conservative approach: publishers eventually get blocked while the thread pools are doing their job. Also, the tps reached in the threadpool was close to 500, which is very close to the expected rate.
c) and d) behaved the same as in the prior paragraph, except that CPU usage was more spread out and tps was higher, close to 1666 (a bit less than the total threads available, probably due to context switching overhead).
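For reference, here's a rough local approximation of this kind of stress test (illustrative parameters only, not the actual harness used for the numbers above):

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Rough approximation: one producer putting ~1500 events/sec into a bounded
# queue, a pool of worker threads draining it, and a counter to estimate the
# effective consumer tps.
QUEUE_SIZE = 512
POOL_SIZE = 512
EVENTS_PER_SEC = 1500
DURATION = 20  # seconds of intended production (back-pressure will stretch it)

buf = queue.Queue(maxsize=QUEUE_SIZE)
consumed = 0
lock = threading.Lock()


def consume():
    global consumed
    while True:
        event = buf.get()
        if event is None:
            return
        time.sleep(1.0)  # pretend each handler takes ~1s, like a slow NApp
        with lock:
            consumed += 1


def produce():
    for i in range(EVENTS_PER_SEC * DURATION):
        buf.put(i)  # blocks when the bounded queue is full (back-pressure)
        time.sleep(1.0 / EVENTS_PER_SEC)


with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    workers = [pool.submit(consume) for _ in range(POOL_SIZE)]
    start = time.time()
    produce()
    for _ in workers:
        buf.put(None)  # one sentinel per consumer to stop it

elapsed = time.time() - start
print(f"consumer tps ~= {consumed / elapsed:.0f}")
```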
Using an unbounded queue when you aren't always using a `PriorityQueue` (with all messages prioritized), and publishers aren't rate limited before putting to the queue, can lead to more instability, as has been observed in practice, including other side effects such as impacting `MainThread` and interfering with the `asyncio` event loop that's handling critical IO operations.
A bounded queue is essentially a sub-case of an unbounded queue, so as long as the value is large enough you still have room to accommodate more events while not drastically increasing RAM usage. So, a bounded queue with a decent and configurable queue size value provides the most flexibility while maximizing overall stability on the Kytos-ng platform out of the gate.
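That sub-case relationship is just stdlib `queue.Queue` semantics, where `maxsize=0` means unbounded and any positive `maxsize` adds back-pressure; a tiny illustration (values are arbitrary):

```python
import queue

# queue.Queue(maxsize=0) is unbounded; any maxsize > 0 adds back-pressure.
bounded = queue.Queue(maxsize=2048)    # illustrative, configurable value
unbounded = queue.Queue(maxsize=0)

unbounded.put_nowait("event")          # never raises queue.Full; RAM is the only limit
try:
    for i in range(3000):
        bounded.put_nowait(i)          # raises queue.Full once 2048 slots are used
except queue.Full:
    print(f"back-pressure kicked in after {bounded.qsize()} queued events")
```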
`setup()` will need to call with a timeout, `buffers.<queue>.put(timeout=x)`, and let the exception raise just so `kytosd` will exit accordingly; then you know out of the gate that you have NApps publishing too many events initially, or that even later on during runtime you need a larger queue size and/or a larger thread pool size. After NApps are loaded, it's not required to call with a timeout unless you can't afford to have your publisher blocked until a new slot in the queue is available.

- `meta` buffer for core higher-level events such as `{self.username}/{self.name}.loaded` (including future ones)
- `conn` buffer

Future (still needs to be refined):
Let me know if you guys have any questions or suggestions.
The queue(s) task monitor will be delivered on https://github.com/kytos-ng/kytos/issues/439
In summary:

- `MainThread` can get blocked when enqueuing events if the queue capacity is reached.
- `MainThread` is also the same thread that loads core components, and it's shared with the `asyncio` event loop, so it can end up blocking `kytosd` from bootstrapping correctly.

Potential solutions (while ensuring events don't get lost during startup):

- During `setup()`, publish to a `core|app_meta` buffer or something along these lines, and move a few events from the core such as `notify_loaded()`. This assumes that `MainThread` wouldn't get blocked by NApps.
- Have `setup()` initialization not run on `MainThread` but on a worker thread, while still synchronously waiting for it, just so the `asyncio` event loop also doesn't get blocked when the core consumers get started.

How to reproduce the issue

- `of_lldp` misbehaving, publishing 512+ events during its `setup()`; at this point `MainThread` was completely blocked:

cc'ing @italovalcy: the hypothesis we discussed was confirmed, it indeed sounds exactly like what you've described; let me know if you think it reflects what you've seen in prod. With APM, we should also be able to get some extra insights, hopefully seeing which NApp might end up over-publishing events during startup, or if anything else is blocking the `MainThread` with a lock.
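For anyone wanting to reproduce the blocking behavior described above outside of kytos, here's a minimal stand-alone simulation (hypothetical names, not actual kytos code):

```python
import queue
import threading

# Stand-alone simulation: a misbehaving setup() publishing 512+ events into a
# bounded app buffer before any consumer is running, which blocks the
# publishing thread (MainThread in kytosd) on the 513th put().
app_buffer = queue.Queue(maxsize=512)


def misbehaving_setup():
    for i in range(600):
        app_buffer.put(f"some_napp.event.{i}")  # blocks forever once the buffer is full


publisher = threading.Thread(target=misbehaving_setup, name="MainThread-stand-in", daemon=True)
publisher.start()
publisher.join(timeout=3)
print(f"still blocked: {publisher.is_alive()}, queued events: {app_buffer.qsize()}")
```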