Closed nyczol closed 4 years ago
@nyczol Please sign the Contributor License Agreement!
@nyczol Thank you for signing the Contributor License Agreement!
Are there any metrics/numbers you can share that demonstrate the effect?
The issue is nondeterministic and occurs on an OCS system that processes AAA requests (RADIUS/Diameter). Some results are published to RabbitMQ by the Erlang client (version 3.7.9). From time to time we can see that rabbit_writer process memory grows to many GB (the process mailbox contains millions of messages), and erlang:process_info(P, current_stacktrace) reports that the process is executing rabbit_writer:maybe_gc_large_msg/erlang:garbage_collect. The RabbitMQ server is not overloaded. On the client side the load is about 300 messages per second per rabbit_writer process.
We are going to increase the GC threshold a lot, or even disable GC entirely. Is it possible to add this PR to the 3.7 and 3.8 branches?
From time to time we can see that rabbit_writer process memory grows to many GB (the process mailbox contains millions of messages)
This suggests a network issue, or your client applications aren't keeping up.
Please use a tool like netstat (or similar) to check for high TCP Send-Q values during one of these events.
Please note that by default GC is executed roughly every ~1 MB written. For 1 kB messages at a load of 100 messages per second, GC is executed every 10 seconds. In general the frequency of GC execution depends on client load: at 1000 messages per second (1 kB messages) GC is executed every second (!). We must also take into account that the execution time of erlang:garbage_collect/0 is O(N) (or worse), where N is the process memory. Please also check the Erlang documentation for garbage_collect: "Forces an immediate garbage collection of the executing process. The function is not to be used unless it has been noticed (or there are good reasons to suspect) that the spontaneous garbage collection will occur too late or not at all. Warning! Improper use can seriously degrade system performance."

Now I would like to explain how this GC execution causes nondeterministic performance degradation in the Erlang client. Assume the average load on the client is N messages per second and it takes 500 ms to send all of them, while the average GC execution time is 100 ms. Under this load the client works properly. Now assume a peak of 2*N messages: processing the peak takes 1 second, but GC must also run, so not all messages can be processed and some stay in the rabbit_writer mailbox. The more messages in the mailbox, the more time is consumed by GC and the less time is left for real processing (writing messages to the socket). Once the mailbox reaches some critical size, the rabbit_writer process on the client side stays in an overloaded state forever (memory keeps growing). Increasing the GC threshold will help optimise client performance under very high load and will protect the rabbit_writer process against overload.
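The arithmetic above can be checked in an Erlang shell; this sketch assumes the ~1 MB default threshold mentioned in this thread:

```erlang
%% Seconds between forced GCs = threshold / (rate * message size).
Threshold = 1000000,                        % bytes written between forced GCs (~1 MB default)
MsgSize   = 1000,                           % 1 kB messages
GcEvery   = fun(Rate) -> Threshold / (Rate * MsgSize) end,
GcEvery(100),   %% 10.0 -> one forced GC every 10 seconds at 100 msg/s
GcEvery(1000).  %% 1.0  -> one forced GC every second at 1000 msg/s
```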
@nyczol we would really like to see some metrics that demonstrate the effect of this. As @lukebakken pointed out already, a large rabbit_writer heap is usually an indication that it cannot send protocol methods fast enough.
We won't be merging, let alone backporting, this to 3.8 and 3.7 without evidence.
These unfortunate GC calls had to be added because Erlang's binary heap collection is not the most predictable thing in the world and with really large messages, really large binaries stayed on the heap for a long time. This was introduced several years ago on a different version of OTP. Since then, max message size has been reduced from 2 GB to 128 MB (by default) in 3.8. So maybe it's less of an issue today on OTP 22.
Nonetheless, we need cold hard data on the effects of this change both for your workload and for a workload that publishes messages of 128 MB in size with a high enough rate.
I'm thinking this process could benefit from using process_flag(message_queue_data, off_heap) so that any messages in the queue aren't scanned during GC. For performance we could also consider forcing a minor GC instead of a major one (erlang:garbage_collect(self(), [{type, minor}])), which may be sufficient to collect transient binrefs.
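A minimal sketch of both suggestions combined (the function names are illustrative, not the actual rabbit_writer code):

```erlang
init_writer_process() ->
    %% Keep mailbox messages off the process heap so that a large
    %% backlog is not scanned on every (forced) garbage collection.
    process_flag(message_queue_data, off_heap),
    ok.

maybe_collect() ->
    %% A minor (generational) collection is usually enough to drop
    %% transient binary references and is much cheaper than a full sweep.
    erlang:garbage_collect(self(), [{type, minor}]).
```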
@michaelklishin we would really like to see some metrics that demonstrate the effect of this
What kind of metrics would you like to see exactly?
@lukebakken This suggests a network issue, or your client applications aren't keeping up.
Yes, exactly: from time to time, when a message peak occurs, my client app isn't keeping up. But most of the time it is able to send all messages without any problem. Please note that the GC issue we are discussing means that after a message peak the Erlang client is not able to recover without either a restart (then we lose some messages) or suspending the app (we stop sending messages, use another client, and wait many hours until the rabbit_writer mailbox is empty; this takes quite long due to GC execution on a process with a large heap).
@michaelklishin We won't be merging, leave alone backporting this to 3.8 and 3.7, without evidence.
I explained this issue in detail above: frequent GC execution makes the Erlang client unstable under high load. Do the other clients (e.g., Java) also execute GC like this? This PR is very simple and introduces minimal risk of regressions, so just let people set their own thresholds for GC, or disable it entirely if they really need to. We have performance tests today; I will post the results here ASAP.
@kjnilsson I'm thinking this process could benefit from using process_flag(message_queue_data, off_heap) so that any messages in the queue aren't scanned during GC. For performance we could also consider forcing a minor gc instead of a major (erlang:garbage_collect(self(), [{type, minor}])) which may be sufficient to collect transient binrefs.
Yes, I agree it can help, but a configurable GC threshold will also improve Erlang client stability and performance.
@nyczol the memory allocation and usage metrics which can be quite easily collected using our Prometheus plugin and its Erlang Memory Allocators collector/Grafana dashboard.
This PR is very simple and introduces minimal risk of regressions
Yeah, we see these all the time. Unfortunately some of the time they do introduce substantial regressions for some users and it's our team who are responsible for the software we ship and asked to do something about it, not the contributor. Sorry, we prefer not to guess about what we merge.
so just let people set their own thresholds for GC, or disable it entirely if they really need to
We will but we would like to understand the effects of different values and potentially change the default. Again, guessing is too risky and expensive.
We will run our own benchmarks with this and related PRs to see the effects of the current default and a few others (say, one and two orders of magnitude greater) on peak binary heap size. Perhaps you would want to do the same for your workload instead of going with your gut but who am I to suggest anything.
@nyczol By setting message_queue_data to off_heap you should be able to avoid the runaway GC scenario you outlined above. It will also make full-sweep GC faster. You can enable this flag already, without code changes, so I would suggest you give it a go in your test system.
One thing to note about the current code: when the limit is reached and GC is forced, the message that caused the limit to be reached won't be collected, as it is still referenced, which somewhat limits the benefit of the forced GC call. It may be better to self-send a message and act on that instead.
Our findings so far are positive. This is a clear improvement with small messages, which was expected. Large message workload is yet to be tested.
With the default bumped to 1 GB and small-ish 4 kB messages, I observe a remarkable 18 to 65% reduction in tail latencies and a low-teens increase in consumer-side throughput. Peak memory (carrier size) use is comparable. @dcorbacho confirmed meaningful improvements in her testing. This looks quite promising!
With large messages (50 MB) I observe a roughly 25% throughput increase, the same peak memory use, and a single-digit tail latency reduction but higher peak I/O throughput. My setup is likely network-link-constrained at this point.
Thanks for the merge. Are you going to add this improvement to the 3.7.x branch as well?
Backported to v3.8.x and v3.7.x.
FYI, I will change the config key to a more specific writer_gc_threshold and will add a new-style config schema mapping for it.
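For reference, a sketch of how the setting might be configured in the classic (advanced.config) format; the exact key location is an assumption, so check the final schema mapping for the authoritative form:

```erlang
%% advanced.config (sketch; assumes the key lives under the rabbit application)
[
  {rabbit, [
    %% Force a writer GC roughly every 1 GB written
    %% instead of the old ~1 MB default.
    {writer_gc_threshold, 1000000000}
  ]}
].
```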
This PR makes it possible to configure the GC threshold, i.e. how often GC will run (if at all).
According to the Erlang documentation: "Improper use can seriously degrade system performance."
This performance degradation actually occurs in our production environment.
Please also check: https://groups.google.com/forum/#!searchin/rabbitmq-users/nycz%7Csort:date/rabbitmq-users/hVlXjmG6suk/f43miaI-AQAJ