Closed nyczol closed 4 years ago
@nyczol Please sign the Contributor License Agreement!
@nyczol Thank you for signing the Contributor License Agreement!
Are there any metrics/numbers you can share that demonstrate the effect?
The issue is nondeterministic and occurs on an OCS system that processes AAA requests (RADIUS/Diameter). Some results are published to RabbitMQ by the Erlang client (version 3.7.9). From time to time we can see that rabbit_writer process memory grows to many GB (the process mailbox contains millions of messages), and erlang:process_info(P, current_stacktrace) reports that the process is executing rabbit_writer:maybe_gc_large_msg/erlang:garbage_collect. The RabbitMQ server is not overloaded. On the client side the load is about 300 messages per second per rabbit_writer process.
We are going to increase the GC threshold a lot, or even disable GC entirely. Is it possible to add this PR to the 3.7 and 3.8 branches?
From time to time we can see that rabbit_writer process memory grows to many GB (the process mailbox contains millions of messages)
This suggests a network issue, or your client applications aren't keeping up.
Please use a tool like netstat (or similar) to check for high TCP Send-Q values during one of these events.
Please note that by default GC is executed roughly every ~1 MB written. For 1 kB messages at a load of 100 messages per second, GC is executed every 10 seconds. In general the frequency of GC execution depends on client load: at 1000 messages per second (1 kB messages) GC is executed every second (!). We must also take into account that the execution time of erlang:garbage_collect/0 is O(N) (or worse), where N is the process memory. Please also check the Erlang documentation for garbage_collect: "Forces an immediate garbage collection of the executing process. The function is not to be used unless it has been noticed (or there are good reasons to suspect) that the spontaneous garbage collection will occur too late or not at all. Warning! Improper use can seriously degrade system performance."

Now I would like to explain how this GC execution causes nondeterministic performance degradation in the Erlang client. Assume the average load on the client is N messages per second and it takes 500 ms to send all of them, while the average GC execution time is 100 ms. Under this load the client works properly. Now assume a peak of 2*N messages: processing the peak takes 1 second, but GC must also run, so not all messages can be processed and some stay in the rabbit_writer mailbox. The more messages in the mailbox, the more time is consumed by GC and the less time is left for real processing (writing messages to the socket). Once the mailbox reaches some critical size, the rabbit_writer process on the client side stays in an overloaded state forever (memory keeps growing). Increasing the GC threshold will help optimise client performance under very high load and will protect the rabbit_writer process against overload.
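The arithmetic above can be checked in an Erlang shell; this sketch assumes the ~1 MB default threshold mentioned in this thread:

```erlang
%% Seconds between forced GCs = threshold / (rate * message size).
Threshold = 1000000,                        % bytes written between forced GCs (~1 MB default)
MsgSize   = 1000,                           % 1 kB messages
GcEvery   = fun(Rate) -> Threshold / (Rate * MsgSize) end,
GcEvery(100),   %% 10.0 -> one forced GC every 10 seconds at 100 msg/s
GcEvery(1000).  %% 1.0  -> one forced GC every second at 1000 msg/s
```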
@nyczol we would really like to see some metrics that demonstrate the effect of this. As @lukebakken pointed out already, a large rabbit_writer heap is usually an indication that it cannot send protocol methods fast enough.
We won't be merging, let alone backporting, this to 3.8 and 3.7 without evidence.
These unfortunate GC calls had to be added because Erlang's binary heap collection is not the most predictable thing in the world and with really large messages, really large binaries stayed on the heap for a long time. This was introduced several years ago on a different version of OTP. Since then, max message size has been reduced from 2 GB to 128 MB (by default) in 3.8. So maybe it's less of an issue today on OTP 22.
Nonetheless, we need cold hard data on the effects of this change both for your workload and for a workload that publishes messages of 128 MB in size with a high enough rate.
I'm thinking this process could benefit from using process_flag(message_queue_data, off_heap) so that any messages in the queue aren't scanned during GC. For performance we could also consider forcing a minor GC instead of a major one (erlang:garbage_collect(self(), [{type, minor}])), which may be sufficient to collect transient binrefs.
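A minimal sketch of both suggestions combined (the function names are illustrative, not the actual rabbit_writer code):

```erlang
init_writer_process() ->
    %% Keep mailbox messages off the process heap so that a large
    %% backlog is not scanned on every (forced) garbage collection.
    process_flag(message_queue_data, off_heap),
    ok.

maybe_collect() ->
    %% A minor (generational) collection is usually enough to drop
    %% transient binary references and is much cheaper than a full sweep.
    erlang:garbage_collect(self(), [{type, minor}]).
```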
@michaelklishin we would really like to see some metrics that demonstrate the effect of this
What kind of metrics would you like to see exactly?
@lukebakken This suggests a network issue, or your client applications aren't keeping up.
Yes, exactly: from time to time, when a message peak occurs, my client app isn't keeping up. But most of the time it is able to send all messages without any problem. Please note that the GC issue we are discussing means that after a message peak the Erlang client is not able to recover without either a restart (then we lose some messages) or suspending the app (we stop sending messages, use another client, and wait many hours until the rabbit_writer mailbox is empty; this takes quite long due to GC execution on a process with a large heap).
@michaelklishin We won't be merging, leave alone backporting this to 3.8 and 3.7, without evidence.
I explained this issue in detail above: frequent GC execution makes the Erlang client unstable under high load. Do the other clients (e.g., Java) also execute GC like this? This PR is very simple and introduces minimal risk of regressions, so just let people set their own thresholds for GC, or disable it entirely if they really need to. We have performance tests today; I will post the results here ASAP.
@kjnilsson I'm thinking this process could benefit from using process_flag(message_queue_data, off_heap) so that any messages in the queue aren't scanned during GC. For performance we could also consider forcing a minor gc instead of a major (erlang:garbage_collect(self(), [{type, minor}])) which may be sufficient to collect transient binrefs.
Yes, I agree it can help, but a configurable GC threshold will also improve Erlang client stability and performance.
@nyczol the memory allocation and usage metrics which can be quite easily collected using our Prometheus plugin and its Erlang Memory Allocators collector/Grafana dashboard.
This PR is very simple and introduces minimal risk of regressions
Yeah, we see these all the time. Unfortunately some of the time they do introduce substantial regressions for some users and it's our team who are responsible for the software we ship and asked to do something about it, not the contributor. Sorry, we prefer not to guess about what we merge.
so just let people set their own thresholds for GC, or disable it entirely if they really need to
We will but we would like to understand the effects of different values and potentially change the default. Again, guessing is too risky and expensive.
We will run our own benchmarks with this and related PRs to see the effects of the current default and a few others (say, one and two orders of magnitude greater) on peak binary heap size. Perhaps you would want to do the same for your workload instead of going with your gut but who am I to suggest anything.
@nyczol By setting message_queue_data to off_heap you should be able to avoid the runaway GC scenario you outlined above. It will also make full-sweep GC faster. You can enable this flag already, without code changes, so I would suggest you give it a go in your test system.
One thing to note about the current code: when the limit is reached and GC is forced, the message that caused the limit to be reached won't be collected, as it is still referenced, which somewhat limits the benefit of the forced GC call. It may be better to self-send a message and act on that instead.
Our findings so far are positive. This is a clear improvement with small messages, which was expected. Large message workload is yet to be tested.
With the default bumped to 1 GB and small-ish 4 kB messages, I observe a remarkable 18 to 65% reduction in tail latencies and a low-teens increase in consumer-side throughput. Peak memory (carrier size) use is comparable. @dcorbacho confirmed meaningful improvements in her testing. This looks quite promising!
With large messages (50 MB) I observe a roughly 25% throughput increase, the same peak memory use, and a single-digit tail latency reduction but higher peak I/O throughput. My setup is likely network-link-constrained at this point.
Thanks for the merge. Are you going to add this improvement to the 3.7.x branch as well?
Backported to v3.8.x and v3.7.x.
FYI, I will change the config key to a more specific writer_gc_threshold and will add a new-style config schema mapping for it.
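For reference, a sketch of how the setting might be configured in the classic (advanced.config) format; the exact key location is an assumption, so check the final schema mapping for the authoritative form:

```erlang
%% advanced.config (sketch; assumes the key lives under the rabbit application)
[
  {rabbit, [
    %% Force a writer GC roughly every 1 GB written
    %% instead of the old ~1 MB default.
    {writer_gc_threshold, 1000000000}
  ]}
].
```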
This PR makes it possible to configure the GC threshold, i.e. how often GC will run (if at all).
According to the Erlang documentation: "Improper use can seriously degrade system performance."
This performance degradation actually occurs in our production environment.
Please also check: https://groups.google.com/forum/#!searchin/rabbitmq-users/nycz%7Csort:date/rabbitmq-users/hVlXjmG6suk/f43miaI-AQAJ