rabbitmq / discussions

Please use RabbitMQ mailing list for questions. Issues that are questions, discussions or lack details necessary to investigate them are moved to this repository.

RabbitMQ 3.8.5: high CPU utilization due to busy waiting of scheduler threads #151

Closed gopivalleru closed 4 years ago

gopivalleru commented 4 years ago

I've upgraded from RabbitMQ 3.4.4 on Erlang R16B03 (erts-5.10.4) to RabbitMQ 3.8.5 on Erlang/OTP 23 (erts-11.0.3). When I ran rabbitmq-perf-test (https://rabbitmq.github.io/rabbitmq-perf-test/stable/htmlsingle) against 3.8.5, it handled thousands of messages per second with 500 producers and 500 consumers. In that test I saw CPU utilization of around 500%-600% on an 8-CPU system, whereas it stayed below one CPU at 100-200 messages/sec, which is the load on the 3.4.4 production cluster. In production, however, 3.8.5 uses 50% more CPU than 3.4.4.

Below is the output of rabbitmq-diagnostics runtime_thread_stats, which shows most of the time going to other (I assume that is busy waiting). The default scheduler busy wait threshold (sbwt) is medium in both versions of Erlang. When I changed it to none, CPU utilization went down from 4-5 CPUs to 1 CPU. When I tried very_short, CPU utilization was the same as on 3.4.4, which has sbwt set to medium (the default according to the Erlang docs). Is this a known issue?

Average thread real-time    :  5000228 us
Accumulated system run-time : 12160527 us
Average scheduler run-time  :  1511950 us

        Thread      aux check_io emulator       gc    other     port    sleep

Stats per thread:
 scheduler( 1)    0.47%    0.52%    7.89%    0.54%   52.64%    1.27%   36.68%
 scheduler( 2)    0.53%    0.49%    7.15%    0.52%   51.70%    1.11%   38.51%
 scheduler( 3)    0.50%    0.50%    7.46%    0.51%   52.58%    1.17%   37.28%
 scheduler( 4)    0.40%    0.35%    5.33%    0.36%   46.75%    0.79%   46.01%
(schedulers 5 to 8 omitted)
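To compare runs, the per-thread rows above can be reduced to a couple of numbers. The following is a minimal sketch in Python, assuming the row layout shown in this output; the helper names (`parse_rows`, `mean_column`, `other_share_of_awake`) are hypothetical, and treating `other` as a proxy for busy waiting is the poster's assumption, not something the output itself states.

```python
import re

# Header and rows copied from the runtime_thread_stats output in this thread.
SAMPLE = """\
        Thread      aux check_io emulator       gc    other     port    sleep
 scheduler( 1)    0.47%    0.52%    7.89%    0.54%   52.64%    1.27%   36.68%
 scheduler( 2)    0.53%    0.49%    7.15%    0.52%   51.70%    1.11%   38.51%
 scheduler( 3)    0.50%    0.50%    7.46%    0.51%   52.58%    1.17%   37.28%
 scheduler( 4)    0.40%    0.35%    5.33%    0.36%   46.75%    0.79%   46.01%
"""

COLUMNS = ["aux", "check_io", "emulator", "gc", "other", "port", "sleep"]

# Matches rows such as " scheduler( 1)    0.47% ..."; the header row does not match.
ROW = re.compile(r"^\s*(\w+)\(\s*(\d+)\)\s+(.*)$")


def parse_rows(text):
    """Return {"scheduler(1)": {"aux": 0.47, ...}, ...} for every thread row."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        values = [float(v.rstrip("%")) for v in match.group(3).split()]
        if len(values) != len(COLUMNS):
            continue  # skip rows that do not have the expected columns
        rows[f"{match.group(1)}({match.group(2)})"] = dict(zip(COLUMNS, values))
    return rows


def mean_column(rows, column):
    """Average percentage of one column across all parsed threads."""
    return sum(row[column] for row in rows.values()) / len(rows)


def other_share_of_awake(row):
    """Fraction of non-sleeping time that a thread spent in 'other'."""
    awake = 100.0 - row["sleep"]
    return row["other"] / awake if awake > 0 else 0.0


if __name__ == "__main__":
    parsed = parse_rows(SAMPLE)
    print(f"mean other: {mean_column(parsed, 'other'):.2f}%")
    for name, row in parsed.items():
        print(f"{name}: {other_share_of_awake(row):.0%} of awake time in other")
```

Running the same reduction before and after changing the busy wait threshold gives a like-for-like comparison of how much scheduler time falls into `other`.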
michaelklishin commented 4 years ago

This is a known aspect of how the Erlang runtime works.

You are right to look at runtime thread metrics as without them, we cannot suggest anything. In 3.8, there are more processes for every virtual host involved. It also inevitably has more features which may affect CPU usage depending on the workload.

The default value of +sbwt and friends was changed to very_short in modern RabbitMQ releases, although I would not be surprised if https://github.com/rabbitmq/rabbitmq-server/pull/2142 unintentionally changed this default (it was backported for 3.8.4). We have also considered disabling it entirely but that did not have a meaningful impact compared to the default. In fact, it's the number of stats-emitting entities that matters, not the message rates.

There is an easy way to reduce CPU context switching and utilization rate for systems under low to moderate load (100-200 messages a second falls into that category).

This rabbitmq-users thread was an inspiration for the above changes.

michaelklishin commented 4 years ago

Looks like we have changed the default for the nodes we start in integration tests but not the default Erlang VM startup arguments. But this certainly has been discussed, in fact, more than once, so benchmarks must have shown that this is not a no-brainer (some workloads were negatively affected).

gopivalleru commented 4 years ago

Mike,

Setting either of the values below will certainly reduce CPU utilization.

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt very_short +sbwtdcpu very_short +sbwtdio very_short"

or

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+sbwt none +sbwtdcpu none +sbwtdio none"
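When switching between thresholds for repeated test runs, it can help to generate the setting rather than hand-edit it. This is a small sketch, assuming the threshold values documented for the erl flags (none, very_short, short, medium, long, very_long) and that all three flags get the same value, as in the two lines above. The function names are hypothetical.

```python
# Threshold values accepted by +sbwt, +sbwtdcpu and +sbwtdio (assumed from the erl docs).
VALID_THRESHOLDS = ("none", "very_short", "short", "medium", "long", "very_long")

# The three busy wait flags used in the settings quoted in this thread.
FLAGS = ("+sbwt", "+sbwtdcpu", "+sbwtdio")


def busy_wait_erl_args(threshold):
    """Build the flag string, applying one threshold to all three flags."""
    if threshold not in VALID_THRESHOLDS:
        raise ValueError(
            f"unknown threshold {threshold!r}; expected one of {VALID_THRESHOLDS}"
        )
    return " ".join(f"{flag} {threshold}" for flag in FLAGS)


def env_line(threshold):
    """Format the setting as an environment variable assignment."""
    return f'RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="{busy_wait_erl_args(threshold)}"'


if __name__ == "__main__":
    for value in ("very_short", "none"):
        print(env_line(value))
```

Validating the value up front catches typos before the node is started with them.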

But here is what I want to understand: erts-5.10.4 has only the sbwt flag, with a default value of medium (http://erlang.org/documentation/doc-5.10.4/erts-5.10.4/doc/html/erl.html). ERTS 11.0.3 also has the sbwt flag with a default value of medium (http://erlang.org/documentation/doc-11.0-rc3/erts-11.0/doc/html/erl.html#+sbt), but additionally introduces sbwtdcpu and sbwtdio, with a default value of short. So as far as the settings go, nothing has changed in the scheduler busy wait threshold, yet I see twice the CPU utilization with 11.0.3.

  1. Does anyone know whether there were significant changes in scheduling that would cause this?
  2. If so, apart from consuming CPU when traffic is sparse, does this have any negative impact when traffic is continuous and the number of messages doubles? Technically, in that scenario the scheduler threads won't be in the busy wait state.
  3. I tried to replicate my production scenario with 500 messages/sec for 10 seconds followed by 3 seconds of idle time, but RabbitMQ used only 1 CPU, with less than 10% spent busy waiting. I want to reproduce a workload where 3.8.5 spends 50% of its time busy waiting and consumes x CPUs while 3.4.4 consumes x/2 CPUs. Then I would go back to 3.8.5, set sbwt to very_short, and run the test again to see whether throughput is significantly affected.

./runjava com.rabbitmq.perf.PerfTest --queue-pattern 'perf-test-%d' --queue-pattern-from 1 --queue-pattern-to 6 --producers 500 --consumers 500 --json-body --size 16000 --body-content-type application/json --variable-rate 1:10 --variable-rate 0:3
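The offered load of this command can be worked out from the two --variable-rate steps. The sketch below assumes that each step is RATE:DURATION with the rate applied per producer, and that the steps repeat in a cycle. Both are assumptions about PerfTest's behavior and are worth checking against its documentation. The function names are hypothetical.

```python
def parse_variable_rates(specs):
    """Turn ["1:10", "0:3"] into [(rate_per_producer, duration_seconds), ...]."""
    steps = []
    for spec in specs:
        rate_text, duration_text = spec.split(":")
        rate, duration = int(rate_text), int(duration_text)
        if rate < 0 or duration <= 0:
            raise ValueError(f"invalid step {spec!r}")
        steps.append((rate, duration))
    return steps


def peak_rate(steps, producers):
    """Highest aggregate publish rate (messages/sec) across all steps."""
    return max(rate for rate, _ in steps) * producers


def average_rate(steps, producers):
    """Aggregate publish rate (messages/sec) averaged over one full cycle."""
    total_messages = sum(rate * duration for rate, duration in steps) * producers
    total_seconds = sum(duration for _, duration in steps)
    return total_messages / total_seconds


def idle_fraction(steps):
    """Share of each cycle during which producers publish nothing."""
    idle_seconds = sum(duration for rate, duration in steps if rate == 0)
    return idle_seconds / sum(duration for _, duration in steps)


if __name__ == "__main__":
    schedule = parse_variable_rates(["1:10", "0:3"])
    print("peak msg/s:", peak_rate(schedule, producers=500))
    print("average msg/s:", round(average_rate(schedule, producers=500), 1))
    print("idle share of cycle:", round(idle_fraction(schedule), 3))
```

Under those assumptions, the peak matches the 500 messages/sec mentioned in item 3, while the idle share shows how much of each cycle leaves the schedulers with nothing to do.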

I have to rely on top -p -H for 3.4.4, as runtime_thread_stats is not available in that version.

gopivalleru commented 4 years ago

I'm using the mailing list thread below: https://groups.google.com/g/rabbitmq-users/c/BYJzgySEdr8/m/3uZfeLjbCQAJ

michaelklishin commented 4 years ago

Busy waiting can be a net positive for workloads that are bursty in nature (involve short pauses and spikes). Questions about Erlang scheduling are best directed at the Erlang mailing list, erlang-users. Currently RabbitMQ does not override any relevant defaults.

minusdavid commented 4 years ago

With RabbitMQ 3.6.10 and Erts 9.2, I'm using "+sbwt none" which seems to reduce the impact of busy wait, but does not seem to completely eliminate it.

https://www.rabbitmq.com/runtime.html#busy-waiting says that the runtime can put schedulers to sleep, but I haven't seen any evidence that schedulers ever actually go to sleep (i.e. no sleep-related syscalls, and beam.smp spews out a long stream of clock_gettime and futex syscalls).

minusdavid commented 4 years ago

Might this be an issue with older versions of RabbitMQ/Erlang, or is the documentation not quite as clear as it could be?

lukebakken commented 4 years ago

Closing this discussion as it has moved to the mailing list:

https://groups.google.com/d/topic/rabbitmq-users/BYJzgySEdr8/discussion