rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Node can use significant (e.g. 80GB) amounts of RAM during startup #2254

Closed gerhard closed 4 years ago

gerhard commented 4 years ago

Part of TGIR S01E02: Help! RabbitMQ ate my RAM! I came across the following unexpected behaviour when the rabbit app gets restarted via rabbitmqctl stop_app ; rabbitmqctl start_app:

[Screenshots from 2020-02-24 at 13:45:18 and 13:57:31: node memory usage during start_app]

A worker process is using 56GB of physical memory while reading through all queue segments, across all queues. This operation takes ~30 minutes to complete (13:44:16 - 14:15:46) with the disk reads starting at 410MiB/s and maxing out at 965MiB/s for 15 minutes straight:

[Chart: disk read throughput during the segment scan]

This is what the system physical memory breakdown looks like:

[Chart: system physical memory breakdown]

Notice:

  1. Slow 1.8GiB -> 7.7GiB memory growth spanning ~13 minutes (13:28:00 - 13:41:00)
  2. Large spike to 42GiB the next minute (13:42:00)
  3. Another spike to 60GiB, for which cached memory gets freed up (13:43:00)
  4. Last spike to 80GiB (13:43:45) 🤯
  5. A drop to 60GiB for 30 minutes during which the disk gets hammered (13:44:15)
  6. Drop to 8GiB (14:16:00) followed by a slow ramp-up during which publishers/consumers get reconnected
  7. Regular 11.5GiB -> 17.5GiB spikes, as expected for this workload.

I have captured the runtime thread stats, which show the regular schedulers 88% idle (most of their busy time is spent in send) and the dirty_io schedulers 94% idle. All other thread types are 98% or more idle.

rabbitmq-diagnostics runtime_thread_stats --sample-interval 60 | grep -v async
Will collect runtime thread stats on rabbit@tgir-s01e02-gerhard-rabbitmq-durable for 60 seconds...
Average thread real-time    :  60152930 us
Accumulated system run-time : 162534659 us
Average scheduler run-time  :   7169362 us

        Thread    alloc      aux      bif busy_wait check_io emulator      ets       gc  gc_full      nif    other     port     send    sleep   timers

Stats per thread:
       aux( 1)    0.59%    0.64%    0.00%    0.00%    0.12%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   98.65%    0.00%
dirty_cpu_( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 2)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 3)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 5)    0.57%    0.00%    0.00%    2.62%    0.00%    0.02%    0.00%    1.75%    0.77%    0.00%    0.01%    0.00%    0.00%   94.26%    0.00%
dirty_cpu_( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 7)    0.56%    0.00%    0.00%    2.61%    0.00%    0.02%    0.00%    1.68%    0.72%    0.00%    0.01%    0.00%    0.00%   94.40%    0.00%
dirty_cpu_( 8)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_( 9)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(11)    0.66%    0.00%    0.00%    2.96%    0.00%    0.02%    0.00%    1.92%    0.83%    0.00%    0.01%    0.00%    0.00%   93.60%    0.00%
dirty_cpu_(12)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(13)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(14)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(15)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_cpu_(16)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 1)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 2)    0.03%    0.00%    0.00%    1.22%    0.00%    0.02%    0.00%    0.00%    0.00%    6.10%    0.01%    0.00%    0.00%   92.63%    0.00%
dirty_io_s( 3)    0.01%    0.00%    0.00%    0.49%    0.00%    0.01%    0.00%    0.00%    0.00%    2.62%    0.00%    0.00%    0.00%   96.87%    0.00%
dirty_io_s( 4)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 5)    0.05%    0.00%    0.00%    2.14%    0.00%    0.03%    0.00%    0.00%    0.00%   10.72%    0.01%    0.00%    0.00%   87.05%    0.00%
dirty_io_s( 6)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
dirty_io_s( 7)    0.06%    0.00%    0.00%    1.67%    0.00%    0.02%    0.00%    0.00%    0.00%    8.26%    0.01%    0.00%    0.00%   89.98%    0.00%
dirty_io_s( 8)    0.06%    0.00%    0.00%    2.36%    0.00%    0.03%    0.00%    0.00%    0.00%   11.10%    0.01%    0.00%    0.00%   86.44%    0.00%
dirty_io_s( 9)    0.06%    0.00%    0.00%    2.20%    0.00%    0.03%    0.00%    0.00%    0.00%   11.07%    0.01%    0.00%    0.00%   86.63%    0.00%
dirty_io_s(10)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
      poll( 0)    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
 scheduler( 1)    3.07%    0.36%    0.07%    4.51%    0.01%   13.89%    0.69%    0.12%    0.00%    0.01%    0.08%    0.00%   21.67%   55.52%    0.01%
 scheduler( 2)    4.03%    0.29%    0.10%    4.62%    0.00%   18.43%    0.89%    0.16%    0.00%    0.01%    0.10%    0.00%   27.04%   44.34%    0.00%
 scheduler( 3)    1.93%    0.25%    0.07%    2.81%    0.00%    8.98%    0.48%    0.08%    0.00%    0.01%    0.05%    0.00%   23.87%   61.46%    0.00%
 scheduler( 4)    2.84%    0.27%    0.08%    3.69%    0.00%   13.01%    0.67%    0.12%    0.00%    0.01%    0.07%    0.00%   26.82%   52.40%    0.00%
 scheduler( 5)    0.00%    0.00%    0.00%    0.36%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.63%    0.00%
 scheduler( 6)    0.00%    0.01%    0.00%    0.43%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.56%    0.00%
 scheduler( 7)    0.00%    0.00%    0.00%    0.31%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.68%    0.00%
 scheduler( 8)    0.00%    0.00%    0.00%    0.31%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.68%    0.00%
 scheduler( 9)    0.00%    0.00%    0.00%    0.32%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.67%    0.00%
 scheduler(10)    0.00%    0.00%    0.00%    0.32%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.38%    0.00%    0.00%   99.30%    0.00%
 scheduler(11)    0.00%    0.00%    0.00%    0.30%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.70%    0.00%
 scheduler(12)    0.00%    0.00%    0.00%    0.40%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.59%    0.00%
 scheduler(13)    0.00%    0.00%    0.00%    0.33%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.67%    0.00%
 scheduler(14)    0.00%    0.00%    0.00%    0.32%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.67%    0.00%
 scheduler(15)    0.00%    0.00%    0.00%    0.31%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.68%    0.00%
 scheduler(16)    0.00%    0.00%    0.00%    0.28%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   99.71%    0.00%

Stats per type:
           aux    0.59%    0.64%    0.00%    0.00%    0.12%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%   98.65%    0.00%
dirty_cpu_sche    0.11%    0.00%    0.00%    0.51%    0.00%    0.00%    0.00%    0.33%    0.14%    0.00%    0.00%    0.00%    0.00%   98.89%    0.00%
dirty_io_sched    0.03%    0.00%    0.00%    1.01%    0.00%    0.01%    0.00%    0.00%    0.00%    4.99%    0.01%    0.00%    0.00%   93.96%    0.00%
          poll    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%    0.00%  100.00%    0.00%
     scheduler    0.74%    0.08%    0.02%    1.23%    0.00%    3.39%    0.17%    0.03%    0.00%    0.00%    0.04%    0.00%    6.21%   88.08%    0.00%

I am attaching the netdata export: netdata-c8a3200a7c2f-20200224-130158-6846.snapshot.zip

There is no Prometheus metrics snapshot for the Erlang VM since this behaviour happens before the Prometheus app is started:

[Screenshot: Erlang VM metrics missing in Prometheus for this period]

pjk25 commented 4 years ago

After doing some investigation, I believe this is caused by the rebuilding of message refcounts by rabbit_msg_store. We scan the disk asynchronously, and in a few limited tests I saw that we scan and enqueue faster than we consume, resulting in a very large backlog of messages in memory. The [Erlang] queues in question are unbounded, so my initial thought is to bound those queues in such a way that puts block, which would throttle the disk reads and limit the amount of data held transiently in memory.
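Roughly, the idea I had in mind is a bounded buffer whose "put" parks the calling producer once a limit is reached, so the disk scan can never get arbitrarily far ahead of the consumer. A hypothetical sketch of that (module and function names are made up; this is not RabbitMQ code):

%% A bounded buffer: in/2 blocks the calling producer once Limit items
%% are queued, and out/1 wakes one parked producer when it frees a slot.
-module(bounded_buffer).
-behaviour(gen_server).

-export([start_link/1, in/2, out/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link(Limit) -> gen_server:start_link(?MODULE, Limit, []).

in(Pid, Item) -> gen_server:call(Pid, {in, Item}, infinity).  %% blocks while full
out(Pid)      -> gen_server:call(Pid, out, infinity).

%% State: {Limit, Items, ParkedProducers}
init(Limit) -> {ok, {Limit, queue:new(), queue:new()}}.

handle_call({in, Item}, From, {Limit, Items, Parked}) ->
    case queue:len(Items) >= Limit of
        true  -> {noreply, {Limit, Items, queue:in({From, Item}, Parked)}};  %% park, reply later
        false -> {reply, ok, {Limit, queue:in(Item, Items), Parked}}
    end;
handle_call(out, _From, {Limit, Items, Parked}) ->
    case queue:out(Items) of
        {{value, Item}, Items1} ->
            case queue:out(Parked) of
                {{value, {From, PItem}}, Parked1} ->
                    gen_server:reply(From, ok),  %% wake one parked producer
                    {reply, {value, Item}, {Limit, queue:in(PItem, Items1), Parked1}};
                {empty, _} ->
                    {reply, {value, Item}, {Limit, Items1, Parked}}
            end;
        {empty, _} ->
            {reply, empty, {Limit, Items, Parked}}
    end.

handle_cast(_Msg, State) -> {noreply, State}.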

michaelklishin commented 4 years ago

@pjk25 we already have a mechanism for the producer-consumer problem and unbounded buffers in RabbitMQ: it's called credit flow. It's a producer-side solution, so it cannot be used if we don't control the "producer side" of this process interaction, but it sounds like we do. No need for blocking puts; producers can self-limit using credit_flow.
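To illustrate the concept only (a simplified, self-contained sketch of the credit-flow idea, not the actual credit_flow API from rabbit_common): the producer starts with some credit, spends one unit per message and blocks when it runs out; the consumer grants a fresh batch of credit only after it has processed a batch. The producer therefore self-limits and can never run more than one batch ahead.

-module(credit_demo).
-export([producer/3, consumer/2]).

producer(_Consumer, [], _Credit) ->
    done;
producer(Consumer, Items, 0) ->
    %% Out of credit: self-limit by waiting for the consumer to grant more.
    receive {bump_credit, More} -> producer(Consumer, Items, More) end;
producer(Consumer, [Item | Rest], Credit) ->
    Consumer ! {item, self(), Item},
    producer(Consumer, Rest, Credit - 1).

consumer(BatchSize, ProcessFun) ->
    consumer_loop(BatchSize, BatchSize, ProcessFun).

consumer_loop(Left, BatchSize, ProcessFun) ->
    receive
        {item, Producer, Item} ->
            ProcessFun(Item),
            case Left of
                1 -> Producer ! {bump_credit, BatchSize},  %% batch done: grant new credit
                     consumer_loop(BatchSize, BatchSize, ProcessFun);
                _ -> consumer_loop(Left - 1, BatchSize, ProcessFun)
            end
    end.

%% Example (termination handling omitted for brevity): the producer never
%% gets more than ~100 items ahead of the consumer, however fast it can read.
%% C = spawn(fun() -> credit_demo:consumer(100, fun(_I) -> timer:sleep(1) end) end),
%% credit_demo:producer(C, lists:seq(1, 100000), 100).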

pjk25 commented 4 years ago

After further exploration, I have learned that the gatherer managing the above-mentioned queues blocks its publishers until messages are consumed, invalidating the above theory.

pjk25 commented 4 years ago

@michaelklishin thanks for the insight. The queues I was looking at were these https://github.com/rabbitmq/rabbitmq-server/blob/f8d47976d999e8816dacf981f86b71f387dc18fc/src/gatherer.erl#L90, so not "rabbit" queues, although I'm sure I could have made my comment clearer.

pjk25 commented 4 years ago

It's worth noting that even though we shut down the broker with rabbitmqctl stop_app first, we still follow this code path, which I believe corresponds to an unclean shutdown https://github.com/rabbitmq/rabbitmq-server/blob/abc0b522de83e36650e873d2065646dc1f4ae1d4/src/rabbit_msg_store.erl#L1735

Also, the call to https://github.com/rabbitmq/rabbitmq-server/blob/abc0b522de83e36650e873d2065646dc1f4ae1d4/src/rabbit_msg_store.erl#L1737 takes about 10 minutes to complete, before we get to the point where memory is consumed rapidly.
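For context, that recovery path has to rebuild the message reference counts by scanning the store files, since (as I understand it) an unclean shutdown leaves no saved index to recover from. Very roughly, the shape of that work is something like this (a hypothetical sketch, not the actual rabbit_msg_store code):

%% Every message id seen while scanning the store files bumps a counter
%% in an ETS table; the resulting table approximates the rebuilt index.
-module(refcount_rebuild_sketch).
-export([rebuild/1]).

rebuild(MsgIdsPerFile) ->                       %% e.g. [[IdA, IdB], [IdB, IdC], ...]
    Tab = ets:new(msg_refcounts, [set]),
    lists:foreach(
      fun(MsgIds) ->
              lists:foreach(
                fun(MsgId) ->
                        %% Increment the refcount, inserting {MsgId, 0} if absent.
                        ets:update_counter(Tab, MsgId, {2, 1}, {MsgId, 0})
                end, MsgIds)
      end, MsgIdsPerFile),
    Tab.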

gerhard commented 4 years ago

I've just tested this in the context of TGIR S01E03 and can confirm that #2274 fixes this.

The memory usage doesn't go above 8GiB, and even though the recovery time didn't improve (it still took ~45 minutes for the node to restart), the memory usage was perfect: the ETS data structures used the expected amount of memory, and all other components used only what they needed.

Watch the video if you're interested in seeing how this was confirmed 🙌🏻

TL;DW

[Screenshots: memory usage staying below 8GiB during the TGIR S01E03 test]