rabbitmq / cluster-operator

RabbitMQ Cluster Kubernetes Operator
https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
Mozilla Public License 2.0

Quorum queue memory usage, and sensible defaults for node resource requests #1553

Closed (tkoft closed 9 months ago)

tkoft commented 9 months ago

Describe the bug

We occasionally find that memory alarms get triggered in our GKE RabbitMQ cluster (3 nodes), causing issues with publishes. Digging through the docs, I've learned that:

It seems like for a cluster using these defaults, a memory alarm will certainly be triggered at some point, even with just one quorum queue? There are reports of some folks seeing this issue unless they lower the default WAL size limit.
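
To make the concern concrete, here is a minimal sketch of the arithmetic in Python. Every number in it is an assumption for illustration, not a value taken from this issue: a 2 GiB memory limit per node, a high watermark of 0.4, a WAL size limit of 512 MiB, and a hypothetical `wal_memory_factor` standing in for how much memory a node may hold per byte of WAL before it is released. Substitute the values from your own cluster.

```python
MIB = 1024 * 1024


def alarm_threshold_bytes(memory_limit_bytes: int, high_watermark: float) -> int:
    """Memory usage at which the node would raise a memory alarm."""
    return int(memory_limit_bytes * high_watermark)


def wal_fits_under_watermark(
    memory_limit_bytes: int,
    high_watermark: float,
    wal_max_size_bytes: int,
    wal_memory_factor: float = 1.0,  # hypothetical multiplier, not a RabbitMQ setting
) -> bool:
    """True if the WAL-related memory alone stays below the alarm threshold."""
    threshold = alarm_threshold_bytes(memory_limit_bytes, high_watermark)
    return wal_max_size_bytes * wal_memory_factor < threshold


# Assumed example values; check your own cluster's configuration.
memory_limit = 2048 * MIB
watermark = 0.4
wal_limit = 512 * MIB

threshold = alarm_threshold_bytes(memory_limit, watermark)
print(f"alarm threshold: {threshold / MIB:.1f} MiB")
for factor in (1.0, 2.0, 3.0):
    ok = wal_fits_under_watermark(memory_limit, watermark, wal_limit, factor)
    print(f"WAL x{factor:.0f} = {wal_limit * factor / MIB:.0f} MiB, fits: {ok}")
```

Under these assumed values the threshold is roughly 819 MiB, so a single full WAL fits, but anything that holds twice the WAL limit in memory would already cross it. The check ignores every other consumer of memory on the node, so it is optimistic.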

I was also unclear on what happens when these alarms are set. The docs say that publishes are blocked, but does that apply only to the offending node or to the whole cluster? I did find this in the AWS MQ docs:

In cluster deployments, queues might experience paused synchronization of messages between replicas on different nodes. Paused queue syncs prevent consumption of messages from queues and must be addressed separately while resolving the memory alarm.

So a couple questions:

Regardless, it seems to me that more sensible defaults could be configured here.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy a cluster using RabbitMQ Cluster Kubernetes Operator
  2. Publish to and consume from a quorum queue on the cluster
  3. Observe that memory usage increases on a node until a memory alarm is set
  4. From memory use reporting on the node, observe that quorum queue tables are what's growing and causing the alarm
  5. Publishes start getting blocked, even though two other nodes are under the high-watermark

Expected behavior

By default, I expect the quorum queue WAL size threshold and the cluster operator's memory requests to work with each other, so that memory alarms aren't triggered by normal usage of quorum queues.

Screenshots

[Image: Screenshot 2024-02-06 at 1:40:45 PM]

Version and environment information

lukebakken commented 9 months ago

This discussion may be informative - https://github.com/rabbitmq/cluster-operator/discussions/1537#discussioncomment-8384910

cc @mkuratczyk