Describe the bug
We occasionally find in our GKE RabbitMQ cluster (3 nodes) that memory alarms get triggered, causing issues with publishes. Digging through the docs, I've learned the following:
It seems like, for a cluster using the defaults, a memory alarm will almost certainly be triggered at some point, even with just one quorum queue. There are reports of other users hitting this issue unless they lower the default WAL size limit.
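For reference, the kind of override people describe looks roughly like the sketch below; the cluster name and the 64 MiB value are assumptions for illustration, not recommendations.

```yaml
# Sketch: lowering the quorum queue WAL size limit through the Cluster Operator.
# The cluster name and the 64 MiB value are illustrative assumptions only.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: example-cluster
spec:
  replicas: 3
  rabbitmq:
    additionalConfig: |
      # default is 512 MiB; lowering it reduces how much WAL data a node keeps in memory
      raft.wal_max_size_bytes = 67108864
```

The additionalConfig lines are appended to rabbitmq.conf by the operator.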
I was also unclear on what happens when these alarms are set. The docs say that publishes are blocked, but is that only on the offending node, or across the whole cluster? I did find this in the AWS MQ docs:
In cluster deployments, queues might experience paused synchronization of messages between replicas on different nodes. Paused queue syncs prevent consumption of messages from queues and must be addressed separately while resolving the memory alarm.
So a couple questions:
Do memory alarms indeed block publishes to all nodes? Or just the one?
If so, why is this the case? Is there any way to mitigate the risk of the entire cluster being paused by one node triggering an alarm?
If not, can a memory alarm on one node cause quorum queue replicas on other nodes to stop accepting publishes too, like what's described in the AWS docs?
Regardless, it seems to me that more sensible defaults could be configured here.
To Reproduce
Steps to reproduce the behavior:
Deploy a cluster using the RabbitMQ Cluster Kubernetes Operator (a sketch of the kind of cluster definition we use is shown after this list)
Publish to and consume from a quorum queue on the instance
Observe memory usage increase on a node until the memory alarm is set
From the memory use reporting on that node, observe that quorum queue tables are what is growing and causing the alarm
Publishes start getting blocked, even though the two other nodes are below the high watermark
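For context, a minimal cluster definition along these lines is enough to reproduce; the name and memory sizing are assumptions reflecting our setup, and everything else is left at operator defaults.

```yaml
# Sketch of the cluster definition used to reproduce.
# Name and memory sizing are assumptions; all RabbitMQ settings are left at their defaults.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: repro-cluster
spec:
  replicas: 3
  resources:
    requests:
      memory: 1Gi
    limits:
      memory: 1Gi
```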
Expected behavior
By default, I expect the quorum queue WAL size threshold and the cluster operator's memory requests to work with each other so that memory alarms aren't triggered by normal usage of quorum queues.
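To illustrate the mismatch as I understand the defaults: with a 1 Gi memory request and the default relative high watermark of 0.4, the alarm threshold is roughly 410 MiB, while the default quorum queue WAL limit is 512 MiB, so a single WAL can cross the threshold on its own. Coordinating the two by hand today looks something like the sketch below; all values are assumptions, not recommendations.

```yaml
# Sketch: coordinating the memory request, high watermark, and WAL limit manually.
# All values here are illustrative assumptions, not recommended settings.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: example-cluster
spec:
  replicas: 3
  resources:
    requests:
      memory: 2Gi
    limits:
      memory: 2Gi
  rabbitmq:
    additionalConfig: |
      # set the alarm threshold explicitly, below the container memory limit
      vm_memory_high_watermark.absolute = 1536MB
      # keep the WAL limit (256 MiB here) comfortably below the watermark
      raft.wal_max_size_bytes = 268435456
```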
Screenshots
Version and environment information
RabbitMQ: Cluster operator default (3.10.2)
RabbitMQ Cluster Operator: 2.1.0
Kubernetes: 1.26.6-gke.1700
Cloud provider or hardware configuration: GKE autopilot