noxdafox / rabbitmq-cloudwatch-exporter

RabbitMQ Plugin for publishing cluster metrics to AWS CloudWatch
Mozilla Public License 2.0

bug with cluster name / node IP metric aggregation #44

Open JoeTaylor95 opened 1 year ago

JoeTaylor95 commented 1 year ago

This affects clusters that run in an ASG, or anywhere nodes are likely to be replaced.

The cluster name is likely to change along with the node IP, so these really shouldn't be part of the metric aggregation: an alarm created against these values breaks when a node is replaced and there's a new leader.

It makes it impossible to create an alarm based off these metrics.

A fix for this would be to remove the metric aggregation and to have the cluster name parametrised.

noxdafox commented 1 year ago

Hello,

The issue at hand is not clear as stated. Could you please clarify?

The plugin does not aggregate metrics; this is done by the broker itself via the rabbitmq-management plugin.

You can already customize the dimensions by setting your preferred namespace configuration value.

JoeTaylor95 commented 1 year ago

Hi,

The issue is with the metrics that are aggregated: the node and the cluster name are both part of the aggregation. If an alarm is created, its metric filter includes the aggregated values, so both node and cluster have to be specified. The problem comes when the node is replaced, because this value (rightfully) changes.

A solution would be to use a custom cluster name and a custom node name, or to remove the node value from the aggregated metrics, as the cluster name can be changed.

noxdafox commented 1 year ago

Can you highlight which of the aggregated metrics you are interested in?

What kind of alarms are you trying to set up with CW metric filters?

Are you aware that you can actually set node names yourself via the RABBITMQ_NODENAME environment variable, so that they remain static when the ASG rotates the instances?

JoeTaylor95 commented 1 year ago

Sure. I'm creating alarms from the following CW aggregation: [Cluster, Limit, Metric, Node, Type]. Specifically FileDescriptors (threshold > 30000), but I'm also looking at DiskFree and Memory.
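To make the alarm setup described here concrete, the following is a minimal sketch and not the plugin's or the reporter's actual configuration. It relies on the fact that CloudWatch treats every distinct combination of dimensions as a separate metric, so an alarm only watches the exact dimension set it was created with. The namespace `RabbitMQ`, the use of `FileDescriptors` as the metric name, and all dimension values (including the `<placeholder>` ones) are assumptions made for illustration.

```python
def build_alarm(name, metric_name, dimensions, threshold):
    """Return a dict shaped like a CloudWatch PutMetricAlarm request.

    Nothing is sent to AWS; the dict only illustrates which values
    the alarm ends up pinned to.
    """
    return {
        "AlarmName": name,
        "Namespace": "RabbitMQ",  # assumed namespace, adjust to your setup
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": key, "Value": value}
            for key, value in sorted(dimensions.items())
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def alarm_watches(alarm, published_dimensions):
    """True only if the published dimension set equals the alarm's exactly."""
    pinned = {d["Name"]: d["Value"] for d in alarm["Dimensions"]}
    return pinned == published_dimensions


# Hypothetical dimension values; only the dimension names come from the thread.
before_replacement = {
    "Cluster": "rabbit@ip-10-0-1-15",
    "Limit": "<placeholder>",
    "Metric": "<placeholder>",
    "Node": "rabbit@ip-10-0-1-15",
    "Type": "<placeholder>",
}
# After the ASG replaces the instance, the default hostname (and Node value) changes.
after_replacement = dict(before_replacement, Node="rabbit@ip-10-0-2-31")

alarm = build_alarm("file-descriptors-high", "FileDescriptors",
                    before_replacement, threshold=30000)

still_watched = alarm_watches(alarm, after_replacement)
```

Under these assumptions `still_watched` is `False`: the alarm keeps pointing at the old node's metric, which no longer receives datapoints, and has to be recreated with the new `Node` value.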

Currently the ASG replaces instances mainly for system patching, so this happens at least once a month. The node hostnames use the AWS defaults (derived from the LAN IP), which is fine.

The issue I have is that when there are multiple nodes in a cluster, the alarms could be aggregated, because the node value is irrelevant. I only care about the cluster name, as the above metrics would follow each other across the cluster.

Plus, when creating an alarm, if the filter is set to cluster name Xyz and FileDescriptors > 30k, then this would cover all active nodes within the cluster.

If, for example, I were to preserve the node name, it wouldn't work with CW alarms either: the node name has to be unique, so a preserved name would conflict with that of the old node while its connections are being drained.

I hope this makes sense. Effectively, if there are any scaling actions in an ASG, all the CW alarms need updating accordingly.

A fix for this would be to allow removing the node value so it's not included in the aggregation, or to have an additional metric aggregation where it is not included.
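The requested behaviour can be illustrated with a minimal sketch, assuming per-node samples are available as `(dimensions, value)` pairs. This is not how the plugin works today; the function name `aggregate_without`, the sample values, and the choice of `max` as the combining function are all hypothetical.

```python
from collections import defaultdict


def aggregate_without(datapoints, dropped="Node", combine=max):
    """Group samples by every dimension except `dropped` and combine the values.

    `datapoints` is an iterable of (dimensions_dict, value) pairs.
    Returns a dict keyed by the sorted tuple of the remaining dimensions.
    """
    grouped = defaultdict(list)
    for dimensions, value in datapoints:
        key = tuple(sorted(
            (name, val) for name, val in dimensions.items() if name != dropped
        ))
        grouped[key].append(value)
    return {key: combine(values) for key, values in grouped.items()}


# Hypothetical file descriptor counts for two nodes of one cluster.
samples = [
    ({"Cluster": "Xyz", "Node": "rabbit@ip-10-0-1-15"}, 12000),
    ({"Cluster": "Xyz", "Node": "rabbit@ip-10-0-2-31"}, 31000),
]

per_cluster = aggregate_without(samples)
in_alarm = per_cluster[(("Cluster", "Xyz"),)] > 30000
```

With the node value dropped, the resulting series is keyed by cluster only, so a threshold such as 30000 would apply to whichever node is currently the worst and would survive node replacement.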

The cluster name would also need to be customisable, but I think this might be defined in RabbitMQ clustering.