Dubious metric names in Prometheus

cykl commented 1 month ago

While inspecting Cassandra-related metrics in our Prometheus server, I noticed that some metrics were badly relabelled from the Cassandra model to the Prometheus model, leading to numerous Prometheus metric names.

For example:

org_apache_cassandra_metrics_connection_large_message_dropped_tasks_201_24_56_224_7000
org_apache_cassandra_metrics_connection_large_message_dropped_tasks_201_24_110_239_7000

I'm using the new metric endpoint without any additional configuration.

Impacted metrics are the following (I stripped the port, ip address, etc. to obtain expected metric name):

io_cassandrareaper_management_ICassandraManagementProxy_connections
io_cassandrareaper_management_jmx_JmxCassandraManagementProxy_cpicassandracluster_repairStatusHandlers
io_cassandrareaper_service_SegmentRunner_abort
org_apache_cassandra_metrics_connection_gossip_message_completed_tasks
org_apache_cassandra_metrics_connection_gossip_message_dropped_tasks
org_apache_cassandra_metrics_connection_gossip_message_pending_tasks
org_apache_cassandra_metrics_connection_large_message_completed_bytes
org_apache_cassandra_metrics_connection_large_message_completed_tasks
org_apache_cassandra_metrics_connection_large_message_dropped_bytes_due_to_error
org_apache_cassandra_metrics_connection_large_message_dropped_bytes_due_to_overload
org_apache_cassandra_metrics_connection_large_message_dropped_bytes_due_to_timeout
org_apache_cassandra_metrics_connection_large_message_dropped_tasks
org_apache_cassandra_metrics_connection_large_message_dropped_tasks_due_to_error
org_apache_cassandra_metrics_connection_large_message_dropped_tasks_due_to_overload
org_apache_cassandra_metrics_connection_large_message_dropped_tasks_due_to_timeout
org_apache_cassandra_metrics_connection_large_message_pending_bytes
org_apache_cassandra_metrics_connection_large_message_pending_tasks
org_apache_cassandra_metrics_connection_small_message_completed_bytes
org_apache_cassandra_metrics_connection_small_message_completed_tasks
org_apache_cassandra_metrics_connection_small_message_dropped_bytes_due_to_error
org_apache_cassandra_metrics_connection_small_message_dropped_bytes_due_to_overload
org_apache_cassandra_metrics_connection_small_message_dropped_bytes_due_to_timeout
org_apache_cassandra_metrics_connection_small_message_dropped_tasks
org_apache_cassandra_metrics_connection_small_message_dropped_tasks_due_to_error
org_apache_cassandra_metrics_connection_small_message_dropped_tasks_due_to_overload
org_apache_cassandra_metrics_connection_small_message_dropped_tasks_due_to_timeout
org_apache_cassandra_metrics_connection_small_message_pending_bytes
org_apache_cassandra_metrics_connection_small_message_pending_tasks
org_apache_cassandra_metrics_connection_timeouts_total
org_apache_cassandra_metrics_connection_urgent_message_completed_bytes
org_apache_cassandra_metrics_connection_urgent_message_completed_tasks
org_apache_cassandra_metrics_connection_urgent_message_dropped_bytes_due_to_error
org_apache_cassandra_metrics_connection_urgent_message_dropped_bytes_due_to_overload
org_apache_cassandra_metrics_connection_urgent_message_dropped_bytes_due_to_timeout
org_apache_cassandra_metrics_connection_urgent_message_dropped_tasks
org_apache_cassandra_metrics_connection_urgent_message_dropped_tasks_due_to_error
org_apache_cassandra_metrics_connection_urgent_message_dropped_tasks_due_to_overload
org_apache_cassandra_metrics_connection_urgent_message_dropped_tasks_due_to_timeout
org_apache_cassandra_metrics_connection_urgent_message_pending_bytes
org_apache_cassandra_metrics_connection_urgent_message_pending_tasks
org_apache_cassandra_metrics_hints_service_hint_delays
org_apache_cassandra_metrics_hints_service_hint_delays_count
org_apache_cassandra_metrics_hints_service_hints_created
org_apache_cassandra_metrics_hints_service_hints_not_stored
org_apache_cassandra_metrics_inbound_connection_corrupt_frames_recovered
org_apache_cassandra_metrics_inbound_connection_corrupt_frames_unrecovered
org_apache_cassandra_metrics_inbound_connection_error_bytes
org_apache_cassandra_metrics_inbound_connection_error_count
org_apache_cassandra_metrics_inbound_connection_expired_bytes
org_apache_cassandra_metrics_inbound_connection_expired_count
org_apache_cassandra_metrics_inbound_connection_processed_bytes
org_apache_cassandra_metrics_inbound_connection_processed_count
org_apache_cassandra_metrics_inbound_connection_received_bytes
org_apache_cassandra_metrics_inbound_connection_received_count
org_apache_cassandra_metrics_inbound_connection_scheduled_bytes
org_apache_cassandra_metrics_inbound_connection_scheduled_count
org_apache_cassandra_metrics_inbound_connection_throttled_count
org_apache_cassandra_metrics_inbound_connection_throttled_nanos
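
For illustration, the stripping used to build the list above can be sketched as follows. This is a minimal sketch, not the exact procedure used: it assumes the peer always appears as a trailing IPv4 address (dots replaced by underscores) followed by a port, as in the two examples at the top of this issue. `strip_peer` and `group_by_base` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Assumption: the peer is encoded as a trailing "_a_b_c_d_port" suffix (IPv4 only).
PEER_SUFFIX = re.compile(r"_\d{1,3}_\d{1,3}_\d{1,3}_\d{1,3}_\d{1,5}$")


def strip_peer(name: str) -> str:
    """Return the metric name without the trailing address and port, if present."""
    return PEER_SUFFIX.sub("", name)


def group_by_base(names):
    """Group raw Prometheus metric names by their stripped base name."""
    groups = defaultdict(list)
    for raw in names:
        groups[strip_peer(raw)].append(raw)
    return dict(groups)


raw_names = [
    "org_apache_cassandra_metrics_connection_large_message_dropped_tasks_201_24_56_224_7000",
    "org_apache_cassandra_metrics_connection_large_message_dropped_tasks_201_24_110_239_7000",
]
groups = group_by_base(raw_names)
```

With the two example names, both raw names end up under the single base name `org_apache_cassandra_metrics_connection_large_message_dropped_tasks`.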

None of those metrics seem to be documented in https://cassandra.apache.org/doc/stable/cassandra/operating/metrics.html#client-metrics but they are easy to find in the Cassandra source code.

Should the default relabel configuration be updated to take those metrics into account or am I doing something wrong?

burmanm commented 1 month ago

> Should the default relabel configuration be updated to take those metrics into account or am I doing something wrong?

Should definitely be added. The current relabeling rules are based on the documented metrics, but clearly this isn't enough to get everything done correctly.

What Cassandra version was used?

burmanm commented 1 month ago

In any case, the metric names come from these lines:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/metrics/InternodeInboundMetrics.java#L54

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/metrics/InternodeOutboundMetrics.java#L122

So we will need a relabeling rule that would catch these for IPv4 & IPv6 and put the target node as a label instead of allowing it to stay in the metricName.
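
To make the intent concrete, here is a rough Python sketch of what such a rule would do. It is not the management API's rule syntax and not the actual fix. It assumes the address is a trailing IPv4 address with underscores instead of dots, followed by the port. IPv6 is deliberately left out because its sanitized form is not shown in this thread. The label names `peer_address` and `peer_port` and the function `relabel` are hypothetical.

```python
import re
from typing import Dict, Tuple

# Assumption: "<base>_<a>_<b>_<c>_<d>_<port>", IPv4 only.
PEER_SUFFIX = re.compile(
    r"^(?P<base>.+?)_(?P<ip>\d{1,3}(?:_\d{1,3}){3})_(?P<port>\d{1,5})$"
)


def relabel(name: str) -> Tuple[str, Dict[str, str]]:
    """Split a metric name into a stable base name and peer labels.

    Names that carry no address suffix are returned unchanged with no labels.
    """
    match = PEER_SUFFIX.match(name)
    if match is None:
        return name, {}
    labels = {
        # Hypothetical label names, chosen only for this sketch.
        "peer_address": match.group("ip").replace("_", "."),
        "peer_port": match.group("port"),
    }
    return match.group("base"), labels


base, labels = relabel(
    "org_apache_cassandra_metrics_connection_large_message_dropped_tasks_201_24_110_239_7000"
)
```

With a rule like this, every peer shares one metric name and the target node becomes a label value, so queries can aggregate or filter by peer.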

cykl commented 1 month ago

We are using Cassandra 4.1.3.

Where are the rules defined? There is this file https://github.com/k8ssandra/management-api-for-apache-cassandra/blob/f85032fa787a4e2a42a37ad16a94eeefa70ff907/management-api-agent-common/src/test/resources/collector-full.yaml (linked from the doc). But src/test/resources leads me to believe that it's not the file that's being used.

burmanm commented 1 month ago

The default rules (you can define your own and they get appended to these) are in https://github.com/k8ssandra/management-api-for-apache-cassandra/blob/master/management-api-agent-common/src/main/resources/default-metric-settings.yaml

burmanm commented 1 month ago

And here's an example of how to append your own rules (like the ones I added to the default):

CassandraDatacenter changes:

https://github.com/k8ssandra/cass-operator/blob/14ce1a95f613104234cd3dba9154c4a428cdfeb7/tests/testdata/default-single-rack-single-node-additional-volumesources.yaml#L25

And the mounted ConfigMap:

https://github.com/k8ssandra/cass-operator/blob/14ce1a95f613104234cd3dba9154c4a428cdfeb7/tests/testdata/configs/my-metrics-config.yaml#L8

k8ssandra / management-api-for-apache-cassandra

Dubious metric names in Prometheus #512