micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.49k stars 993 forks source link

Static descriptions for Kafka consumer metrics #4952

Closed pirgeo closed 6 months ago

pirgeo commented 7 months ago

Please describe the feature request. The Kafka instrumentation produces dynamic descriptions, meaning that descriptions are not always the same for one metric key. Here is a list of some descriptions that I have come across in the past:

metric name description
kafka.consumer.fetch.manager.bytes.consumed.total The total number of bytes consumed
kafka.consumer.fetch.manager.bytes.consumed.total The total number of bytes consumed for a topic
kafka.consumer.fetch.manager.records.consumed.rate The average number of records consumed per second
kafka.consumer.fetch.manager.records.consumed.rate The average number of records consumed per second for a topic
kafka.consumer.fetch.manager.records.lag.max The max lag of the partition
kafka.consumer.fetch.manager.records.lag.max The maximum lag in terms of number of records for any partition in this window. NOTE: This is based on current offset and not committed offset
kafka.consumer.fetch.manager.fetch.size.max The maximum number of bytes fetched per request
kafka.consumer.fetch.manager.fetch.size.max The maximum number of bytes fetched per request for a topic
kafka.consumer.fetch.manager.records.lead.min The min lead of the partition
kafka.consumer.fetch.manager.records.lead.min The minimum lead in terms of number of records for any partition in this window

Additional context There are prior examples using a static description per metric key, see #3201 and more recently #4864. I am not entirely sure how these metrics are created, but I think they are originating from this class: https://github.com/micrometer-metrics/micrometer/blob/main/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaConsumerMetrics.java

CC @izeye, since you worked on #4864, do you think this is a reasonable change?

izeye commented 7 months ago

The KafkaConsumerMetrics has been deprecated and doesn't seem to be possible to produce "The total number of bytes consumed for a topic" as description, for example.

They seem to have been produced from the KafkaClientMetrics and have come from org.apache.kafka.common.MetricName.description().

Looking into the KafkaClientMetrics, the pairs won't exist at the same time as when a more specific meter is registered, a less specific one will be unregistered. So with this arrangement, it seems that Prometheus would be okay, but Dynatrace could be affected.

See https://kafka.apache.org/documentation/#consumer_fetch_monitoring

pirgeo commented 7 months ago

So you are saying that there should not be a situation where two meters with different descriptions are registered at the same time? I assume you are referring to checkAndBindMetrics. I've dug through that method, and I think I understand how it works:

Get metrics currently registered in Kafka, and remove the ones that are in Kafka but no longer in MM: https://github.com/micrometer-metrics/micrometer/blob/73f88795f427407e81edc17ff2fa0da8bba47dbd/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L187-L201

Split metrics stored in MM by metric name: https://github.com/micrometer-metrics/micrometer/blob/73f88795f427407e81edc17ff2fa0da8bba47dbd/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L203-L207

Then, iterate the metrics coming from Kafka and compare the number of tags on the metric in Kafka, and the metric in Micrometer. https://github.com/micrometer-metrics/micrometer/blob/73f88795f427407e81edc17ff2fa0da8bba47dbd/micrometer-core/src/main/java/io/micrometer/core/instrument/binder/kafka/KafkaMetrics.java#L209-L241

If I understand correctly, that means that it is possible to register two metrics with the same name and same number of tags, but only if the tags are not the same to the ones already there. I might be reading it wrong though. Not sure if that could be the reason for two Meters with the same name but different descriptions.

As for the Dynatrace registry: as soon as the meter is removed from the registry, it will no longer be exported. Checking for conflicting descriptions happens per-export, so I am pretty sure that at the time of exporting, there are two meters with the same name but different descriptions. I'll try to reproduce it, but I don't know when I'll get around to it. Since I am not really familiar with Kafka, I am not even sure where to start.

izeye commented 7 months ago

As for the Dynatrace registry: as soon as the meter is removed from the registry, it will no longer be exported. Checking for conflicting descriptions happens per-export, so I am pretty sure that at the time of exporting, there are two meters with the same name but different descriptions.

@pirgeo Sorry, I thought Dynatrace required the same description for the same meter during an application's lifetime.

I thought two different descriptions came from a meter without/with "topic" dimension, for example, and the latter would survive in the end. However, based on your comment, I might miss something as the two descriptions seem to exist at the same time for some reason.

pirgeo commented 7 months ago

@izeye Thank you for your response!

I thought Dynatrace required the same description for the same meter during an application's lifetime.

The discrepancy is currently checked per-export and per-metric-name. This is all on the client side, though, and the description will be checked against the one already stored for the metric name upon ingestion.

Either way, the Kafka instrumentation produces multiple descriptions for the same metric key. I guess I understand the rationale in this case, although it almost seems to me as if kafka.consumer.fetch.manager.bytes.consumed.total and kafka.consumer.fetch.manager.bytes.consumed.total with a topic dimension should be two separate metrics. This is the first time I have ever come across a situation where the meaning of a metric changes during the runtime while the name stays the same. But I assume that ship has sailed, I am sure the folks on the Kafka side have their reasons for doing it that way.

As far as I can tell, there must be two separate Meters registered with the MeterRegistry at the same time, one with the topic dimension and one without. I am assuming that based on the fact that I get two different description texts in the same export ("The total number of bytes" vs. "The total number of bytes consumed for a topic"). In the Dynatrace exporter, we collect which meter names map to which descriptions (per export), and then warn if there is two conflicting descriptions.

shakuzen commented 7 months ago

I think the intention of the logic to remove the meter with less tags is to avoid the situation you seem to be running into. There may be a bug or a concurrency issue. Kafka metrics are unique in the way they work and we rely on whatever the Kafka client is providing in terms of metrics. It'd be good to get the name and tags of the overlapping meters when this happens as a first step. Then maybe we can figure out how to reproduce it and what is happening.

pirgeo commented 7 months ago

I think the intention of the logic to remove the meter with less tags is to avoid the situation you seem to be running into.

Yeah, that's what I assume, too. Unfortunately, I don't have access to the tags, and I couldn't reproduce it, so I am left guessing, but I'll try to find out!

shakuzen commented 7 months ago

Until we figure out which metrics are causing the issue, I'm not sure we'll be able to make a change we're sure fixes things. If we do get that info, please share here.

github-actions[bot] commented 7 months ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

github-actions[bot] commented 6 months ago

Closing due to lack of requested feedback. If you would like us to look at this issue, please provide the requested information and we will re-open.

pirgeo commented 6 months ago

Thanks for considering this. I currently simply don't have the time to look into this more thoroughly, but I guess that there might be multiple registered metrics (probably for multiple registered Kafka consumers in different states) that are producing metrics with different descriptions. I'll reopen this once I have more info.

shakuzen commented 6 months ago

I guess that there might be multiple registered metrics (probably for multiple registered Kafka consumers in different states) that are producing metrics with different descriptions.

That sounds likely to be the issue. I'm not sure what we can do about it, though, since we are taking the descriptions provided by the Kafka client. To solve it on the Micrometer side, it seems like we would need to maintain some kind of mapping of known metrics that need to use a different description. That sounds quite fragile and we don't own the descriptions - the Kafka client does. They may be different in different versions of the client and that would be a nightmare to manage on our side. I'm wondering if the Kafka client would be willing to change the descriptions to avoid this issue.