redpanda-data / kminion

KMinion is a feature-rich Prometheus exporter for Apache Kafka written in Go. It is lightweight and highly configurable so that it will meet your requirements.
MIT License

Feature request: metric for time of last message produced in a topic #187

Open hhromic opened 1 year ago

hhromic commented 1 year ago

Confluent's Control Center (since version 6.2.0) implemented Improved Topic Inspection via Last-Produced Timestamp.

From: https://www.confluent.io/blog/better-kafka-management-with-improved-topic-inspection-in-confluent-part-3/#how-it-works

> The “Topics” overview page gives a summary—health, throughput—of all the topics for a cluster. Confluent Platform 6.2.0 introduces a new column, “Last produced,” which reports the timestamp of the latest message produced to each topic. Using this report, you can easily compare and identify the topics that have not been produced in a long time.

This feature is quite useful, and we think something similar would be a good fit for kminion: a gauge metric per topic partition that reports the time of the last message produced. For example:

# HELP kminion_kafka_topic_partition_last_produced_seconds Timestamp (seconds since Unix Epoch) of the last message produced for a given partition in a topic
# TYPE kminion_kafka_topic_partition_last_produced_seconds gauge
kminion_kafka_topic_partition_last_produced_seconds{partition_id="0",topic_name="__consumer_offsets"} 1675792649
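To make the proposed sample concrete, here is a minimal Go sketch that renders it in the Prometheus text exposition format using only the standard library. The function names `renderHeader` and `renderLastProduced` are hypothetical and not part of kminion, and a real exporter would more likely register a gauge through a Prometheus client library. The sketch assumes the input is a Kafka record timestamp in milliseconds since the Unix epoch.

```go
package main

import "fmt"

// metricName is the name proposed in this issue; it is not an existing kminion metric.
const metricName = "kminion_kafka_topic_partition_last_produced_seconds"

// renderHeader returns the HELP and TYPE lines for the proposed gauge.
func renderHeader() string {
	return "# HELP " + metricName + " Timestamp (seconds since Unix Epoch) of the last message produced for a given partition in a topic\n" +
		"# TYPE " + metricName + " gauge\n"
}

// renderLastProduced formats one sample line. Kafka record timestamps are
// expressed in milliseconds, so they are converted to whole seconds here.
// Kafka topic names only allow ASCII letters, digits, '.', '_' and '-',
// so the label value needs no escaping.
func renderLastProduced(topic string, partition int32, timestampMs int64) string {
	return fmt.Sprintf("%s{partition_id=\"%d\",topic_name=\"%s\"} %d",
		metricName, partition, topic, timestampMs/1000)
}

func main() {
	fmt.Print(renderHeader())
	fmt.Println(renderLastProduced("__consumer_offsets", 0, 1675792649000))
}
```

Truncating to whole seconds matches the integer value in the example; emitting fractional seconds would also be valid in the exposition format.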

The linked blog post from Confluent describes how to obtain this metric using an internal consumer, which sounds quite feasible to implement in kminion with its own internal consumer as well. However, because this metric requires constantly consuming messages from many topics and partitions, there are performance considerations to keep in mind.
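As a rough illustration of the bookkeeping side of that approach, the sketch below keeps the newest record timestamp seen per topic partition. It deliberately leaves out the consumer itself: how records are fetched is not shown, and the `lastProducedTracker` type and its methods are hypothetical names, not kminion code. Timestamps are assumed to be Kafka record timestamps in Unix milliseconds.

```go
package main

import (
	"fmt"
	"sync"
)

// topicPartition identifies one partition of one topic.
type topicPartition struct {
	Topic     string
	Partition int32
}

// lastProducedTracker stores the highest record timestamp (Unix milliseconds)
// observed for each topic partition. It is safe for concurrent use.
type lastProducedTracker struct {
	mu     sync.RWMutex
	latest map[topicPartition]int64
}

func newLastProducedTracker() *lastProducedTracker {
	return &lastProducedTracker{latest: make(map[topicPartition]int64)}
}

// Observe records timestampMs if it is newer than what is stored. Taking the
// maximum matters because producer-assigned timestamps are not guaranteed to
// increase with the offset.
func (t *lastProducedTracker) Observe(topic string, partition int32, timestampMs int64) {
	key := topicPartition{Topic: topic, Partition: partition}
	t.mu.Lock()
	defer t.mu.Unlock()
	if cur, ok := t.latest[key]; !ok || timestampMs > cur {
		t.latest[key] = timestampMs
	}
}

// LastProduced returns the newest timestamp seen for the partition, or false
// if no record has been observed yet.
func (t *lastProducedTracker) LastProduced(topic string, partition int32) (int64, bool) {
	key := topicPartition{Topic: topic, Partition: partition}
	t.mu.RLock()
	defer t.mu.RUnlock()
	ts, ok := t.latest[key]
	return ts, ok
}

func main() {
	tracker := newLastProducedTracker()
	// In a real exporter these calls would come from the consumer's poll loop.
	tracker.Observe("orders", 0, 1675792649000)
	tracker.Observe("orders", 0, 1675792640000) // older record, ignored
	if ts, ok := tracker.LastProduced("orders", 0); ok {
		fmt.Println("orders/0 last produced at (ms):", ts)
	}
}
```

A gauge collector could then read from the tracker at scrape time, so the cost of a scrape stays independent of the consumer's throughput.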

Hope you are interested! If you are, maybe I can dedicate some time to put together a POC.

TheMeier commented 1 year ago

I guess the ending of the metric should be _timestamp_seconds: https://prometheus.io/docs/practices/naming/

I read the docs you linked. The process you describe is quite involved and I wonder how much overhead that produces.

> Because KafkaConsumer::poll() could timeout in the case of dormant topics, obtaining the last-produced timestamp is a relatively expensive API call

In any case I would secure such a feature with a feature toggle and maybe a topic black/whitelist.
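For illustration, a toggle combined with topic filtering could look roughly like the Go sketch below. All names here (`lastProducedConfig`, `Enabled`, `AllowedTopics`, `IgnoredTopics`, `shouldTrack`) are hypothetical and do not describe kminion's actual configuration; the sketch assumes filters are given as regular expressions and that the ignore list takes precedence.

```go
package main

import (
	"fmt"
	"regexp"
)

// lastProducedConfig is a hypothetical configuration for the proposed feature.
type lastProducedConfig struct {
	Enabled       bool             // feature toggle, off by default
	AllowedTopics []*regexp.Regexp // empty means "allow every topic"
	IgnoredTopics []*regexp.Regexp // checked first, always wins
}

// newLastProducedConfig compiles the given patterns into a config.
func newLastProducedConfig(enabled bool, allowed, ignored []string) (lastProducedConfig, error) {
	cfg := lastProducedConfig{Enabled: enabled}
	for _, p := range allowed {
		re, err := regexp.Compile(p)
		if err != nil {
			return cfg, fmt.Errorf("invalid allowed pattern %q: %w", p, err)
		}
		cfg.AllowedTopics = append(cfg.AllowedTopics, re)
	}
	for _, p := range ignored {
		re, err := regexp.Compile(p)
		if err != nil {
			return cfg, fmt.Errorf("invalid ignored pattern %q: %w", p, err)
		}
		cfg.IgnoredTopics = append(cfg.IgnoredTopics, re)
	}
	return cfg, nil
}

// shouldTrack reports whether the last-produced timestamp should be collected
// for the given topic.
func (c lastProducedConfig) shouldTrack(topic string) bool {
	if !c.Enabled {
		return false
	}
	for _, re := range c.IgnoredTopics {
		if re.MatchString(topic) {
			return false
		}
	}
	if len(c.AllowedTopics) == 0 {
		return true
	}
	for _, re := range c.AllowedTopics {
		if re.MatchString(topic) {
			return true
		}
	}
	return false
}

func main() {
	cfg, err := newLastProducedConfig(true, nil, []string{`^__`})
	if err != nil {
		panic(err)
	}
	for _, topic := range []string{"orders", "__consumer_offsets"} {
		fmt.Printf("%s tracked: %v\n", topic, cfg.shouldTrack(topic))
	}
}
```

Keeping the feature disabled unless explicitly enabled means existing deployments would not pay the extra consumption cost after an upgrade.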

hhromic commented 1 year ago

> I guess the ending of the metric should be _timestamp_seconds https://prometheus.io/docs/practices/naming/

Ah yes! I often follow that best-practices document, but I forgot that there is a specific case for timestamps. 👍

> I read the docs you linked. The process you describe is quite involved and I wonder how much overhead that produces.

Yes, I have been thinking about it, and it is indeed more complex than it looks. I also agree that users of this feature would probably want to enable topic filtering.

> In any case I would secure such a feature with a feature toggle and maybe a topic black/whitelist.

Yes, agreed.

Currently I'm a bit short on time to attempt a POC implementation, so if anybody else feels like giving it a try, please go ahead. Otherwise I will try to get more familiar with kminion's codebase and see what I can POC.