redpanda-data / kminion

KMinion is a feature-rich Prometheus exporter for Apache Kafka written in Go. It is lightweight and highly configurable so that it can meet your requirements.

Any way of getting estimated consumer lag in seconds in promql? #182

Open sebw91 opened 1 year ago

sebw91 commented 1 year ago

KMinion works great, thank you.

Does anyone have a way of computing an estimated consumer time lag in PromQL?

I think we'd have to somehow join two series: kminion_kafka_consumer_group_topic_offset_sum and kminion_kafka_topic_high_water_mark_sum.

Conceptually, the query would be something along the lines of: time() - time_at_value(kminion_kafka_topic_high_water_mark_sum, kminion_kafka_consumer_group_topic_offset_sum + 1)

Here time_at_value would be a hypothetical function that returns the timestamp at which a series reached a given value. Nothing like that exists in Prometheus.

weeco commented 1 year ago

Hey @sebw91, yes, an approximate time lag is possible and I fully support the idea. The lag should really be exported as a time duration, because that is what users actually want to know.

I have thought about how to solve this in the past and had a few different ideas. There is one exporter that uses interpolation; see https://github.com/seglo/kafka-lag-exporter for more information. It's a bigger effort to implement, and I currently don't plan to spend that amount of time on KMinion. If you are interested in trying this, I'd suggest coming up with a proposal that we can discuss here before starting the implementation. It's not trivial to implement in a way that scales to larger clusters, though, and that would be a requirement for KMinion.

sebw91 commented 1 year ago

Thanks a lot for the info. I was hoping it would be possible to do something in PromQL. From what I can see, all the data we need for a very rough estimate is already there. I would be fine without interpolation; just a lower bound on when the topic's high water mark was at the consumer group's current offset would be enough to get a timestamp. This may not be possible, though.

weeco commented 1 year ago

Oh, I see what you mean. You are saying that the information about when a certain high water mark was reached in a partition is already stored in Prometheus (at least up to the retention period), so the interpolation logic could somehow be put into the PromQL itself.

That's indeed a good idea! I'm not sure whether it's possible with the available PromQL functions, but it's definitely worth a try!

sebw91 commented 1 year ago

I think there is a way (kind of), using the PromQL offset modifier! If the high water mark (hwm) of a topic 5 minutes ago is greater than the consumer group's current offset_sum, then we know we are at least 5 minutes behind. For example:

kminion_kafka_topic_high_water_mark_sum offset 5m > on (topic_name) kminion_kafka_consumer_group_topic_offset_sum + 1
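
One thing to watch out for: if more than one consumer group consumes the topic, the match above becomes many-to-one, and PromQL then requires an explicit group_right. Building on that, here is a sketch of how the idea could be stacked into coarse "at least N minutes behind" buckets; the 5m/30m windows and the bucketing itself are illustrative assumptions, not anything kminion ships:

# Value is a lower bound on the estimated time lag, in seconds.
# group_right makes the many-groups-per-topic match explicit.
(
    (kminion_kafka_topic_high_water_mark_sum offset 30m
      > on (topic_name) group_right
     kminion_kafka_consumer_group_topic_offset_sum) * 0 + 1800
or
    (kminion_kafka_topic_high_water_mark_sum offset 5m
      > on (topic_name) group_right
     kminion_kafka_consumer_group_topic_offset_sum) * 0 + 300
)

Since or prefers series from its left-hand side, the largest matching window wins, so each group/topic pair reports the biggest "at least this far behind" bucket.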

I will continue exploring on my side, but this should do the trick for us.

hhromic commented 1 year ago

This is an interesting request/subject. As mentioned already, Kafka itself has no notion of consumer lag in time units (seconds), probably because it depends on how fast a consumer can, and actually does, consume a given partition; more generally, on the consumer's current or expected consumption throughput.

For this reason, we approximate the consumer lag (all-partitions mode) in seconds by dividing the message lag by the topic's production rate, used here as a proxy for the consumer rate, like this:

sum(kminion_kafka_consumer_group_topic_lag{job=~"$job",group_id=~"$group_id"})
  by (group_id,topic_name) / on (topic_name)
  group_left sum(rate(kminion_kafka_topic_high_water_mark_sum{job=~"$job"}[$__rate_interval]))
    by (topic_name)

This is used in Grafana, hence the use of $__rate_interval, which can be replaced by a static rate() range. As a sanity check on the arithmetic: a lag of 12,000 messages over a production rate of 200 messages per second gives an estimated time lag of 60 seconds.
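
For example, outside Grafana the same expression could use a static 5m range and be thresholded to flag groups estimated to be more than five minutes behind; the 5m window and the 300-second threshold are arbitrary illustrative choices:

# Estimated time lag in seconds: message lag divided by the topic's
# production rate, filtered to groups estimated > 5 minutes behind.
sum(kminion_kafka_consumer_group_topic_lag) by (group_id, topic_name)
  / on (topic_name) group_left
  sum(rate(kminion_kafka_topic_high_water_mark_sum[5m])) by (topic_name)
> 300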

Hope this is useful somehow :)

sebw91 commented 1 year ago

@hhromic That's a clever prom query - very useful. Thanks very much. I think this is accurate enough for my use case.