influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

kafka_consumer: expose consumer group lag as internal metric #11231

Open hackery opened 2 years ago

hackery commented 2 years ago

Feature Request

Proposal:

Add the lag of the consumer group specified in [[inputs.kafka_consumer]] into the telegraf [[inputs.internal]] metrics.

Current behavior:

The input can fall behind the topic without exposing any indication that it is lagging.

Desired behavior:

When [[inputs.internal]] is enabled, the plugin adds selfstat items for the consumer group lag (other metrics might also be useful to add at this point). Sample output:

internal_kafka_consumer,instance=xxxx,consumer_group=tg-0,partition=0 current_offset=x,log_end_offset=y,lag=z 1654079199000000000
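As a rough illustration of the sample output above, here is a minimal Go sketch that derives lag from the two offsets and renders one line in that shape. The `PartitionOffsets` type and the `LineProtocol` helper are hypothetical names, not part of Telegraf or sarama. The sketch assumes lag is simply `log_end_offset - current_offset`, clamped at zero, and that tag values need no escaping. The numbers are taken from the CLI output quoted in the use case below.

```go
package main

import "fmt"

// PartitionOffsets is a hypothetical holder for the two offsets needed to
// compute lag for one partition; the field names mirror the sample output.
type PartitionOffsets struct {
	Partition     int32
	CurrentOffset int64 // offset committed by the consumer group
	LogEndOffset  int64 // offset at the end of the partition log
}

// Lag returns how far the group is behind the end of the log.
// It is clamped at zero on the assumption that the two offsets may be
// sampled at slightly different moments.
func (p PartitionOffsets) Lag() int64 {
	lag := p.LogEndOffset - p.CurrentOffset
	if lag < 0 {
		return 0
	}
	return lag
}

// LineProtocol renders one metric line in the shape proposed in this issue.
// Integer fields carry the "i" suffix used by InfluxDB line protocol.
// Tag values are written as-is, so this sketch assumes they contain no
// spaces, commas, or equals signs.
func LineProtocol(instance, group string, p PartitionOffsets, tsNanos int64) string {
	return fmt.Sprintf(
		"internal_kafka_consumer,instance=%s,consumer_group=%s,partition=%d current_offset=%di,log_end_offset=%di,lag=%di %d",
		instance, group, p.Partition,
		p.CurrentOffset, p.LogEndOffset, p.Lag(), tsNanos,
	)
}

func main() {
	p := PartitionOffsets{Partition: 0, CurrentOffset: 4562002220, LogEndOffset: 4562002452}
	fmt.Println(LineProtocol("xxxx", "tg-0", p, 1654079199000000000))
}
```

How the plugin would obtain the two offsets is left open here, since that depends on what the consumer library exposes.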

Use case:

When a Kafka consumer falls behind, it can be hard to diagnose. Kafka's own API does not expose consumer group offset metrics (they're stored in the offsets topics), so one might resort to the CLI tools, e.g.

GROUP TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID                                   HOST      CLIENT-ID
tg-0  metrics-hosepipe 0          4562002220      4562002452      232   Telegraf-41eef470-f8fc-402a-9e1f-41b50ac153ed /1.2.3.4  Telegraf
tg-0  metrics-hosepipe 1          4561999766      4561999985      219   Telegraf-41eef470-f8fc-402a-9e1f-41b50ac153ed /1.2.3.4  Telegraf 

While calls to the above could be wrapped in a script and called from Telegraf, the consumer input itself is in a better position to collect these metrics in context, apply tags etc.
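For comparison, the script-based workaround mentioned above might look roughly like the following Go sketch, which parses CLI output in the column layout quoted in this issue. `GroupLag` and `ParseConsumerGroups` are hypothetical names. The parser assumes whitespace-separated columns in exactly that order and numeric offsets in every row, so a row with a non-numeric offset would be reported as an error rather than skipped.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// GroupLag is a hypothetical record for one row of the CLI output.
type GroupLag struct {
	Group         string
	Topic         string
	Partition     int32
	CurrentOffset int64
	LogEndOffset  int64
	Lag           int64
}

// ParseConsumerGroups reads rows laid out as
// GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG ...
// and ignores the header, blank lines, and any trailing columns.
func ParseConsumerGroups(output string) ([]GroupLag, error) {
	var rows []GroupLag
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 6 || f[0] == "GROUP" {
			continue // header, blank, or short line
		}
		part, err := strconv.ParseInt(f[2], 10, 32)
		if err != nil {
			return nil, fmt.Errorf("partition %q: %w", f[2], err)
		}
		// Columns 3..5 are current offset, log end offset, and lag.
		var nums [3]int64
		for i := range nums {
			n, err := strconv.ParseInt(f[3+i], 10, 64)
			if err != nil {
				return nil, fmt.Errorf("column %d %q: %w", 3+i, f[3+i], err)
			}
			nums[i] = n
		}
		rows = append(rows, GroupLag{
			Group: f[0], Topic: f[1], Partition: int32(part),
			CurrentOffset: nums[0], LogEndOffset: nums[1], Lag: nums[2],
		})
	}
	return rows, sc.Err()
}

func main() {
	sample := `GROUP TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG   CONSUMER-ID                                   HOST      CLIENT-ID
tg-0  metrics-hosepipe 0          4562002220      4562002452      232   Telegraf-41eef470-f8fc-402a-9e1f-41b50ac153ed /1.2.3.4  Telegraf
tg-0  metrics-hosepipe 1          4561999766      4561999985      219   Telegraf-41eef470-f8fc-402a-9e1f-41b50ac153ed /1.2.3.4  Telegraf`
	rows, err := ParseConsumerGroups(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, r := range rows {
		fmt.Printf("%s %s partition=%d lag=%d\n", r.Group, r.Topic, r.Partition, r.Lag)
	}
}
```

Scraping a text table this way is brittle, which is part of why collecting the values inside the consumer input is the more attractive option.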

reimda commented 2 years ago

Hi @hackery, it sounds like exposing these kafka stats through inputs.internal would be a helpful tool to shed light on kafka behavior. Are you able to put together a PR to add this functionality?

I'm not sure these metrics are available through the Kafka consumer library Telegraf uses, https://github.com/Shopify/sarama. There is a recent feature request in that project (https://github.com/Shopify/sarama/issues/2235) to add more consumer metrics, including lag. Are you familiar enough with sarama to confirm whether it can provide the metrics you're interested in?

hackery commented 2 years ago

I would love to work on this, although yes, it may need that Sarama work completing first - I shall have a look at whether I could take that on as well.

sigurd-cp commented 1 month ago

Do you know if there is any progress on this topic?