jirwin / burrow_exporter

Prometheus exporter for burrow
Apache License 2.0

Export partition status #11

Closed ercliou-zz closed 6 years ago

ercliou-zz commented 6 years ago

I'd like to export the status of each partition too. We could always write some logic on the Prometheus side, but Burrow already does this well: https://github.com/linkedin/Burrow/wiki/http-request-consumer-group-status The valid status strings are: NOTFOUND, OK, WARN, ERR, STOP, STALL

Edit: We should model them as separate time series:

kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="OK"} 1
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STOP"} 1

something like https://www.robustperception.io/exposing-the-software-version-to-prometheus/

kanga333 commented 6 years ago

@ercliou Hello. I also want this metric. You seem to have made some changes after forking; are you planning to send a patch upstream?

ercliou-zz commented 6 years ago

I ended up implementing this by sending every state series on each scrape. When a state is not the current one, its series is sent with the value 0. This multiplies the number of series by five per partition (which could be a problem if you have a lot of them), e.g.

kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="OK"} 1
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STOP"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="REWIND"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="STALL"} 0
kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC",state="WARN"} 0

This way each of them stays an independent time series. The reason is that it lets me query lag and status together, by partition, in Grafana. Query:

kafka_burrow_partition_lag{group="MY_GROUP",topic="MY_TOPIC"}
* on (topic, partition, group) group_left(state)
(kafka_burrow_partition_state{group="MY_GROUP",topic="MY_TOPIC"} == 1)
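
For readers following along, the one-series-per-state layout described in this comment can be sketched roughly as below. This is only an illustration, not the code from the fork: the function name `stateSamples` and the `burrowStates` list are hypothetical, and the five states are simply the ones shown in the example output above.

```go
package main

import "fmt"

// burrowStates is a hypothetical list of the states exported as separate
// series. It mirrors the five states in the example above, not necessarily
// every status Burrow can report.
var burrowStates = []string{"OK", "STOP", "REWIND", "STALL", "WARN"}

// stateSamples returns one sample per known state: 1 for the state the
// partition is currently in, 0 for every other state.
func stateSamples(current string) map[string]float64 {
	samples := make(map[string]float64, len(burrowStates))
	for _, s := range burrowStates {
		if s == current {
			samples[s] = 1
		} else {
			samples[s] = 0
		}
	}
	return samples
}

func main() {
	// Print the samples in exposition-style lines for a single partition.
	samples := stateSamples("OK")
	for _, s := range burrowStates {
		fmt.Printf("kafka_burrow_partition_state{cluster=%q,group=%q,partition=%q,topic=%q,state=%q} %g\n",
			"MY_CLUSTER", "MY_GROUP", "13", "MY_TOPIC", s, samples[s])
	}
}
```

Because every state is always present with a 0 or 1, filtering with `== 1` leaves exactly one series per partition to join against the lag metric.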

I could send a patch if @jirwin agrees with this :)

jirwin commented 6 years ago

I'm +1 to this. Partition count isn't generally unbounded. Maybe it could be enabled by a command line flag, so people can use their own judgement as to whether the surge in new time series is acceptable to them. Maybe --per-partition-stats or something?
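
A minimal sketch of such an opt-in flag, using only Go's standard `flag` package. The flag name is the one floated above; the helper `parsePerPartitionStats` is hypothetical, and the exporter's real CLI wiring may use a different library and look quite different.

```go
package main

import (
	"flag"
	"fmt"
)

// parsePerPartitionStats is a hypothetical helper: it parses the given
// arguments and reports whether per-partition state series were requested.
// The flag defaults to false, so the extra series are strictly opt-in.
func parsePerPartitionStats(args []string) (bool, error) {
	fs := flag.NewFlagSet("burrow-exporter", flag.ContinueOnError)
	enabled := fs.Bool("per-partition-stats", false,
		"export one state series per partition (increases series count)")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	enabled, err := parsePerPartitionStats([]string{"--per-partition-stats"})
	if err != nil {
		fmt.Println("flag error:", err)
		return
	}
	fmt.Println("per-partition stats enabled:", enabled)
}
```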

shibug commented 6 years ago

How about we define a numeric scheme for the value of this time series? This would save us from the fivefold time series bloat. Our system has 2525 partitions across 52 topics, so I am definitely worried about the bloat.

NOTFOUND = 1
OK = 2
WARN = 3
ERR = 4
STOP = 5
STALL = 6

kafka_burrow_partition_state{cluster="MY_CLUSTER",group="MY_GROUP",partition="13",topic="MY_TOPIC"} 2
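
As a rough sketch of this numeric scheme, the mapping could be a simple lookup table. The names `statusCodes` and `statusValue` are hypothetical, and how an unknown status should be reported is an open choice that this sketch leaves to the caller.

```go
package main

import "fmt"

// statusCodes follows the numbering proposed in this comment.
var statusCodes = map[string]float64{
	"NOTFOUND": 1,
	"OK":       2,
	"WARN":     3,
	"ERR":      4,
	"STOP":     5,
	"STALL":    6,
}

// statusValue returns the numeric code for a Burrow status string.
// The second result is false when the status is not in the table
// (for example REWIND, which is not part of the proposed list).
func statusValue(status string) (float64, bool) {
	v, ok := statusCodes[status]
	return v, ok
}

func main() {
	if v, ok := statusValue("OK"); ok {
		fmt.Printf("kafka_burrow_partition_state{cluster=%q,group=%q,partition=%q,topic=%q} %g\n",
			"MY_CLUSTER", "MY_GROUP", "13", "MY_TOPIC", v)
	}
}
```

The trade-off is one series per partition instead of five, at the cost of having to remember the code table when writing queries or Grafana value mappings.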

ercliou-zz commented 6 years ago

Hi @shibug, I explained a little about the reasoning behind it in the above PR (centered mostly around Grafana).

We have 15k partitions and haven't encountered performance problems (yet). I can't look into the command line flag right now; if someone would like to look into it, I'd appreciate it.

jirwin commented 6 years ago

Fixed by #19.