linkedin / Burrow

Kafka Consumer Lag Checking
Apache License 2.0

Feature request - Prometheus metrics support #318

Open varun06 opened 6 years ago

varun06 commented 6 years ago

Now that #149 is closed, can we decide on the right approach and add Prometheus metrics support to the project?

dmitryilyin commented 6 years ago

There is an ongoing effort to build a telegraf input plugin for the Burrow API here: https://github.com/influxdata/telegraf/pull/3489

It can then be used to send metrics to Graphite, InfluxDB, Prometheus and others.

varun06 commented 6 years ago

@dmitryilyin does that mean "use telegraf and burrow both", or can that effort be used to add Prometheus support to burrow itself?

solsson commented 6 years ago

@dmitryilyin What advantages do you see in exporting via Telegraf?

dmitryilyin commented 6 years ago

Yes, it means using them both. Adding Prometheus format metrics to burrow is indeed useful, but other people will (and already do) want Graphite output, others are writing InfluxDB connector, and there are many more monitoring systems around.

On the other hand, telegraf works as a swiss army knife. It has a lot of input plugins https://github.com/influxdata/telegraf/tree/master/plugins/inputs and can be easily extended by exec reporter scripts, so it can gather and receive metrics from a lot of things, including gathering system metrics much better than Prometheus' node_exporter (which you should be using, right?). It can output metrics to a lot of things too, including Prometheus and Graphite https://github.com/influxdata/telegraf/tree/master/plugins/outputs. Although different metric formats and styles can complicate things.

The Prometheus style is to have a lot of different exporters and/or integrate metrics gathering into applications, plus a push gateway for scripts. Which approach is better? Who knows.

If you have only Prometheus and are not going to integrate with anything else, then perhaps you don't need telegraf at all and can use burrow_exporter or integrate metrics into burrow itself; if you do need to talk to many other systems, telegraf may be the better fit.

Anyway, adding Prometheus metrics directly to burrow will be helpful. It would also allow telegraf's Prometheus input to scrape Burrow directly instead of going through burrow's API. Whether that is better remains to be seen.
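For reference, the Telegraf pipeline described above could look roughly like the following config fragment. This is a sketch only: it assumes the `inputs.burrow` plugin landed as proposed in the linked PR, and field names may differ between Telegraf versions.

```toml
# Poll Burrow's HTTP API for consumer lag data
[[inputs.burrow]]
  servers = ["http://localhost:8000"]

# Expose everything Telegraf gathered on a Prometheus scrape endpoint
[[outputs.prometheus_client]]
  listen = ":9273"
```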

varun06 commented 6 years ago

That makes sense, but yeah, adding Prometheus support to burrow is going to be helpful too.

solsson commented 6 years ago

I think it should be noted that exporting to Prometheus doesn't come with the usual complexities of maintaining an integration. It's an HTTP endpoint, nothing else. Very much like the GET endpoints in the /v3 API, but with plaintext instead of JSON.

It'd be great if the discussion for how to map the current responses to Prometheus labels took place in this repo. It affects how useful the exported metrics are for consumer lag monitoring.

> If you have only Prometheus and are not going to integrate with anything else, then perhaps you don't need telegraf at all and can use burrow_exporter or integrate metrics into burrow itself

Using burrow_exporter is ok, though it adds a delay (unless its polling is perfectly synced with Prometheus pull) and some overhead. It too needs a discussion on mapping to labels. Is anyone interested in helping out with https://github.com/jirwin/burrow_exporter/pull/9, i.e. support for the current API version?

solsson commented 6 years ago

This is a sample metric I get out of burrow_exporter after my v3 search-and-replace:

# HELP kafka_burrow_topic_partition_offset The latest offset on a topic's partition as reported by burrow.
# TYPE kafka_burrow_topic_partition_offset gauge
kafka_burrow_topic_partition_offset{cluster="local",partition="12",topic="__consumer_offsets"} 2428

I think these labels make sense.
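For discussion, rendering that label scheme takes only a few lines of Go. This is a sketch: the `PartitionOffset` struct loosely mirrors what the v3 API returns per partition, and its name and fields are illustrative rather than Burrow's own:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// PartitionOffset loosely mirrors the per-partition fields in Burrow's v3 API
// responses; the struct name and fields are illustrative, not Burrow's.
type PartitionOffset struct {
	Cluster   string
	Topic     string
	Partition int32
	Offset    int64
}

// renderGauge turns one partition offset into a Prometheus exposition line,
// using the cluster/partition/topic label scheme from the sample above.
func renderGauge(name string, p PartitionOffset) string {
	labels := map[string]string{
		"cluster":   p.Cluster,
		"partition": fmt.Sprintf("%d", p.Partition),
		"topic":     p.Topic,
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // emit label names in sorted order, as burrow_exporter does
	pairs := make([]string, len(keys))
	for i, k := range keys {
		pairs[i] = fmt.Sprintf("%s=%q", k, labels[k])
	}
	return fmt.Sprintf("%s{%s} %d", name, strings.Join(pairs, ","), p.Offset)
}

func main() {
	p := PartitionOffset{Cluster: "local", Topic: "__consumer_offsets", Partition: 12, Offset: 2428}
	fmt.Println(renderGauge("kafka_burrow_topic_partition_offset", p))
	// → kafka_burrow_topic_partition_offset{cluster="local",partition="12",topic="__consumer_offsets"} 2428
}
```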

I had a quick look at the source to try to get the lag export working, but instead of spending time on the structs there... Could anyone hint at how to get hold of these data structures https://github.com/linkedin/Burrow/wiki/Templates#data-in-templates inside Burrow, whenever they change?

solsson commented 6 years ago

An argument for an external exporter might be that it can do actual integrations without adding to Burrow complexity. For example it could look up owner IPs from partition info in the Kubernetes API, to tag metrics with an optional owner_pod_name.

I think the exporter is ok with v3 since https://github.com/jirwin/burrow_exporter/pull/9#issuecomment-358938429. See sample export there. I think the labels are good, and they'll be forward compatible even if more labels are added later.

Xaelias commented 5 years ago

One of the big drawbacks of an external integration like the burrow exporter linked here is that it has its own scrape interval, on top of the Prometheus scrape interval. Like mentioned above, Prometheus metrics are just a plaintext representation of what burrow has, so having that inside burrow shouldn't add a whole lot of complexity. I would also rather not have to rely on 2/3/... projects just to track kafka lags :-D

Xaelias commented 5 years ago

Oh, also: the burrow exporter is actually buggy. It looks like the maintainer is not responsive (though hopefully they'll respond later), and I just don't have the Go expertise to fix the net/http code myself so...

shamil commented 5 years ago

> One of the big drawbacks of an external integration like the burrow exporter linked here is that it has its own scrape interval, on top of the Prometheus scrape interval.

This is fixed in my fork, which is mostly a full refactor (except the burrow client). I'm now using a custom collector implementation, which means the scrape happens on demand whenever the /metrics endpoint is scraped by Prometheus: https://github.com/shamil/burrow_exporter