kawamuray / prometheus-kafka-consumer-group-exporter

Prometheus exporter for Kafka consumer group information obtained through `kafka-consumer-groups.sh`.
Apache License 2.0

Very long query time with many topics and consumer groups #41

Open panda87 opened 7 years ago

panda87 commented 7 years ago

Hi

I've been using this plugin for a while, and it worked pretty well while I had a small number of consumers. Today I added many more consumers and new topics, and two things started happening:

  1. I started getting failures due to long consumer_ids (I used pznamensky's branch and this fixed that) - can you please merge it, by the way?
  2. The HTTP query time increased from 2-3 seconds to 20 seconds, even when I changed max-concurrent-group-queries to 10 - that only loaded up my CPU cores and pushed the load to 500%

Do you know why this happens?

D.

zot42 commented 7 years ago

I am seeing the same behavior. It's so bad that Prometheus is skipping it because the scrape of the /metrics endpoint is taking too long. It seems to be related to the number of partitions.

JensRantil commented 7 years ago

pznamensky's branch and this fixed that

@panda87 Would you mind sharing a link to his fix/branch so I can look into bringing it into master here? If you could submit a PR, that would be even better! 🙏

JensRantil commented 7 years ago

I am seeing the same behavior. It's so bad that Prometheus is skipping it because the scrape of the /metrics endpoint is taking too long.

@zot42 FYI, you can increase that timeout in Prometheus, though.
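
For reference, a minimal prometheus.yml sketch of raising the scrape timeout for the exporter's job; the job name, target, and values below are placeholders, not defaults of this exporter:

```yaml
scrape_configs:
  - job_name: kafka-consumer-group-exporter   # placeholder job name
    scrape_interval: 60s
    scrape_timeout: 55s                       # must not exceed scrape_interval
    static_configs:
      - targets: ['exporter-host:9208']       # placeholder host:port
```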

JensRantil commented 7 years ago

Do you know why this happens?

@panda87 Would you mind measuring how long it takes to list out your topics using kafka-consumer-groups.sh, as well as to query the lag? I've also created #47, which would help diagnose issues like yours.
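
A rough way to measure that directly, assuming the new-consumer mode of the tool; the broker address and group name are placeholders:

```sh
# How long does listing the consumer groups take?
time kafka-consumer-groups.sh --bootstrap-server broker:9092 --list

# How long does describing a single group (which reports lag) take?
time kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group my-consumer-group
```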

JensRantil commented 7 years ago

Would you mind sharing a link to his fix/branch so I can look into bringing it into master here? If you could submit a PR, that would be even better!

@panda87 Never mind. Please ignore my comment, I just saw his PR. ;)

panda87 commented 7 years ago

@JensRantil I used his PR, but I still get errors like this:

goroutine 1290463 [running]:
panic(0x7f2880, 0xc420012080)
    /usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).parseLine(0xc4200acee0, 0xc4207680b1, 0xe1, 0xc4200ef400, 0x0, 0x10)
    /go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:134 +0x4d2
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).Parse(0xc4200acee0, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x0, 0x0, 0x0, 0xa51ce0, 0xc420552550)
    /go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:74 +0x189
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*DelegatingParser).Parse(0xc4200fb120, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x3, 0xc420768000, 0xed1, 0xc420126e80, 0x77)
    /go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:211 +0x8f
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*ConsumerGroupsCommandClient).DescribeGroup(0xc4200e8c60, 0xa58460, 0xc42011d080, 0xc4200777a0, 0xc, 0x1, 0x13bb, 0x0, 0xc420502757, 0xc)
    /go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/collector.go:79 +0x105
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop.func1(0xc420011780, 0xc4200777a0, 0xc, 0xc42011d0e0, 0xa58460, 0xc42011d080, 0xc42011d140)
    /go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:159 +0x71
created by github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop
    /go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:164 +0x178
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. line: counter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:110"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for lag. line: %scounter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:118"
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. Line: counter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:125"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for current offset. Line: %scounter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:130"
panic: runtime error: index out of range

JensRantil commented 7 years ago

@panda87 That looks like a different issue than this. Please open a new issue (and specify which version/commit you are running).

panda87 commented 7 years ago

Ok, I will create a new issue.

panda87 commented 7 years ago

It seems that now, after pulling the latest changes from this repo, I don't get the errors above, so thanks! Now it's only the response time, which is still high.

JensRantil commented 7 years ago

@panda87 Good! I know I saw that error when I was recently revamping some of the parsing logic.

cl0udgeek commented 6 years ago

I noticed the response time being high too... I wonder if it's actually Kafka that's taking a long time to run, rather than Prometheus...

JensRantil commented 6 years ago

I noticed the response time being high too... I wonder if it's actually Kafka that's taking a long time to run, rather than Prometheus...

I'm pretty sure it is. #47 will help us tell whether that's the case.

cl0udgeek commented 6 years ago

Any update on this one?

JensRantil commented 6 years ago

Unfortunately not. Pull requests to fix #47 are welcome. I've been pretty busy lately and haven't had time to get back to this 😥

cl0udgeek commented 6 years ago

Any update on this?

JensRantil commented 6 years ago

Unfortunately not.

JensRantil commented 6 years ago

Might be worth mentioning that a colleague of mine claimed lag is now exposed through JMX. A workaround might be to have a look at using jmx_exporter instead of this exporter.

cl0udgeek commented 6 years ago

Wait, what? That'd be awesome if it is! Do you know which Kafka version? To clarify... it's always been there on the consumer side, but not on the server side, as far as I know.

raindev commented 6 years ago

@k1ng87,

If you're interested in consumer lag, it's published via JMX by the consumer:

Old consumer: kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)

New consumer: kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max

Replication lag is published by the broker:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)

See the official Kafka documentation for more details: https://kafka.apache.org/documentation/#monitoring. I only checked version 1.0, the latest one as of now. Hope this helps.
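
If anyone tries that workaround, here is a minimal jmx_exporter rule sketch for the new-consumer metric quoted above; the metric name and label mapping are assumptions, and the agent has to run inside the consumer JVM since the metric is client-side:

```yaml
# Sketch: map the new consumer's records-lag-max MBean attribute to a
# Prometheus gauge. Adjust the pattern if your MBean names differ.
rules:
  - pattern: 'kafka.consumer<type=consumer-fetch-manager-metrics, client-id=(.+)><>records-lag-max'
    name: kafka_consumer_records_lag_max   # assumed metric name
    type: GAUGE
    labels:
      client_id: "$1"
```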