panda87 opened this issue
I am seeing the same behavior; it's so bad that Prometheus is skipping the scrape because the call to the /metrics endpoint is taking too long. It seems to be related to the number of partitions.
I used pznamensky's branch and this fixed that.
@panda87 Would you mind sharing a link to his fix/branch so I can look into bringing it into master here? If you could submit a PR, that would be even better! 🙏
> I am seeing the same behavior; it's so bad that Prometheus is skipping the scrape because the call to the /metrics endpoint is taking too long.
@zot42 FYI, you can increase that timeout in Prometheus, though.
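Not from the thread, but a quick way to see how much headroom is needed: time the scrape by hand, then set that job's scrape_timeout in prometheus.yml above the measured value (it has to stay below the job's scrape_interval). The port below is a placeholder for whatever address the exporter listens on.

```sh
# Measure how long a single scrape of the exporter actually takes.
# 9208 is a placeholder port -- substitute your exporter's listen address.
time curl -s http://localhost:9208/metrics > /dev/null
```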
> Do you know why this happens?
@panda87 Would you mind measuring how long it takes to list your topics using kafka-consumer-groups.sh, as well as querying the lag? I've also created #47, which would help diagnose issues like yours.
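For reference, a minimal sketch of the measurement being asked for, with a placeholder broker address and group name. The exporter appears to shell out to this tool and parse its output, so slow runs here translate directly into slow /metrics responses.

```sh
# Placeholder broker address and group name -- adjust to your cluster.
time kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
time kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
```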
> Would you mind sharing a link to his fix/branch so I can look into bringing it into master here? If you could submit a PR, that would be even better!
@panda87 Never mind. Please ignore my comment, I just saw his PR. ;)
@JensRantil I used his PR, but I still get errors like this:
goroutine 1290463 [running]:
panic(0x7f2880, 0xc420012080)
/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).parseLine(0xc4200acee0, 0xc4207680b1, 0xe1, 0xc4200ef400, 0x0, 0x10)
/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:134 +0x4d2
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).Parse(0xc4200acee0, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x0, 0x0, 0x0, 0xa51ce0, 0xc420552550)
/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:74 +0x189
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*DelegatingParser).Parse(0xc4200fb120, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x3, 0xc420768000, 0xed1, 0xc420126e80, 0x77)
/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:211 +0x8f
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*ConsumerGroupsCommandClient).DescribeGroup(0xc4200e8c60, 0xa58460, 0xc42011d080, 0xc4200777a0, 0xc, 0x1, 0x13bb, 0x0, 0xc420502757, 0xc)
/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/collector.go:79 +0x105
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop.func1(0xc420011780, 0xc4200777a0, 0xc, 0xc42011d0e0, 0xa58460, 0xc42011d080, 0xc42011d140)
/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:159 +0x71
created by github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop
/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:164 +0x178
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. line: counter_raw_events 0 1 1 0 event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56 event-stream-StreamThread-2-consumer" source="parsing.go:110"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for lag. line: %scounter_raw_events 0 1 1 0 event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56 event-stream-StreamThread-2-consumer" source="parsing.go:118"
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. Line: counter_raw_events 0 1 1 0 event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56 event-stream-StreamThread-2-consumer" source="parsing.go:125"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for current offset. Line: %scounter_raw_events 0 1 1 0 event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56 event-stream-StreamThread-2-consumer" source="parsing.go:130"
panic: runtime error: index out of range
@panda87 That looks like a different issue than this. Please open a new issue (and specify which version/commit you are running).
OK, I will create a new issue.
It seems that after I pulled the repo with the latest changes I no longer get the errors above, so thanks! Now it's only the response time, which is still high.
@panda87 Good! I know I saw that error when I was recently revamping some of the parsing logic.
I noticed the response time being high too... I wonder if it's actually Kafka that's taking a long time rather than Prometheus...
> I noticed the response time being high too... I wonder if it's actually Kafka that's taking a long time rather than Prometheus...
I'm pretty sure it is. #47 will help us tell whether that's the case.
Any update on this one?
Unfortunately not. Pull requests to fix #47 are welcome. I've been pretty busy lately and haven't had time to get back to this 😥
Any update on this?
Unfortunately not.
Might be worth mentioning that I had a colleague who claimed lag is now exposed through JMX. A workaround might be to have a look at using jmx_exporter instead of this.
Wait, what? That'd be awesome if it is! Do you know which Kafka version? To clarify... it's always been there on the consumer side but not on the server side, as far as I know.
@k1ng87,
If you're interested in consumer lag, it's published via JMX by the consumer:
Old consumer: kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)
New consumer: kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max
Replication lag is published by the broker:
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
See the official Kafka documentation for more details: https://kafka.apache.org/documentation/#monitoring. I checked only version 1.0, the latest one as of now. Hope this helps.
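For anyone trying the jmx_exporter route, a hedged sketch of the usual setup: attach the Prometheus JMX javaagent to the JVM that publishes the beans above. The jar path, port, and rules file below are placeholders; the example shows the broker-side case, where KAFKA_OPTS is picked up by the Kafka start script. For your own consumer application, pass the same -javaagent flag to that JVM instead.

```sh
# Placeholder paths/port: the agent exposes the JVM's MBeans (including the lag
# beans listed above) on an HTTP port that Prometheus can scrape directly.
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter.yml"
bin/kafka-server-start.sh config/server.properties
```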
Hi,
I've been using this plugin for a while, and it worked pretty well while I had a small number of consumers. Today I added many more consumers and new topics, and two things started to happen: the /metrics responses became very slow, and increasing max-concurrent-group-queries to 10 just loaded my CPU cores and pushed the load up to about 500%. Do you know why this happens?
D.
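A note on that flag, with a heavily hedged illustration: the binary name and exact flag syntax below are assumptions, and only the option name comes from the comment above. Each in-flight group query appears to fork a kafka-consumer-groups.sh JVM (see ConsumerGroupsCommandClient in the stack trace earlier), so allowing 10 at once can easily pin several cores; stepping the value back down trades CPU for slower scrapes.

```sh
# Assumed invocation -- binary name and flag spelling are not verified against
# the repo; only max-concurrent-group-queries itself comes from the report above.
prometheus-kafka-consumer-group-exporter --max-concurrent-group-queries=2
```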