redpanda-data / kminion

KMinion is a feature-rich Prometheus exporter for Apache Kafka written in Go. It is lightweight and highly configurable so that it will meet your requirements.
MIT License

KMinion v2.2.1 is segfaulting when serving `/metrics` #177

Closed: hhromic closed this issue 1 year ago

hhromic commented 1 year ago

First of all, thank you very much for KMinion. It is truly a useful diagnostics tool!

I just noticed that a new version of KMinion (v2.2.1) was released and went to test it in order to migrate our existing v2.2.0 deployment. Unfortunately, the new version of KMinion segfaults whenever a client requests its /metrics endpoint. I tested this with two different Confluent Kafka clusters running version 6.2.7 and got the same result.

In one terminal, KMinion is started like this:

$ docker run --rm -p 8080:8080 -e KAFKA_BROKERS=kafka:9092 vectorized/kminion:v2.2.1
{"level":"info","ts":1670591408.6039164,"caller":"app/config.go:62","msg":"the env variable 'CONFIG_FILEPATH' is not set, therefore no YAML config will be loaded"}
{"level":"info","ts":"2022-12-09T13:10:08.604Z","logger":"main","msg":"started kminion","version":"v2.2.1","built_at":"2022-11-24"}
{"level":"info","ts":"2022-12-09T13:10:08.604Z","logger":"main","msg":"connecting to Kafka seed brokers, trying to fetch cluster metadata","seed_brokers":"kafka:9092"}
{"level":"info","ts":"2022-12-09T13:10:10.248Z","logger":"main","msg":"successfully connected to kafka cluster"}

When the /metrics endpoint is requested in another terminal, KMinion crashes after a few seconds with the error below:

$ curl localhost:8080/metrics
curl: (52) Empty reply from server

Error on the KMinion terminal:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7de84d]

goroutine 73 [running]:
github.com/twmb/franz-go/pkg/kgo.(*describeGroupsSharder).shard(0x800?, {0xcf0750?, 0xc000460100?}, {0xcf4828?, 0xc00001a9c0}, {0x7f2688649fff?, 0x429ea5?})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:2544 +0xa2d
github.com/twmb/franz-go/pkg/kgo.(*Client).handleShardedReq.func2({0x0, {0xcf4828, 0xc00001a9c0}, {0x0, 0x0}})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:1730 +0x151
github.com/twmb/franz-go/pkg/kgo.(*Client).handleShardedReq(0xc000278000, {0xcf0750?, 0xc000460100}, {0xcf4828?, 0xc00001a9c0})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:1821 +0x9f9
github.com/twmb/franz-go/pkg/kgo.(*Client).shardedRequest(0xc000278000, {0xcf07f8?, 0xc00011ac60?}, {0xcf4828?, 0xc00001a9c0})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:978 +0x691
github.com/twmb/franz-go/pkg/kgo.(*Client).RequestSharded(0xc000494000?, {0xcf07f8?, 0xc00011ac60?}, {0xcf4828?, 0xc00001a9c0?})
        /go/pkg/mod/github.com/twmb/franz-go@v1.10.0/pkg/kgo/client.go:909 +0x3a
github.com/cloudhut/kminion/v2/minion.(*Service).DescribeConsumerGroups(0xc000494000, {0xcf07f8, 0xc00011ac60})
        /app/minion/describe_consumer_groups.go:70 +0x15b
github.com/cloudhut/kminion/v2/prometheus.(*Exporter).collectConsumerGroups(0xc000360000, {0xcf07f8?, 0xc00011ac60?}, 0xcea010?)
        /app/prometheus/collect_consumer_groups.go:18 +0x5e
github.com/cloudhut/kminion/v2/prometheus.(*Exporter).Collect(0xc000360000, 0xc000392f60?)
        /app/prometheus/exporter.go:235 +0x205
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
        /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/registry.go:456 +0x10d
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
        /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/registry.go:467 +0x55d

Please let me know if you need any further information to help debug this issue.

weeco commented 1 year ago

Hey @hhromic, thanks for filing the issue. We are not able to reproduce it, and we are using this KMinion version for several Redpanda clusters.

Could you share the configuration you are using? Are you able to provide a docker compose where this is reproducible? Thanks!

hhromic commented 1 year ago

Hey @weeco, thanks for the fast reply and sorry for not replying earlier. I tried to construct a minimal reproducible example in a simple Docker compose file, but (of course) everything works fine on a fresh Kafka cluster deployment. This crash is likely triggered by a corner case in our current Kafka cluster state.

I will try to debug a local copy of kminion, given how easy it is for me to reproduce the crash with our current Kafka clusters. The crash actually happens in franz-go and not in kminion itself, so I will likely start debugging there first.

Maybe the fix is due in franz-go and kminion would just need to bump its dependency on that library. I will keep you posted on what I manage to find. Cheers!

hhromic commented 1 year ago

Good news, I have found the problem. Confluent's Control Center has the (bad) habit of using a consumer group whose name/ID is the empty string. franz-go crashes in this for-loop when it encounters that empty-named group:

    for _, group := range req.Groups {
        berr := coordinators[group]
        var ke *kerr.Error
        switch {
        case berr.err == nil:
            brokerReq := brokerReqs[berr.b.meta.NodeID]   // crash is here
            if brokerReq == nil {
                brokerReq = newReq()
                brokerReqs[berr.b.meta.NodeID] = brokerReq
            }
            brokerReq.Groups = append(brokerReq.Groups, group)
        case errors.As(berr.err, &ke):
            kerrs[ke] = append(kerrs[ke], group)
        default:
            unkerrs = append(unkerrs, unkerr{berr.err, group})
        }
    }
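
To illustrate what I believe is the mechanism (simplified, with made-up type names): a Go map lookup for a key that was never stored returns the zero value of the value type, so for the empty-named group both `berr.err` and `berr.b` can be nil at the same time, and the `berr.b.meta.NodeID` access then dereferences a nil pointer. A minimal standalone sketch of that behaviour:

    package main

    import "fmt"

    // Stripped-down stand-ins for the franz-go internals quoted above; the
    // real struct names and fields differ, this only mirrors their shape.
    type broker struct{ nodeID int32 }

    type brokerOrErr struct {
        b   *broker
        err error
    }

    func main() {
        // Coordinators were only resolved for non-empty group names.
        coordinators := map[string]brokerOrErr{
            "my-group": {b: &broker{nodeID: 2}},
        }

        // A lookup for a key that is absent (here the empty-string group)
        // returns the zero value: err == nil AND b == nil at the same time.
        berr := coordinators[""]

        if berr.err == nil {
            // Mirrors the `brokerReqs[berr.b.meta.NodeID]` access above:
            // berr.b is nil, so this line panics with a nil pointer
            // dereference, matching the stack trace.
            fmt.Println(berr.b.nodeID)
        }
    }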

I can confirm that when I delete the consumer group with the empty-string name, kminion does not crash anymore. This is clearly a bug in franz-go and not in kminion, so I will continue this issue in their repository. I will leave this issue open here to notify you when the fix is released upstream.
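
As a stopgap for anyone else hitting this before the upstream fix lands: deleting the offending group works around the crash. I removed it with our own tooling, but an untested sketch using franz-go's kadm admin client along the following lines should do the same, assuming the brokers accept a DeleteGroups request for the empty group ID:

    package main

    import (
        "context"
        "log"

        "github.com/twmb/franz-go/pkg/kadm"
        "github.com/twmb/franz-go/pkg/kgo"
    )

    func main() {
        // Connect to the same seed brokers that kminion uses.
        cl, err := kgo.NewClient(kgo.SeedBrokers("kafka:9092"))
        if err != nil {
            log.Fatal(err)
        }
        defer cl.Close()

        // Issue a DeleteGroups request for the consumer group whose ID is
        // the empty string (the group left behind by Control Center).
        adm := kadm.NewClient(cl)
        resps, err := adm.DeleteGroups(context.Background(), "")
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("delete group responses: %v", resps)
    }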

hhromic commented 1 year ago

@weeco upstream franz-go has now fixed the issue (very quickly) :) I rebuilt KMinion locally with the latest franz-go version, v1.10.4, and I can confirm there are no more crashes with our cluster :) Everything is working as expected again:

$ curl -s localhost:8080/metrics | grep -F 'group_id=""'
kminion_kafka_consumer_group_info{coordinator_id="2",group_id="",protocol="",protocol_type="",state="Empty"} 0
kminion_kafka_consumer_group_members{group_id=""} 0
kminion_kafka_consumer_group_topic_lag{group_id="",topic_name="test-topic"} 0
kminion_kafka_consumer_group_topic_offset_sum{group_id="",topic_name="test-topic"} 5.478263e+06
kminion_kafka_consumer_group_topic_partition_lag{group_id="",partition_id="0",topic_name="test-topic"} 0

So all that is needed now is for kminion to bump franz-go, and this should be fixed.
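
For reference, the bump on the kminion side is just a routine Go module update followed by a rebuild (roughly what I did for my local test build):

$ go get github.com/twmb/franz-go@v1.10.4
$ go mod tidy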