Open · dyanneo opened this issue 3 weeks ago
Pinging code owners:
receiver/kafka: @pavolloffay @MovieStoreGuy
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Hi @dyanneo,
Based on my initial analysis:
Cleanup is run at the end of a session, once all ConsumeClaim goroutines have exited but before the offsets are committed for the very last time.
Cleanup(ConsumerGroupSession) error
We may have to be careful about using the Cleanup func to solve this issue. By the time Cleanup is called, the session's claims may no longer reflect the partitions that were assigned, or session.Claims() might be empty. We should verify that assumption as well...
That said, I agree with your approach: as long as we are guaranteed that the session is still accurately populated when Cleanup() runs, we can do something like this:
for topic, partitions := range session.Claims() {
    for _, partition := range partitions {
        c.telemetryBuilder.KafkaReceiverPartitionClose.Add(session.Context(), 1, metric.WithAttributes(
            attribute.String(attrInstanceName, c.id.Name()),
            attribute.String(attrTopic, topic),
            attribute.String(attrPartition, strconv.Itoa(int(partition))),
        ))
        // add cleanup for the offset_lag metric here as well
    }
}
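On the offset_lag comment in that snippet: the OpenTelemetry Go SDK has no way to delete an already-exported series from a synchronous instrument, so one option is to record a final 0 for each partition being released, inside the same loop. A minimal sketch, assuming the generated builder exposes KafkaReceiverOffsetLag with a Record method and the same attribute names as above:
// Assumption: KafkaReceiverOffsetLag is a synchronous gauge with Record().
// Recording 0 clears the stale lag reading for the partition being released;
// the SDK cannot remove the series itself, so zeroing it out is a workaround.
c.telemetryBuilder.KafkaReceiverOffsetLag.Record(session.Context(), 0, metric.WithAttributes(
    attribute.String(attrInstanceName, c.id.Name()),
    attribute.String(attrTopic, topic),
    attribute.String(attrPartition, strconv.Itoa(int(partition))),
))
Whether reporting 0 is better than simply letting the series go stale is a design call worth settling in the PR.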
@StephanSalas Thanks for your input on this issue. I appreciate your concerns, and the suggestion makes sense to me.
According to the docs, this is the life-cycle of a session in the Sarama Kafka consumer framework:
// The life-cycle of a session is represented by the following steps:
//
// 1. The consumers join the group (as explained in https://kafka.apache.org/documentation/#intro_consumers)
// and is assigned their "fair share" of partitions, aka 'claims'.
// 2. Before processing starts, the handler's Setup() hook is called to notify the user
// of the claims and allow any necessary preparation or alteration of state.
// 3. For each of the assigned claims the handler's ConsumeClaim() function is then called
// in a separate goroutine which requires it to be thread-safe. Any state must be carefully protected
// from concurrent reads/writes.
// 4. The session will persist until one of the ConsumeClaim() functions exits. This can be either when the
// parent context is canceled or when a server-side rebalance cycle is initiated.
// 5. Once all the ConsumeClaim() loops have exited, the handler's Cleanup() hook is called
// to allow the user to perform any final tasks before a rebalance.
// 6. Finally, marked offsets are committed one last time before claims are released.
https://github.com/IBM/sarama/blob/main/consumer_group.go#L23C1-L36C87
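For context, those hooks come from sarama's ConsumerGroupHandler interface. A bare-bones illustrative handler (placeholder names, not the receiver's actual types) showing where each step lands:
package kafkaexample

import "github.com/IBM/sarama"

type exampleHandler struct{}

// Setup runs after the group is joined and claims are assigned (steps 1-2).
func (h *exampleHandler) Setup(session sarama.ConsumerGroupSession) error {
    return nil
}

// ConsumeClaim runs once per assigned partition, each in its own goroutine (steps 3-4);
// message processing, and the offset-lag recording referenced later in this issue, happens here.
func (h *exampleHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
    for msg := range claim.Messages() {
        session.MarkMessage(msg, "")
    }
    return nil
}

// Cleanup runs after all ConsumeClaim goroutines exit, before the final offset
// commit (steps 5-6); this is the proposed place to close out per-partition metrics.
func (h *exampleHandler) Cleanup(session sarama.ConsumerGroupSession) error {
    return nil
}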
Thus it looks likely that we can do it in Cleanup(), except for one notable edge case:
// Please note, that once a rebalance is triggered, sessions must be completed within
// Config.Consumer.Group.Rebalance.Timeout. This means that ConsumeClaim() functions must exit
// as quickly as possible to allow time for Cleanup() and the final offset commit. If the timeout
// is exceeded, the consumer will be removed from the group by Kafka, which will cause offset
// commit failures.
// This method should be called inside an infinite loop, when a
// server-side rebalance happens, the consumer session will need to be
// recreated to get the new claims.
TODO: we may need to figure out how to handle this edge case... I will look into it. An initial idea is to use the Errors() channel listed here: https://github.com/IBM/sarama/blob/main/consumer_group.go#L87C2-L87C28
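For what it's worth, Errors() only carries anything when Consumer.Return.Errors is enabled, and it has to be drained alongside the usual Consume loop. A rough sketch of that pattern (function and variable names are illustrative, not the receiver's actual code):
package kafkaexample

import (
    "context"
    "log"

    "github.com/IBM/sarama"
)

func runConsumerGroup(ctx context.Context, brokers []string, groupID string, topics []string, handler sarama.ConsumerGroupHandler) error {
    config := sarama.NewConfig()
    // Without this, consumer errors are only logged internally and Errors() stays empty.
    config.Consumer.Return.Errors = true

    group, err := sarama.NewConsumerGroup(brokers, groupID, config)
    if err != nil {
        return err
    }
    defer group.Close()

    // Drain the error channel so rebalance/commit failures (e.g. exceeding
    // Rebalance.Timeout) are surfaced instead of silently dropped.
    go func() {
        for err := range group.Errors() {
            log.Printf("kafka consumer group error: %v", err)
        }
    }()

    // Consume must run in a loop: every server-side rebalance ends the session,
    // and a new session has to be created to pick up the new claims.
    for {
        if err := group.Consume(ctx, topics, handler); err != nil {
            return err
        }
        if ctx.Err() != nil {
            return ctx.Err()
        }
    }
}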
Component(s)
receiver/kafka
What happened?
We wanted to collect, graph, and alert on lag for the kafka receiver, but observed unexpected behavior in the last values of otelcol_kafka_receiver_offset_lag compared to the values reported by kafka's consumer-groups utility.
Description
The value of last for the measurement otelcol_kafka_receiver_offset_lag does not appear to be calculated correctly. Also, for context, we are seeing an issue in the otel collector where it keeps emitting lag metrics for partitions it's no longer consuming.
Steps to Reproduce
Graphing the metric's last value, we were surprised to see values that didn't correspond to the kafka utility's output.
Expected Result
Per this screenshot, the value of partition 4's lag as shown by kafka's consumer-groups.sh utility changes over time, and is in the low hundreds or smaller:
Actual Result
Per this screenshot, the value of partition 4's lag does not match the true lag reported by kafka's tools:
Query in Grafana query builder:
Query as Grafana is running it:
Collector version
otelcol_version: 0.109.0
Environment information
Environment
OS: linux container
OpenTelemetry Collector configuration
Log output
Additional context
We think the reason for the incorrect data is that the gauge still exists within OTEL's registry even after a rebalance, while the metric is no longer receiving updates.
Where this gauge is defined:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/0d28558da65cbce23963906dbb3205fa2f383c0c/receiver/kafkareceiver/internal/metadata/generated_telemetry.go#L75-L79
One of the places where this gauge is updated:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/0d28558da65cbce23963906dbb3205fa2f383c0c/receiver/kafkareceiver/kafka_receiver.go#L560
Where this could be addressed:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/0d28558da65cbce23963906dbb3205fa2f383c0c/receiver/kafkareceiver/kafka_receiver.go#L529-L531
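Putting the pieces from the discussion above together, the linked Cleanup hook could end up looking roughly like this. This is only a sketch: the handler type and field names are assumptions, and whether recording 0 is the right way to retire a lag series is still an open question.
// Sketch only: emit the partition-close counter and clear the stale lag series
// for every claim this session held, before the final offset commit runs.
func (c *consumerGroupHandler) Cleanup(session sarama.ConsumerGroupSession) error {
    for topic, partitions := range session.Claims() {
        for _, partition := range partitions {
            attrs := metric.WithAttributes(
                attribute.String(attrInstanceName, c.id.Name()),
                attribute.String(attrTopic, topic),
                attribute.String(attrPartition, strconv.Itoa(int(partition))),
            )
            c.telemetryBuilder.KafkaReceiverPartitionClose.Add(session.Context(), 1, attrs)
            // Assumes KafkaReceiverOffsetLag exposes Record(); see the discussion above.
            c.telemetryBuilder.KafkaReceiverOffsetLag.Record(session.Context(), 0, attrs)
        }
    }
    return nil
}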