twmb / franz-go

franz-go contains a feature complete, pure Go library for interacting with Kafka from 0.8.0 through 3.6+. Producing, consuming, transacting, administrating, etc.
BSD 3-Clause "New" or "Revised" License

fix: Skip calculate lag when a topic has no offset commits #718

Closed kylege closed 1 month ago

kylege commented 2 months ago

If a topic has no committed offset for a consumer group, we should not calculate lag for it.

For example, topic A's current high watermark is 100, but the topic holds no messages because the delete cleanup policy has removed them.

The consumer group will never commit an offset for it, because there are no messages to consume.

The function CalculateGroupLag then reports a lag of 100 for the consumer group, which is wrong.

twmb commented 2 months ago

The current code is deliberate. If a consumer in a group is assigned a topic, then if that topic/partition has no commit, it means the entire topic/partition is lagged.

The original implementation did not consider topics with no commits as candidates to have lag. This was problematic because people were trying to calculate lag for a group that was bugged and never actually committed. The lag showed as zero, but technically by definition of how lag is calculated, the calculation should have shown the entire topic lagging.
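To make the difference between the two behaviors concrete, here is a minimal sketch. The `partitionOffsets` type and both function names are hypothetical, made up for illustration; they are not kadm types or the actual CalculateGroupLag code. The sketch only assumes that lag is "log end offset minus committed offset" and shows what each approach reports for a partition that has never been committed.

```go
package main

import "fmt"

// partitionOffsets is a hypothetical container used only for this sketch.
// It is not a kadm type.
type partitionOffsets struct {
	end       int64 // log end offset (high watermark)
	committed int64 // committed offset; only meaningful when hasCommit is true
	hasCommit bool  // whether the group has ever committed for this partition
}

// lagIgnoringUncommitted mirrors the original behavior described above:
// a partition without a commit contributes zero lag.
func lagIgnoringUncommitted(p partitionOffsets) int64 {
	if !p.hasCommit {
		return 0
	}
	return p.end - p.committed
}

// lagCountingUncommitted mirrors the deliberate behavior described above:
// a partition without a commit is treated as lagging in full.
func lagCountingUncommitted(p partitionOffsets) int64 {
	if !p.hasCommit {
		return p.end
	}
	return p.end - p.committed
}

func main() {
	// A group that is assigned the partition but never committed.
	stuck := partitionOffsets{end: 100}

	fmt.Println(lagIgnoringUncommitted(stuck)) // 0
	fmt.Println(lagCountingUncommitted(stuck)) // 100
}
```

With the first function, a group that is broken and never commits looks healthy; with the second, it shows up as fully lagged.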

kylege commented 2 months ago

We have many topics with cleanup.policy=delete, so those topics end up with zero messages after some time. When a consumer starts, it never commits an offset because it consumes no messages, so the current code calculates a very large lag rather than zero.

So how can I handle this scenario? Should I implement the calculation myself, like Kowl did?

Maybe we can fetch the partition watermarks to find out how many messages are in the topic, but that requires more requests to the Kafka server.

twmb commented 2 months ago

We've actually had multiple reports of Kowl's (now Console) lag page being inaccurate -- for basically the reason that I do it differently in the code above.

However, I think given what you're saying, a different fix is to subtract the log start offset from the lag. If a topic has log end offset 30 and log start offset 30 and no commit, I think today it'll show a lag of 30, whereas this can be changed to show a lag of 0.
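A rough sketch of that idea, under stated assumptions: the function name `lagWithLogStart` and its parameters are hypothetical, and this is not the code from kadm or from any later change. It assumes that when there is no commit, lag is measured from the log start offset instead of from zero.

```go
package main

import "fmt"

// lagWithLogStart is a hypothetical helper for illustration only.
// start is the log start offset, end is the log end offset, and
// committed is only meaningful when hasCommit is true.
func lagWithLogStart(start, end, committed int64, hasCommit bool) int64 {
	if !hasCommit {
		// No commit: only the records still present in the log count as lag.
		return end - start
	}
	return end - committed
}

func main() {
	// The case from the comment above: start 30, end 30, no commit.
	fmt.Println(lagWithLogStart(30, 30, 0, false)) // 0

	// Retention removed the first 40 records; 60 remain and none are committed.
	fmt.Println(lagWithLogStart(40, 100, 0, false)) // 60

	// A group that has committed offset 90 of 100.
	fmt.Println(lagWithLogStart(0, 100, 90, true)) // 10
}
```

An empty topic with no commits then reports zero lag, while a group that never commits on a topic that still holds records continues to show up as lagging.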

twmb commented 1 month ago

Check out #744, I think that solves this. I'm going to release kadm soon.