Closed kylege closed 1 month ago
The current code is deliberate. If a consumer in a group is assigned a topic and a partition of that topic has no commit, the entire partition counts as lagged.
The original implementation did not consider topics with no commits as candidates to have lag. This was problematic because people were trying to calculate lag for a group that was bugged and never actually committed. The lag showed as zero, but technically by definition of how lag is calculated, the calculation should have shown the entire topic lagging.
We have many topics with cleanup.policy=delete, so those topics end up with zero messages after some time. When a consumer starts on such a topic, it never commits an offset because it never consumes a message, so the current code calculates a very large lag rather than zero.
So how can I fix this scenario? Should I implement the calculation myself like Kowl did?
Maybe we can fetch the partition watermarks to find out how many messages are in the topic, but this method requires more requests to the Kafka server.
We've actually had multiple reports of Kowl's (now Console) lag page being inaccurate -- for basically the reason that I do it differently in the code above.
However, I think given what you're saying, a different fix is to subtract the log start offset from the lag. If a topic has log end offset 30 and log start offset 30 and no commit, I think today it'll show a lag of 30, whereas this can be changed to show a lag of 0.
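To make the proposed arithmetic concrete, here is a minimal sketch of a per-partition lag calculation that counts from the log start offset when there is no commit. This is not kadm's implementation: `partitionLag` and its parameters are hypothetical names used only for illustration, and using a negative `committed` value to mean "no commit" is an assumption of this sketch.

```go
package main

import "fmt"

// partitionLag is a hypothetical helper, not part of kadm.
// logStart and logEnd are the partition's log start and log end offsets.
// committed is the group's committed offset; a negative value is used
// here (as an assumption of this sketch) to mean "no commit exists".
func partitionLag(logStart, logEnd, committed int64) int64 {
	from := committed
	if committed < 0 {
		// No commit: the group is behind by whatever is still retained,
		// so count from the log start offset instead of from zero.
		from = logStart
	}
	if from > logEnd {
		// Defensive guard so the result is never negative.
		return 0
	}
	return logEnd - from
}

func main() {
	// Log end 30, log start 30, no commit: the topic is empty.
	fmt.Println(partitionLag(30, 30, -1)) // prints 0
	// Log end 30, log start 0, no commit: everything is unread.
	fmt.Println(partitionLag(0, 30, -1)) // prints 30
	// Log end 30, committed 25: the usual case.
	fmt.Println(partitionLag(10, 30, 25)) // prints 5
}
```

With this shape, a group that never committed on a fully expired topic reports zero lag, while a group that never committed on a topic that still holds data reports the whole retained range as lag.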
Check out #744, I think that solves this. I'm going to release kadm soon.
If a topic has no committed offset for a consumer group, then we should not calculate lag for it.
For example, topic A's current high watermark offset is 100, but the topic has no messages because of the delete cleanup policy.
There will not be any offset commits for the consumer group because there are no messages to consume.
The function CalculateGroupLag will calculate a lag of 100 for the consumer group, which is wrong.