Open andrewreineke opened 4 years ago
This is an interesting observation, which caused me to have to re-think some of my assumptions regarding how the join/sync barrier works, so thanks for that.
To set the scene, the way that we've thought about rebalances and the offset commit mechanism is that as soon as a rebalance is necessary, we drop all ongoing work and re-join and sync as soon as possible. Essentially we regarded rebalances as a hard barrier where the best thing was to get the group into sync as soon as possible. As you noted, this can lead to unnecessary reprocessing of messages, and while we knew there were things we could do to improve this, we didn't think the additional complexity was worth it. What we were primarily thinking about was ways to continue from the same offset where you left off, assuming that you were still assigned the same partitions after a rebalance, but without committing.
The tricky thing about committing after a rebalance is detected is that you don't know if you are still assigned the same partitions. By the time you realize that the group is rebalancing, the group might have already kicked you out, at which point any commit you make would fail. Now, we could of course handle this, but we didn't design any of the interfaces to accommodate that, so I don't know off the top of my head if we could do it in a way that makes sense. The Java client seems to do it like this, which I find a bit strange, but then I'm not an expert on the Java client design.
BUT, this is just the first part of the issue, and one that you seem to have handled in user-land (even though I think maybe we should try to handle it inside KafkaJS instead, now that I know it's a possibility).
I think your diagnosis is correct. The `finally` block will indeed be executed as a microtask, which would get executed only after the next `scheduleFetch` is synchronously executed. Wrapping all the calls to `scheduleFetch` in a `setImmediate` would solve this, as it would then be scheduled after the `finally` block. It's a little brittle, in my opinion, but I also don't see an immediately (heh) better way of handling it.
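To illustrate the ordering in isolation (plain Node.js, not KafkaJS code): a call made synchronously from within the promise chain runs before the pending `finally` callback, while a `setImmediate` callback runs after it.

```js
Promise.resolve()
  .then(() => console.log('1: synchronous re-invocation inside the chain'))
  .finally(() => console.log('2: finally block (consuming = false)'))

setImmediate(() => console.log('3: setImmediate-deferred re-invocation'))
// prints 1, 2, 3 — deferring the call pushes it past the finally block
```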
We are having the same issue: we detect a rebalance in the middle of processing a batch, so we commit the last handled offset and exit the batch. We observed that the commit did not take effect, but somehow none of the commit calls we made threw an error. Maybe a useful change would be to make that function throw if it is not able to commit because of a rebalance.
Describe the bug
DISCLAIMER: This could 100% be abuse/misuse of the library, and I'd be happy if this was a case of RTFM, so hopefully someone can point out something I'm doing obviously wrong 😜
I'm attempting to use a combination of `eachBatch` and `commitOffsetsIfNecessary` in order to commit resolved offsets in the face of a rebalance (detected by try-catching `heartbeat()` and matching `err.type === "REBALANCE_IN_PROGRESS"`).

This works well until `consumer.disconnect()` is called, at which point the OffsetCommit request fails because the consumer has already issued LeaveGroup ("The coordinator is not aware of this member"). This results in frequent, unnecessary redeliveries on rebalances.
I believe this is due to a race condition on the `runner.consuming` bit, but I'm not positive. Below is a repo with more detailed code/repro steps, and a pared-down sketch of my `eachBatch` function follows the repro link.

To Reproduce
https://github.com/andrewreineke/kjs-rebalance-test (Setup/repro steps in README)
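A hypothetical reconstruction of that pared-down `eachBatch` handler, based only on the description above (the `processMessage` helper and the option values are assumptions; the actual code is in the linked repo):

```js
await consumer.run({
  eachBatchAutoResolve: false,
  eachBatch: async ({ batch, resolveOffset, heartbeat, commitOffsetsIfNecessary, isRunning, isStale }) => {
    for (const message of batch.messages) {
      if (!isRunning() || isStale()) break

      await processMessage(message) // app-specific work (assumed helper)
      resolveOffset(message.offset)

      try {
        await heartbeat()
      } catch (err) {
        if (err.type === 'REBALANCE_IN_PROGRESS') {
          // Rebalance detected mid-batch: commit what has been resolved so far
          // and bail out of the batch instead of continuing to process.
          await commitOffsetsIfNecessary()
          return
        }
        throw err
      }
    }

    await commitOffsetsIfNecessary()
  },
})
```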
Expected behavior
When gracefully disconnecting such a consumer, the resolved offsets should be committed successfully on teardown.
Observed behavior
When `consumer.disconnect()` is called, the OffsetCommit request fails because the consumer has already issued LeaveGroup ("The coordinator is not aware of this member").

Environment:
Additional context
My understanding of what's happening is:

- `consumer.disconnect()` is called
- `runner.stop()` calls `waitForConsumer()`, which uses `this.consuming` to decide whether to wait to leave the group or not
- In my case, `this.consuming` is false, so it leaves immediately, thereby preventing the success of any subsequent calls to `commitOffsets`

Why is `this.consuming === false`? I manually instrumented runner.js with a getter/setter with logging to understand what's going on. I found that after a rebalance occurs and new partition assignments are received, the flag is being set to true and then immediately set back to false. (Log excerpt below.)

I think what's happening is that `this.consuming = false` in the finally block of `runner.scheduleFetch()` is a promise microtask that gets scheduled after the `this.consuming = true` at the top of `scheduleFetch()`, when `scheduleFetch()` is fire-and-forget invoked in the catch block.
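To make the suspected ordering concrete, here is a small stand-alone sketch of the race (the `consuming` flag, `sleep` helper, and error shape are stand-ins for the runner internals, not actual KafkaJS source):

```js
let consuming = false
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function scheduleFetch({ failWithRebalance = false } = {}) {
  consuming = true
  try {
    await sleep(10) // stand-in for fetch + eachBatch processing
    if (failWithRebalance) {
      throw Object.assign(new Error('rebalance'), { type: 'REBALANCE_IN_PROGRESS' })
    }
  } catch (e) {
    if (e.type !== 'REBALANCE_IN_PROGRESS') throw e
    // Fire-and-forget restart: this sets `consuming = true` synchronously,
    // but as soon as this catch block returns, the OUTER call's finally runs
    // and flips the flag back to false while the new fetch is still in flight.
    scheduleFetch()
  } finally {
    consuming = false
  }
}

scheduleFetch({ failWithRebalance: true })
// ~15ms in: a fetch is in flight, yet `consuming` is already false, so a
// concurrent disconnect() would leave the group without waiting.
setTimeout(() => console.log('consuming while fetch in flight:', consuming), 15)
```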
Possible Solution
I have verified that the OffsetCommit request succeeds when wrapping all recursive calls to `scheduleFetch()` in `setImmediate(() => this.scheduleFetch())`, which seems to ensure that the `finally` block completes before touching `this.consuming` on the next iteration. But I'm not sure this is the optimal solution (and maybe this is working as intended per the intent of `runner.consuming`, and I'm misinterpreting).
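Concretely, the workaround I tested amounts to changing the recursive call sites along these lines (a sketch of the idea, not a verbatim diff of runner.js):

```js
// before: re-invoked synchronously, so the current call's pending finally
// (this.consuming = false) runs after the new call sets this.consuming = true
this.scheduleFetch()

// after: deferred to a macrotask, so the finally block completes first and
// this.consuming stays true for the duration of the next fetch
setImmediate(() => this.scheduleFetch())
```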