Closed chadlagore closed 4 years ago
I removed the condition and tested it here. Once the stream sees records again, it immediately catches up. The iterator age grows as expected, but as soon as records show up, it moves back to the tip. Do you see any danger in removing this condition, @itsvikramagr? I don't think we should assume we'll always see at least one record in a given micro-batch.
This also appears to have a significant impact on checkpoint recovery. In production we see tasks undergoing checkpoint recovery take up to 15 minutes to fetch their first batch; in stage this is just a few minutes (but the trigger is 30s, so it's still bad). I suspect it's happening because there are more `millisBehindLatest` to catch up on in those situations. After removing the condition, checkpoint recovery fell under the trigger window on stage. I have not used the fix in production yet.
@chadlagore - the condition was added as part of this PR - https://github.com/qubole/kinesis-sql/pull/49/files.
Will re-look at the code and go through your use-case to see what we can do here.
Thank you @itsvikramagr
The library gets stuck pulling records from Kinesis for very long periods of time when the stream is empty. By looking at the debug output, I have confirmed it is due to this condition always being true when the stream is empty (we lose the `lastSequenceNumber` and never regain it due to emptiness).

As a result of the prolonged pulling, we run into Kinesis read limits, which throttles us further. In this loop, for example, it hits the Kinesis API ~250 times to accomplish a micro-batch on a 30s trigger (which can take a few minutes to do). Eventually, `getMillisBehindLatest` becomes 0 and the loop can move forward. For reference, we use a fallback stream in one region that is normally empty, and union it with another region which normally has data. The full region finishes in about 5-10s; the empty region sometimes runs for minutes.

- [...] `getMillisBehindLatest` to move forward? The Kinesis API seems to move it slightly forward on every call.
- [...] `t0` returns empty, then it is safe to move the timestamp forward `t0 + getMillisBehindLatest` to the tip of the stream?
- [...] `avoidEmptyBatches` be taken into consideration somewhere in that condition? Is there another config we could use to circumvent this?

Example select log lines in the loop for a single micro-batch (library makes 257 pull attempts):
Exceptions raised (86 within micro-batch):
Current version: `spark-sql-kinesis_2.11-1.1.2-spark_2.4`
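The stall described in this issue can be illustrated with a toy simulation. This is only a sketch, not the library's code: `FakeEmptyShard`, `fetch_micro_batch`, and the `wait_for_tip` flag are hypothetical names, the polling condition is an assumed paraphrase of the one discussed here, and the numbers (250 s behind the tip, 1 s of progress per empty call) are made up to roughly match the ~250 calls reported above.

```python
from dataclasses import dataclass


@dataclass
class FakeEmptyShard:
    """Stand-in for an empty shard. This is NOT the real Kinesis API."""
    millis_behind_latest: int  # how far the iterator is behind the tip
    advance_per_call_ms: int   # assumed progress made by each empty call

    def get_records(self):
        # An empty shard returns no records, but each call nudges the
        # iterator a little closer to the tip of the stream.
        self.millis_behind_latest = max(
            0, self.millis_behind_latest - self.advance_per_call_ms
        )
        return [], self.millis_behind_latest


def fetch_micro_batch(shard, wait_for_tip, max_calls=1000):
    """Return (records, number_of_calls) for one simulated micro-batch."""
    calls = 0
    records = []
    while calls < max_calls:
        batch, behind = shard.get_records()
        calls += 1
        records.extend(batch)
        # Assumed form of the condition: keep polling while nothing has
        # been read and the iterator is still behind the tip.
        keep_polling = wait_for_tip and not records and behind > 0
        if not keep_polling:
            break
    return records, calls


# With the condition, the empty shard is polled until it reaches the tip.
_, calls_with = fetch_micro_batch(FakeEmptyShard(250_000, 1_000), wait_for_tip=True)
# Without it, the micro-batch returns after a single empty call.
_, calls_without = fetch_micro_batch(FakeEmptyShard(250_000, 1_000), wait_for_tip=False)
print(calls_with, calls_without)  # prints: 250 1
```

In this model, dropping the condition lets the empty micro-batch finish after one call, and the remaining lag is carried into the next trigger instead of being worked off inside a single batch. Whether that trade-off is safe in the real library is exactly the open question in this thread.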