mhart / kinesalite

An implementation of Amazon's Kinesis built on LevelDB
MIT License
808 stars 86 forks source link

Fix NextShardIterator for compatibility with Kinesis Client Library (KCL) #34

Closed whummer closed 8 years ago

whummer commented 8 years ago

First of all, thanks for publishing this project @mhart - kinesalite is really awesome! :)

This PR is about a minor compatibility issue with Kinesis Client Library (KCL) [1]. In general, the combination of kinesalite and dynalite works really well for local integration testing with KCL.

We have one test case, however, where KCL works properly with AWS Kinesis but fails to work with kinesalite. The test consists of 1 stream, 1 producer which constantly pushes records to the stream, and 1 consumer (KCL process) which continuously reads records. In parallel, we slowly increase the shard allocation by running a sequence of SplitShard API calls.

By constantly polling the stream description, the KCL figures out that new shards are added. The handover from the "old" shard (parent) to the "new" shards (children) happens as soon as the KCL determines that the end of the old shard has been reached. The responsible piece of code can be found here [2], essentially KCL checks whether NextShardIterator is null, then sets isShardEndReached to true and starts initializing the new child shards from that point onwards. The problem is that the condition nextIterator == null is never satisfied with kinesalite, at least not in our test setup (the issue is reproducible).

The change in this PR fixes this issue and allows KCL to reliably switch to the new shards after a split operation. See also the comment in the code which explains why the change is necessary.

It is hard to tell at this point which implications this change has for other bits of the framework, but our test results indicate that this actually best mimics the behavior of Kinesis. Judging from the unit test results of this project, this is a non-breaking change as all test cases are still succeeding.

Please consider accepting this change as it covers a crucial functionality for a majority of users (KCL is probably the most widely used tool to interact with Kinesis). If you figure out a better solution, I'll be happy to test it against our KCL testbed. Thanks

[1] https://github.com/awslabs/amazon-kinesis-client [2] https://github.com/awslabs/amazon-kinesis-client/blob/c6e393c13ec348f77b8b08082ba56823776ee48a/src/main/java/com/amazonaws/services/kinesis/clientlibrary/lib/worker/KinesisDataFetcher.java#L73

mhart commented 8 years ago

@whummer interesting – I think if it's never null then there's a bug, and that should be fixed.

IIRC the reason I added this special check is because there are (or at least were) scenarios where, for a brief period after the stream is closed (ie, it has an EndingSequenceNumber), Kinesis would still continue to pass back non-null NextShardIterators for a short while (I believe until the time encoded in the iterator was past the final sequence number) – and I wanted to make sure I replicated this functionality. It may be that they've fixed that bug, or that the scenarios are more complicated than I thought (highly likely).

I'm not going to merge as is (no point having a true || check) – I'll either get rid of the check entirely, or will see if I can figure out what was causing that behaviour. Will get back to you soon, thanks!

mhart commented 8 years ago

Yeah, I've found the parts of my tests where I would check for that, based on testing against live Kinesis instances:

https://github.com/mhart/kinesalite/blob/master/test/splitShard.js#L344-L351

Kinesis will continue to put records in the "closed" shard until the time has clocked over to the new shard's create time

https://github.com/mhart/kinesalite/blob/master/test/splitShard.js#L381-L385

Sometimes returns with a NextShardIterator, sometimes doesn't

https://github.com/mhart/kinesalite/blob/master/test/splitShard.js#L417-L418

Can continue to return a number of times with a NextShardIterator after shard is closed

I'll have to see if this is still the case.

mhart commented 8 years ago

@whummer also, is there a short snippet of code that you can show that can reproduce this? Even if it's just pseudocode that shows the API calls being made and the various options being passed (like ShardIteratorType)

Because as far as I can tell, it should start to return an empty NextShardIterator at some point – otherwise this test would never pass:

https://github.com/mhart/kinesalite/blob/master/test/splitShard.js#L396-L427

whummer commented 8 years ago

Thanks for your quick response. Of course, the PR in its current form is merely for illustration.

I believe that the test case in https://github.com/mhart/kinesalite/blob/master/test/splitShard.js#L396-L427 hits a different part of the code because this one continuously calls GetShardIterator (hits actions/getShardIterator.js), whereas KCL continuously calls GetRecords (hits actions/getRecords.js) and uses the NextShardIterator contained in the response.

I'll try to put together a self-contained minimal working example, hopefully in the next couple of days. Thanks

mhart commented 8 years ago

Ah, yeah, of course – good to know on the GetRecords vs GetShardIterator distinction

mhart commented 8 years ago

Ah, I think I've spotted the issue...

nextSeq will always be the same thing, because lastItem doesn't exist:

var nextSeq = lastItem ? db.incrementSequence(lastItem._seqObj) : seqNo

So the generated NextShardIterator from createShardIterator on the next line will always have the same embedded sequence number – and seqObj is parsed out of this sequence number which means seqObj.seqTime >= endSeqObj.seqTime will never change.

At least... I think that may be the issue.

mhart commented 8 years ago

@whummer I think I've fixed this – can you checkout the increment-seq-correctly branch and let me know if that works for you?

mhart commented 8 years ago

Changes here: https://github.com/mhart/kinesalite/compare/increment-seq-correctly

whummer commented 8 years ago

@mhart yep that branch works. I'll let you know if we run into any other issues, but for now all looks great. Thanks for fixing this so quickly!

mhart commented 8 years ago

Great, thanks for trying that @whummer – released as v1.11.5