mhart / kinesalite

An implementation of Amazon's Kinesis built on LevelDB
MIT License

Add limits to get records #87

Open kailashhd opened 5 years ago

kailashhd commented 5 years ago

Adding a test for this now. Early changes to check if the approach looks good.

kailashhd commented 5 years ago

@mhart Request for a review.

mhart commented 5 years ago

At first glance it looks... a bit odd? Like, sumOfBytes is just going to increase each time you call the function – so if you make two requests with 3MB then the second one will go over. Also, I'm not sure what mathjs is for? And in general I prefer to keep the number of dependencies down – so would prefer not to depend on object-sizeof either.

Also, it would be good to have a test that is one byte under 5MB and one that is just one byte over – so both paths can be tested.

kailashhd commented 5 years ago

Thank you very much for the response. I reset sumOfBytes towards the end of the readStream, so after the getRecords call returns, sumOfBytes will be back to zero.

Yes, I agree it is good to have limited dependencies, but I could not find a good way to calculate the size of an object without using object-sizeof (Node.js noob here). mathjs is pulled into the dependencies for the pow function (I can remove that).

I will add the tests and request a review soon. Thanks a lot for your time reviewing this.

mhart commented 5 years ago

Ah, I see – yeah, JavaScript has a built-in global Math object, so Math.pow is available without needing extra libraries: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/pow

In any case, I prefer `5 * 1024 * 1024` – I think it's easier to read.

Leave object-sizeof in there for now, I'll have a look at the PR when you're done with it and see if it's worth including, or whether a simple function would do.

In terms of your sumOfBytes comment – if two read streams are occurring at once, then it won't be correct. (JS is single-threaded, but streams are asynchronous and chunked, so multiple streams could be active at once). Best to just have sumOfBytes as a bound variable in the function.

If you address the tests, I can clean up the JS code as best I can.
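
For illustration, here's a minimal, self-contained sketch of that suggestion (not the actual kinesalite code – the helper names, the base64 assumption for Data, and the 5 MB figure under discussion here are all illustrative):

```js
// Hypothetical sketch, not the actual kinesalite implementation.
// The 5 MB figure is simply the limit being discussed at this point in the thread.
var MAX_RESPONSE_SIZE = 5 * 1024 * 1024

// Rough per-record size without object-sizeof: decoded data blob plus partition key.
// Assumes record.Data is a base64-encoded string, as in the API shape.
function recordSize(record) {
  return Buffer.byteLength(record.Data, 'base64') + Buffer.byteLength(record.PartitionKey, 'utf8')
}

// Take records from the front of allRecords until adding one more would exceed
// the response size cap. sumOfBytes is bound inside the function, so two
// simultaneous calls each get their own counter.
function limitRecords(allRecords) {
  var records = [], sumOfBytes = 0
  for (var i = 0; i < allRecords.length; i++) {
    var size = recordSize(allRecords[i])
    if (sumOfBytes + size > MAX_RESPONSE_SIZE) break
    sumOfBytes += size
    records.push(allRecords[i])
  }
  return records
}

// Example: five ~1 MB records
var oneMb = Buffer.alloc(1024 * 1024).toString('base64')
var sample = Array.from({length: 5}, function(_, i) {
  return {Data: oneMb, PartitionKey: 'key' + i}
})
console.log(limitRecords(sample).length) // prints 4 – the fifth record would push the batch past the cap
```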

kailashhd commented 5 years ago

I added the tests as requested. I was also able to remove both the object-sizeof and mathjs dependencies. I am not sure if I am handling all the possible exceptions correctly or not – do let me know if I can improve that.

Also, do I need to keep the package-lock consistent with the previous version? (I may be using newer libraries and ended up creating a new package-lock file.)

Also, would it make sense for me to squash and create a single commit?

kailashhd commented 5 years ago

@mhart Apologies for the delay in addressing code review comments. Request for review.

kailashhd commented 5 years ago

@mhart Gentle reminder.

kailashhd commented 5 years ago

@mhart Gentle reminder.

mhart commented 5 years ago

@kailashhd I'm struggling to reproduce this on production – I'm just reading over the documentation now at https://docs.aws.amazon.com/kinesis/latest/APIReference/API_GetRecords.html and it states "each shard can read up to 2 MiB per second" and "The maximum size of data that GetRecords can return is 10 MiB"

So I'm not sure where the 5MiB figure came from? When I run your test (on production) it only retrieves one record, not five – I'm guessing that's because it's running into the 2MiB per second limit.

kailashhd commented 5 years ago

@mhart Apologies. I must have used the limits from PutRecords (https://docs.aws.amazon.com/kinesis/latest/APIReference/API_PutRecords.html) instead of the values for GetRecords. That's where the 5 MB comes from.

So my test should return 5 records (at least that's what I am asserting) in a single response. It should not hit the 2 MiB limit because PutRecords ensures that each record is a maximum of 1 MB (and in LevelDB with kinesalite we don't change the records at all).

That's also why I included another test with a total of 15 records, which should come back over 3 different responses.

mhart commented 5 years ago

When you put those records into a stream on Kinesis (I mean, on a real AWS Kinesis stream) – and then retrieve them with GetRecords – you're saying you get all five back? I only get one back.

kailashhd commented 5 years ago

This was my test:

1. Created a bunch of records of size ~1 MB.
2. Used the following AWS CLI command to get a shard iterator: `aws kinesis get-shard-iterator --stream-name <test-stream> --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON`
3. Got the records with `aws kinesis get-records --shard-iterator <shard-iterator>`.

In this case I got all 5 records in one go.

mhart commented 5 years ago

You need to try with `--no-paginate`, otherwise the AWS CLI will automatically keep making GetRecords calls until `NextShardIterator` is empty.
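
For what it's worth, the same check can be done as a single un-paginated call with the AWS SDK for JavaScript (v2) – a rough sketch, where the region and stream name are placeholders:

```js
// Rough sketch: makes exactly one GetRecords call, so there is no
// CLI-style auto-pagination in the way. STREAM_NAME is a placeholder.
var AWS = require('aws-sdk')

// (To target a local kinesalite instead of real Kinesis, an endpoint option can be passed here.)
var kinesis = new AWS.Kinesis({region: 'us-east-1'})
var STREAM_NAME = 'test-stream'

kinesis.getShardIterator({
  StreamName: STREAM_NAME,
  ShardId: 'shardId-000000000000',
  ShardIteratorType: 'TRIM_HORIZON',
}, function(err, data) {
  if (err) throw err
  kinesis.getRecords({ShardIterator: data.ShardIterator}, function(err, data) {
    if (err) throw err
    // How many records came back in this single response?
    console.log(data.Records.length)
  })
})
```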

kailashhd commented 5 years ago

Strangely, even with `--no-paginate` I am still able to get all 5 records in one go. I also tried producing 5 additional records (a total of 10 records) and did a get-records – I still got only 5 records (as I was close to the 10 MiB mark with 5 records).

My AWS CLI version: `aws-cli/1.16.70 Python/3.7.1 Darwin/17.7.0 botocore/1.12.60`. I wonder if we are using different versions.

mhart commented 5 years ago

Hmmm, strange – maybe it's an issue with the PutRecords call in your test then? Maybe it's only putting one record? I'll double check tomorrow.

mhart commented 5 years ago

Ah yeah, that's the issue. Here's the return from the PutRecords call:

[ { SequenceNumber: '49594189147623985939251953501666965394405037144469929986',
    ShardId: 'shardId-000000000000' },
  { ErrorCode: 'ProvisionedThroughputExceededException',
    ErrorMessage:
     'Rate exceeded for shard shardId-000000000000 in stream __kinesalite_test_415476646818213 under account XXXXXXXXXXX.' },
  { ErrorCode: 'ProvisionedThroughputExceededException',
    ErrorMessage:
     'Rate exceeded for shard shardId-000000000000 in stream __kinesalite_test_415476646818213 under account XXXXXXXXXXX.' },
  { ErrorCode: 'ProvisionedThroughputExceededException',
    ErrorMessage:
     'Rate exceeded for shard shardId-000000000000 in stream __kinesalite_test_415476646818213 under account XXXXXXXXXXX.' },
  { ErrorCode: 'ProvisionedThroughputExceededException',
    ErrorMessage:
     'Rate exceeded for shard shardId-000000000000 in stream __kinesalite_test_415476646818213 under account XXXXXXXXXXX.' } ]

kailashhd commented 5 years ago

Ha, I did not realize that was the issue. The max 5 MB request size for PutRecords (defined here: https://docs.aws.amazon.com/cli/latest/reference/kinesis/put-records.html) only works as long as each shard receives at most 1 MB per second. Since we are testing with 1 shard, this is causing problems. Let me try to change the test tomorrow to do sequential PutRecord calls instead of PutRecords. I will also change the limit to be less than 10 MiB instead of 5 MB.
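
For reference, a rough sketch of what that sequential approach could look like with the AWS SDK for JavaScript (v2) – the stream name, region, record size and one-second pause are placeholders, not the final test code:

```js
// Rough sketch, not the actual test code: writes records one at a time,
// pausing between calls so a single shard's 1 MB/sec write limit isn't exceeded.
var AWS = require('aws-sdk')
var kinesis = new AWS.Kinesis({region: 'us-east-1'})

var STREAM_NAME = 'test-stream' // placeholder
var RECORD_COUNT = 5
var payload = Buffer.alloc(1023 * 1024) // just under 1 MiB to stay within the per-record limit

function putNext(i) {
  if (i >= RECORD_COUNT) return console.log('done')
  kinesis.putRecord({
    StreamName: STREAM_NAME,
    PartitionKey: 'key' + i,
    Data: payload,
  }, function(err) {
    if (err) throw err
    // Wait before the next record so we stay under the per-shard write throughput
    setTimeout(function() { putNext(i + 1) }, 1000)
  })
}

putNext(0)
```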