LimitExceededException with multiple running queries per application

chadlagore commented 4 years ago

We're currently seeing rate limit exceptions from the DescribeStream API. We have 15-20 queries running in one application, and DescribeStream only accepts 10 requests per second per account. The trigger time is 30s, so I believe the following is happening:

1. The kinesis-sql client notices that describeShardInterval has passed, and issues one API call per running query.
2. Only ~10 succeed.
3. retryIntervalMs later, the remaining calls get issued, some more succeed, and so on.
4. If numRetries is exhausted, the remainder fail.
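
The sequence above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the function name `simulate_burst` is hypothetical, and the throttle is modeled as a fixed one-second window that admits a fixed number of calls, which may not match how the service actually enforces its limit.

```python
from collections import Counter


def simulate_burst(num_queries, limit_per_second, num_retries, retry_interval_s):
    """Toy model (assumption): every query calls at t=0, a fixed one-second
    window admits at most `limit_per_second` calls, and rejected calls retry
    after a fixed delay. Returns how many queries exhaust their retries."""
    admitted = Counter()  # window index -> calls admitted in that window
    failed = 0
    for _ in range(num_queries):
        t = 0.0
        # One initial attempt plus `num_retries` retries.
        for _attempt in range(num_retries + 1):
            window = int(t)
            if admitted[window] < limit_per_second:
                admitted[window] += 1
                break
            t += retry_interval_s
        else:
            # Every attempt landed in a full window.
            failed += 1
    return failed
```

In this model, 20 queries against a limit of 10 with 3 retries spaced 1 s apart all succeed eventually, while the same 20 queries with retries spaced 0.1 s apart leave 10 failures, because every retry lands in the same full window.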

Options we've explored:

- Increasing describeShardInterval. This did not work, for obvious reasons: the thundering herd described above still takes place eventually.
- Adding some jitter to the trigger intervals. I don't see a Spark config for this, but we could probably fake it on a per-query basis.
- Disabling getLatestShardInfo calls altogether. This is not actually ideal or possible, but we could approximate it in the short term by setting describeShardInterval to a very large number. Getting new shards would then require a redeploy.
- Increasing retryIntervalMs to >1s. The hypothesis is that the current value is small enough to cause some edge condition.
- Requesting an increase in QPS on DescribeStream (request submitted, unsure whether this will work).
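
A minimal sketch of the per-query jitter idea, assuming describeShardInterval accepts a duration string such as "3600s" (the format used elsewhere in this thread). The helper `jittered_intervals` and its parameters are hypothetical and not part of kinesis-sql.

```python
import random


def jittered_intervals(query_names, base_seconds, jitter_fraction=0.2, seed=None):
    """Return one describeShardInterval string per query, each offset from
    `base_seconds` by a random amount of at most +/- `jitter_fraction`."""
    rng = random.Random(seed)  # seedable so a deployment can be reproduced
    intervals = {}
    for name in query_names:
        offset = rng.uniform(-jitter_fraction, jitter_fraction) * base_seconds
        # Never go below one second.
        intervals[name] = f"{max(1, round(base_seconds + offset))}s"
    return intervals
```

Each query would then be started with its own value, so the refresh calls drift apart instead of firing together. Different periods can still coincide occasionally, so this reduces the burst rather than ruling it out.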

Have you experienced this? I don't suppose it is possible to jitter the describeShardInterval per query? Is there another solution you could recommend? Happy to open a PR if it requires a change.

I believe this is unrelated to https://github.com/qubole/kinesis-sql/issues/50.

EDIT: Also now noticing that list-shards might now be more appropriate to use?

itsvikramagr commented 4 years ago

@chadlagore - thanks for bringing up the issue

list-shards will be a good alternative to describeShard. Would be great if you can open a PR for it.

In the absence of the list-shards API, I can think of the following options (most of which you have already tried):

- Know your resharding frequency and set describeShardInterval accordingly. 3600s (1 hr) would be a good value if you don't reshard too frequently.
- Try giving different values of describeShardInterval to different streaming queries.
- Yes, we can start with a higher value of retryIntervalMs, say 2 s, and allow more retries (kinesis.client.numRetries), say 10 or so.
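
The suggestions above could be combined per query roughly as follows. This is a sketch under assumptions: only `kinesis.client.numRetries` is spelled out in this thread, so the full keys `kinesis.client.describeShardInterval` and `kinesis.client.retryIntervalMs` are assumed to follow the same prefix, and the helper name and the 60 s stagger are hypothetical.

```python
def kinesis_client_options(query_index, base_interval_s=3600, stagger_s=60,
                           retry_interval_ms=2000, num_retries=10):
    """Build connector options for one streaming query.

    Each query gets a slightly different describeShardInterval so that
    the shard refresh calls do not all fire at the same moment.
    """
    interval_s = base_interval_s + query_index * stagger_s
    return {
        # Assumed key names, except numRetries, which appears in this thread.
        "kinesis.client.describeShardInterval": f"{interval_s}s",
        "kinesis.client.retryIntervalMs": str(retry_interval_ms),
        "kinesis.client.numRetries": str(num_retries),
    }


# Hypothetical usage with an existing SparkSession named `spark`:
#   reader = spark.readStream.format("kinesis").options(**kinesis_client_options(3))
```

Values are passed as strings because Spark data source options are string-valued.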

chadlagore commented 4 years ago

Thanks for the update. We've been able to increase this limit via AWS support for the time being. It is only 20, but it gives us a bit of breathing room. We've migrated some of our other services to list-shards. I suspect we'll hit the limit again in the future, and before then we'll probably have to make a change here.

chadlagore commented 4 years ago

Fix underway in https://github.com/qubole/kinesis-sql/issues/83
