tiredpixel opened 9 months ago
The existing code works for consuming from multiple Kinesis shards. However, the manner in which it does so isn't optimal:
There are a number of considerations with this approach:
Extending this to support additional threads or processes likely wouldn't be much work. However, multi-threading hasn't always been smooth sailing with the existing bulk-data (i.e. non-stream) transformations, and I'm concerned it could lead to more conflicts when writing to Elasticsearch, resulting in program crashes.
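The consumer code itself isn't quoted in this issue, so the following is only a guess at its shape: a minimal single-threaded round-robin pass over shards, assuming a boto3-style client interface (`get_shard_iterator` / `get_records`). A fake in-memory client stands in for AWS here so the sketch runs without credentials; all names other than the two client methods are hypothetical.

```python
class FakeKinesisClient:
    """Stand-in for a boto3 Kinesis client, returning canned records."""

    def __init__(self, records_by_shard):
        self._records = records_by_shard

    def get_shard_iterator(self, StreamName, ShardId, ShardIteratorType):
        # For this fake, the iterator token is simply the shard id.
        return {"ShardIterator": ShardId}

    def get_records(self, ShardIterator, Limit):
        recs = self._records.get(ShardIterator, [])
        return {"Records": recs[:Limit], "NextShardIterator": ShardIterator}


def consume_once(client, stream, shard_ids, limit=100):
    """One round-robin pass: fetch up to `limit` records from each shard
    in turn, single-threaded, returning them in fetch order."""
    out = []
    for shard_id in shard_ids:
        it = client.get_shard_iterator(
            StreamName=stream,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        resp = client.get_records(ShardIterator=it, Limit=limit)
        out.extend(resp["Records"])
    return out


client = FakeKinesisClient({
    "shardId-000": [{"Data": b"a"}, {"Data": b"b"}],
    "shardId-001": [{"Data": b"c"}],
})
records = consume_once(client, "statements", ["shardId-000", "shardId-001"])
print([r["Data"] for r in records])
```

The point of the sketch is the limitation: each shard waits its turn, so one slow or busy shard delays the others, and adding shards adds latency rather than parallelism.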
Despite these limitations, the existing approach is likely good enough for us at present: we're using only a single shard per stream, and even a single shard can cope with far higher throughput than we can, given how long it takes to process each statement. Moreover, using multiple shards affects event order, which would need careful consideration for our use case, since statements are generally order-sensitive.
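To illustrate the ordering concern, here's a hypothetical sketch (a simplified stand-in for Kinesis's MD5-based mapping of partition keys to shards; the keys and events are invented): records sharing a partition key always land on the same shard, so their relative order survives, but order across different keys is no longer guaranteed once there is more than one shard.

```python
import hashlib


def shard_for(partition_key, num_shards):
    # Simplified stand-in for Kinesis's MD5-based key-to-shard mapping.
    digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return digest % num_shards


events = [("user-1", "e1"), ("user-2", "e2"), ("user-1", "e3")]
shards = {}
for key, event in events:
    shards.setdefault(shard_for(key, 2), []).append(event)

# Per-key order survives within each shard; interleaving across shards
# depends on consumer timing, so global order is lost.
print(shards)
```

With a single shard, by contrast, every record shares one totally ordered sequence, which is why the current single-shard setup sidesteps the problem entirely.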
> **Data stream throughput**
>
> **Provisioned mode**
>
> There is no upper limit. Maximum throughput depends on the number of shards provisioned for the stream. Each shard can support up to 1 MB/sec or 1,000 records/sec write throughput or up to 2 MB/sec or 2,000 records/sec read throughput. […]

https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
That write limit is approximately 3 orders of magnitude higher than what we're currently utilising.
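As a rough sanity check of that figure: the per-shard write limit below is the AWS quota quoted above, but the current-usage number is an assumed illustration (~1 KB/sec), not a measured value from our streams.

```python
import math

SHARD_WRITE_BYTES_PER_SEC = 1_000_000   # 1 MB/sec per shard (AWS quota)
ASSUMED_CURRENT_BYTES_PER_SEC = 1_000   # hypothetical ~1 KB/sec usage

headroom = SHARD_WRITE_BYTES_PER_SEC / ASSUMED_CURRENT_BYTES_PER_SEC
orders = math.log10(headroom)
print(f"{headroom:.0f}x headroom, ~{orders:.0f} orders of magnitude")
```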
In keeping with recent work on other parts of the program, I'm not writing extra formal tests for this. I am not convinced of the benefit of doing so, especially as keeping to the previous pattern would result in calls to Kinesis and other external services being stubbed (i.e. not actually executed live) anyway. I note there are some existing tests checking some overall calls, but extending these would be significant work, and I'm unpersuaded about the merit of doing so considering other details of the project, codebase, and roadmap.
The main recent work (and the monthly import) has involved running the bulk transformer, which transforms the S3 files produced by buffering the Kinesis stream. This means the app which consumes from Kinesis directly hasn't been run or updated recently.
Estimate: 4 hours