openownership / register

A demonstration transnational register of beneficial ownership data from the UK, Denmark, Slovakia and Armenia
https://register.openownership.org
GNU Affero General Public License v3.0

PSC-STM-B3: Finish PSC Kinesis Stream Transformer #251

Open tiredpixel opened 9 months ago

tiredpixel commented 9 months ago

The main work recently (and the monthly import) has involved running the bulk transformer, which transforms the S3 files produced by buffering the Kinesis stream.

This means the app which consumes from Kinesis directly hasn’t been run or updated recently.

Estimate: 4 hours

tiredpixel commented 7 months ago

The existing code works for consuming from multiple Kinesis shards. However, the way it does so, reading each shard in turn within a single loop, isn't optimal. There are a number of considerations with this approach:

  1. Although the sleep interval follows Kinesis recommendations (so as not to poll the stream too frequently), it means catching up on a shard takes longer, because the consumer sleeps even for iterations which return no records.
  2. Interleaving between shards like this means that by the time the loop returns to the first shard, its iterator might have expired (shard iterators time out after 5 minutes).
  3. There isn't a separate thread for each shard (contrary to Kinesis recommendations), so processing these records cannot be done in parallel.
  4. It's not possible to start a separate process that consumes only one shard, so the program cannot be scaled horizontally.
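The single-loop behaviour behind points 1 and 2 can be sketched as follows. `StubClient` is a stand-in for the real aws-sdk-kinesis client; the method names and return shapes here are illustrative, not the actual API:

```ruby
# Minimal sketch of the current single-loop consumer. StubClient stands in
# for the real Kinesis client; shard iterators and records are simulated.
class StubClient
  def initialize(shards)
    @shards = shards # e.g. { "shard-0" => ["rec1", "rec2", ...], ... }
  end

  def get_shard_iterator(shard_id)
    { shard_id: shard_id, pos: 0 }
  end

  def get_records(iterator, limit: 100)
    records = @shards[iterator[:shard_id]][iterator[:pos], limit] || []
    next_it = { shard_id: iterator[:shard_id], pos: iterator[:pos] + records.size }
    [records, next_it]
  end
end

# One loop interleaves all shards: each pass reads a batch from every shard,
# then sleeps, even when a shard returned no records (consideration 1), and
# a slow pass risks another shard's iterator expiring before the loop
# returns to it (consideration 2).
def consume_interleaved(client, shard_ids, passes:, sleep_s: 0)
  iterators = shard_ids.map { |id| [id, client.get_shard_iterator(id)] }.to_h
  processed = []
  passes.times do
    shard_ids.each do |id|
      records, iterators[id] = client.get_records(iterators[id], limit: 2)
      processed.concat(records)
    end
    sleep(sleep_s) # paced per Kinesis read recommendations
  end
  processed
end

client = StubClient.new(
  "shard-0" => %w[a1 a2 a3 a4],
  "shard-1" => %w[b1 b2 b3 b4]
)
p consume_interleaved(client, %w[shard-0 shard-1], passes: 2)
# batches from the two shards interleave: a1 a2 b1 b2 a3 a4 b3 b4
```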

Extending to support additional threads or processes likely wouldn't be too much work. However, multi-threading hasn't always been smooth sailing with the existing bulk (i.e. non-stream) transformations, and I'm concerned it could lead to more conflicts when writing to Elasticsearch, resulting in program crashes.

Despite these limitations, the existing approach is likely good enough for us at present: we're using only a single shard per stream, and even a single shard can sustain far higher throughput than we can process, given how long it takes to process each statement. Moreover, using multiple shards affects event order, which would have to be considered carefully for our use case, since statements are generally order-sensitive.
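The ordering point can be illustrated with a hypothetical thread-per-shard sketch (one thread per shard being what Kinesis recommends). The record arrays here stand in for GetRecords batches; a real consumer would hold a shard iterator per thread:

```ruby
# Hypothetical thread-per-shard sketch. Each thread drains one shard's
# records into a shared thread-safe queue.
def consume_per_shard(shards)
  results = Queue.new # thread-safe FIFO
  shards.map do |shard_id, records|
    Thread.new do
      records.each { |rec| results << [shard_id, rec] }
    end
  end.each(&:join)
  out = []
  out << results.pop until results.empty?
  out
end

processed = consume_per_shard("shard-0" => %w[a1 a2], "shard-1" => %w[b1 b2])
# Within each shard, order is preserved (a1 before a2); across shards there
# is no guarantee, which is why order-sensitive statements would need a
# careful partition-key scheme before moving beyond a single shard.
```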

tiredpixel commented 7 months ago

Kinesis Quotas and Limits

Data stream throughput

Provisioned mode

There is no upper limit. Maximum throughput depends on the number of shards provisioned for the stream. Each shard can support up to 1 MB/sec or 1,000 records/sec write throughput or up to 2 MB/sec or 2,000 records/sec read throughput. […]

https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html

That is approximately three orders of magnitude more write throughput than we're currently utilising.
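As a rough check of that figure: the per-shard write limit comes from the quoted Kinesis quotas, while the current utilisation of about one record per second is an assumption, inferred from how long each statement takes to process:

```ruby
# Rough check of the "~3 orders of magnitude" figure. The per-shard limit
# is from the quoted Kinesis quotas; ~1 record/sec current utilisation is
# an assumption, not a measured rate.
shard_write_limit_rps = 1_000 # records/sec per shard (provisioned mode)
assumed_current_rps   = 1     # assumed: roughly one statement per second
headroom = shard_write_limit_rps.fdiv(assumed_current_rps)
puts format("headroom: ~%dx (~%d orders of magnitude)",
            headroom, Math.log10(headroom).round)
```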

tiredpixel commented 7 months ago

In keeping with recent work on other parts of the program, I'm not writing extra formal tests for this. I'm not convinced of the benefit of doing so, especially since following the previous pattern would mean stubbing calls to Kinesis and other external services (i.e. not actually executing them live) anyway. There are some existing tests checking some overall calls, but extending them would be significant work, and I'm unpersuaded of the merit given other details of the project, codebase, and roadmap.