snowplow / snowplow-s3-loader

Mirrors a Kinesis stream to Amazon S3 using the KCL
http://snowplowanalytics.com
42 stars 38 forks source link

Eliminate possibility of OutOfMemory errors from scaling Kinesis Shards #250

Open jbeemster opened 2 years ago

jbeemster commented 2 years ago

Currently the S3 Loader buffers operate on a per-shard basis. This means that if your rotation buffer is 64mb each shard allocated to the consumer can consume this much memory. If the number of shards suddenly scales you run the risk of needing not 64mb of memory for this buffer but instead N x MaxByteBuffer - let alone overhead for the JVM and processing in general being done.

This behavior makes it impossible to auto-scale consistently as you never know how much memory an individual consumer might end up needing.

istreeter commented 2 years ago

A good way to fix this is to rewrite the loader as a fs2 app, using fs2-aws as the kinesis source. Using that library, the shards would share a single buffer, so memory usage should be approximately constant even as the number of input shards increases.

This would be a big rewrite of the app, but it's a change we want to do anyway, and it is consistent with how we are now writing all other Snowplow apps.