
Deal with Dynamo write throughput scaling during batch writes #345

Open · gcv opened this issue 5 years ago

gcv commented 5 years ago

We need to deal with the spiky loads we get when a new report adds a slew of new gene requirements, potentially causing very large imports (adding 60 new genes means 60×N writes, where N is the number of users in the system!).

In addition to increasing table throughput, we may need to split the update work across batched Lambda invocations. The places where this needs to happen are marked with `TODO: Split into pieces before calling?` in bioinformatics.
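For instance, the split could look something like the following sketch, which chunks the gene list and fires one async Lambda invocation per chunk. The function name, payload shape, and chunk size are all assumptions, not actual project code:

```typescript
import { Lambda } from "aws-sdk";

const lambda = new Lambda();
const CHUNK_SIZE = 10; // assumption: small enough to finish within the Lambda timeout

// Instead of one giant import, invoke the update Lambda once per chunk of genes,
// so a 60-gene report becomes several small imports.
async function invokeUpdateInChunks(genes: string[]): Promise<void> {
  for (let i = 0; i < genes.length; i += CHUNK_SIZE) {
    await lambda.invoke({
      FunctionName: "bioinformatics-update",  // hypothetical function name
      InvocationType: "Event",                // fire-and-forget async invocation
      Payload: JSON.stringify({ genes: genes.slice(i, i + CHUNK_SIZE) }),
    }).promise();
  }
}
```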

Dynamo auto-scaling may not react quickly enough. According to this article, Dynamo auto-scaling is implemented as a CloudWatch alarm, which can take up to 15 minutes to react. This will not work for us, as we have to contend with short Lambda timeouts while doing Dynamo writes (300-second maximum).

A Serverless plugin for doing some of this exists, but it may not create a sufficiently aggressive CloudWatch scaling alarm.

Another article that covers Dynamo scaling: https://medium.com/rue-la-la-tech/how-rue-la-la-bulk-loads-into-dynamodb-ad1469af578e

gcv commented 5 years ago

Proposed solution:

  1. No auto-scaling.
  2. Figure out what throughput we need to achieve some reasonable write rate (1000 base entries per second?).
  3. Increase Dynamo write throughput before running an update.
  4. Run the update.
  5. Decrease Dynamo write throughput.

Need to check whether Dynamo alerts us when write throughput is left set high. Otherwise it can get awfully expensive if the process fails to set the write throughput back down afterwards.
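A minimal sketch of steps 3-5, using the Node.js aws-sdk (v2). The table name ("VariantCall") and the capacity numbers are assumptions, not actual project config:

```typescript
import { DynamoDB } from "aws-sdk";

const dynamo = new DynamoDB();

// Set the table's provisioned write capacity and wait until it is ACTIVE again.
// (Real code should skip the update when capacity is already at the target,
// since UpdateTable fails if the new throughput equals the current one.)
async function setWriteCapacity(tableName: string, writeCapacityUnits: number): Promise<void> {
  const { Table } = await dynamo.describeTable({ TableName: tableName }).promise();
  await dynamo.updateTable({
    TableName: tableName,
    ProvisionedThroughput: {
      ReadCapacityUnits: Table!.ProvisionedThroughput!.ReadCapacityUnits!, // leave reads unchanged
      WriteCapacityUnits: writeCapacityUnits,
    },
  }).promise();
  // The "tableExists" waiter polls DescribeTable until the table status is ACTIVE.
  await dynamo.waitFor("tableExists", { TableName: tableName }).promise();
}

async function runBatchUpdate(writeAll: () => Promise<void>): Promise<void> {
  await setWriteCapacity("VariantCall", 1000); // step 3: scale up before the update
  try {
    await writeAll();                          // step 4: run the update
  } finally {
    await setWriteCapacity("VariantCall", 5);  // step 5: scale back down, even on failure
  }
}
```

The try/finally addresses the concern above: the scale-down runs even when the update itself fails.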

aneilbaboo commented 5 years ago

Copying the solution proposed in https://github.com/precisely/web/issues/363: Doing blind Lambda-based writes to Dynamo is unsustainable for larger numbers of users and bases referenced in reports. We need to transition to performing all Dynamo writes, throttling, and throughput scaling by means of a queue.

This can probably be done without breaking existing code. Lambdas responsible for Dynamo writes will instead enqueue the needed operations; a separate process will dequeue and perform the actual writes, and can also handle Dynamo scaling. When the queue grows to the point that the writer can no longer keep up without triggering a partition split, it can alert us so we can figure out what to do.
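A rough sketch of that shape, again with the Node.js aws-sdk (v2). The queue URL env var, message format, and function names are illustrative, not actual project code:

```typescript
import { DynamoDB, SQS } from "aws-sdk";

const sqs = new SQS();
const documentClient = new DynamoDB.DocumentClient();
const QUEUE_URL = process.env.WRITE_QUEUE_URL!; // hypothetical env var

// Producer: instead of writing to Dynamo directly, a Lambda enqueues the operation.
async function enqueueWrite(item: object): Promise<void> {
  await sqs.sendMessage({
    QueueUrl: QUEUE_URL,
    MessageBody: JSON.stringify({ table: "VariantCall", item }),
  }).promise();
}

// Consumer: a separate process drains the queue and performs the actual writes,
// so throttling and throughput scaling live in one place.
async function drainQueue(): Promise<void> {
  for (;;) {
    const { Messages } = await sqs.receiveMessage({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20, // long polling
    }).promise();
    if (!Messages || Messages.length === 0) return;
    for (const message of Messages) {
      const { table, item } = JSON.parse(message.Body!);
      await documentClient.put({ TableName: table, Item: item }).promise();
      await sqs.deleteMessage({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle!,
      }).promise();
    }
  }
}
```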

aneilbaboo commented 5 years ago

WORK IN PROGRESS

There are two separate issues:

  1. WRITE CAPACITY: Each Dynamo partition can handle a maximum provisioned write throughput of 1,000 write units per second.
    Since we're partitioning on user IDs, each user's data lives on a single partition, so when we're uploading data for a user, we cannot exceed 1,000 writes per second.

  2. USAGE COST: Since Dynamo only allows 27 downscaling events per day (4 at any time, plus 1 more after each hour with no downscaling event), we need to batch variantCall writes together to ensure that we end up with efficient usage of the table.

Here is an equation that describes how many partitions you'll have in Dynamo (from https://cloudonaut.io/dynamodb-pitfall-limited-throughput-due-to-hot-partitions/):

MAX( CEIL(Provisioned Read Throughput / 3,000 + Provisioned Write Throughput / 1,000), CEIL(Used Storage / 10 GB) )
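The same formula as a tiny helper, handy for sanity-checking a capacity plan (a sketch; the names are ours):

```typescript
// Estimated number of DynamoDB partitions for a table, per the formula above.
function partitionCount(
  readCapacityUnits: number,
  writeCapacityUnits: number,
  usedStorageGb: number
): number {
  const byThroughput = Math.ceil(readCapacityUnits / 3000 + writeCapacityUnits / 1000);
  const bySize = Math.ceil(usedStorageGb / 10);
  return Math.max(byThroughput, bySize);
}

// e.g. 3,000 WCU of provisioned write throughput alone forces 3 partitions:
// partitionCount(0, 3000, 5) === 3
```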

SOLUTION:

  1. Batch all writes to the VariantCall table

    • every user's writes are placed into a single queue
  2. Put the user ID in an SQS queue

    • e.g., a new user after their initial upload
  3. A process runs every 2 hours

    • scale up write capacity
    • write all data
    • scale down write capacity
    • TODO: what happens if the process takes longer than 2 hours to complete?

    Note: Throttle writes so that fewer than 1,000 writes per second hit DynamoDB for each user (see the sketch after this list).

    • in principle, we could parallelize writes across users by spinning up multiple Lambdas, up to N, where N = maximum write throughput / 1,000 (one process per user, each writing at up to 1,000 rows per second)
    • we don't have to parallelize for the first version
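A minimal sketch of the per-user throttle from the note above. Chunk size 25 is the BatchWriteItem limit, and the 1,000/sec ceiling is the per-partition write limit discussed earlier; the item shape and names are illustrative:

```typescript
import { DynamoDB } from "aws-sdk";

const documentClient = new DynamoDB.DocumentClient();
const MAX_WRITES_PER_SECOND = 1000; // per-partition (i.e. per-user) write limit
const BATCH_SIZE = 25;              // BatchWriteItem accepts at most 25 put/delete requests

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Write one user's items in batches of 25, pacing the loop so the user's
// partition never sees more than MAX_WRITES_PER_SECOND writes per second.
async function writeThrottled(tableName: string, items: object[]): Promise<void> {
  for (let i = 0; i < items.length; i += BATCH_SIZE) {
    const started = Date.now();
    const chunk = items.slice(i, i + BATCH_SIZE);
    const { UnprocessedItems } = await documentClient.batchWrite({
      RequestItems: {
        [tableName]: chunk.map((item) => ({ PutRequest: { Item: item } })),
      },
    }).promise();
    if (UnprocessedItems && Object.keys(UnprocessedItems).length > 0) {
      // Real code must retry unprocessed items with backoff; elided in this sketch.
    }
    // 25 items / 1,000 writes-per-second = at least 25 ms per batch.
    const minMillisPerBatch = (BATCH_SIZE / MAX_WRITES_PER_SECOND) * 1000;
    const elapsed = Date.now() - started;
    if (elapsed < minMillisPerBatch) await sleep(minMillisPerBatch - elapsed);
  }
}
```

Parallelizing across users would then mean running one such loop per user, up to N concurrent loops as described above.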
aneilbaboo commented 5 years ago

Or maybe we need AWS Data Pipeline: https://docs.amazonaws.cn/en_us/amazondynamodb/latest/developerguide/DynamoDBPipeline.html