gcv opened this issue 5 years ago (status: open)
Proposed solution:
We need to check that Dynamo alerts us when write throughput is set to high; otherwise it can get awfully expensive if the process fails to set the write throughput back to low.
Copying solution proposed in https://github.com/precisely/web/issues/363: Doing blind Lambda-based writes to Dynamo is unsustainable for larger numbers of users and bases referenced in reports. We need to transition to performing all Dynamo writes, throttling, and throughput scaling by means of a queue.
This can probably be done without breaking existing code. Lambdas responsible for Dynamo writes will instead enqueue the needed operations. Another process will take care of dequeuing and performing the actual writes, and can also handle Dynamo scaling. When the queue grows to the point that the process can no longer keep up without forcing a partition split, it can alert us so we can figure out what to do.
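A minimal sketch of the enqueue side of this idea, assuming SQS and a JSON message shape (the message fields and helper names below are made up, not existing code):

```python
import json

def build_write_message(user_id, items):
    """Describe a pending VariantCall write as an SQS message body."""
    return json.dumps({"table": "VariantCall", "userId": user_id, "items": items})

def enqueue_write(queue_url, user_id, items):
    # boto3 imported lazily so build_write_message stays testable without AWS
    import boto3
    boto3.client("sqs").send_message(
        QueueUrl=queue_url,
        MessageBody=build_write_message(user_id, items),
    )
```

The dequeue process would poll this queue, perform the writes in batches, and own all throughput changes.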
WORK IN PROGRESS
There are two separate issues:
WRITE CAPACITY: Each Dynamo partition can handle a maximum provisioned throughput of 1,000 write units per second. Since we're partitioning on user IDs, each user's data lives on a single partition, so when we're uploading data for a user we cannot exceed 1,000 writes per second.
USAGE COST: Dynamo only allows 27 downscaling events per day (4 at any time, plus 1 more whenever an hour has passed since the last downscaling event), so we need to batch variantCall writes together to ensure that we end up with efficient usage of the table.
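Batched writes also have to respect BatchWriteItem's 25-item-per-call limit. A sketch of the chunking (the table name is a guess at ours; `batch_writer` is boto3's helper that transparently handles the 25-item limit and retries unprocessed items):

```python
def chunk(items, size=25):
    """Split pending writes into BatchWriteItem-sized groups."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def batch_write(items, table_name="VariantCall"):
    # boto3 imported lazily so chunk() stays testable without AWS
    import boto3
    table = boto3.resource("dynamodb").Table(table_name)
    with table.batch_writer() as writer:  # chunks and retries internally
        for item in items:
            writer.put_item(Item=item)
```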
Here is an equation that describes how many partitions you'll have in Dynamo (from https://cloudonaut.io/dynamodb-pitfall-limited-throughput-due-to-hot-partitions/ ):
MAX( (Provisioned Read Throughput / 3,000), (Provisioned Write Throughput / 1,000), (Used Storage / 10 GB))
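As a sanity check, the formula above in Python (my reading of the cloudonaut article; AWS doesn't expose actual partition counts):

```python
import math

def partition_count(read_throughput, write_throughput, storage_gb):
    """Estimated number of DynamoDB partitions for a table."""
    return math.ceil(max(
        read_throughput / 3000.0,   # 3,000 read units per partition
        write_throughput / 1000.0,  # 1,000 write units per partition
        storage_gb / 10.0,          # 10 GB per partition
    ))
```

So provisioning 2,000 write units splits the table across 2 partitions, and each user's writes still cap out at 1,000/sec on their single partition.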
SOLUTION:
- Batch all writes to the VariantCall table
- Put the user ID in an SQS queue
- A process runs every 2 hours to work through the queue
- Note: throttle writes so that fewer than 1,000 writes per second hit DynamoDB for each user
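The per-user throttle in that last step could look something like this (a sketch; the 900/sec budget and the `write_batch` callable are placeholders, the point is the pacing):

```python
import time

def write_throttled(batches, write_batch, max_per_sec=900):
    """Pace batches so one user's partition sees < 1,000 writes/sec.

    write_batch is whatever callable actually performs the Dynamo writes.
    """
    for batch in batches:
        start = time.monotonic()
        write_batch(batch)
        # stretch each batch out so its writes fit the per-second budget
        min_duration = len(batch) / max_per_sec
        elapsed = time.monotonic() - start
        if elapsed < min_duration:
            time.sleep(min_duration - elapsed)
```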
Or maybe we need AWS Data Pipeline: https://docs.amazonaws.cn/en_us/amazondynamodb/latest/developerguide/DynamoDBPipeline.html
We need to deal with the spiky loads we get when a new report adds a slew of new gene requirements, causing potentially very large imports to occur (adding 60 new genes means 60×N writes, where N is the number of users in the system!).
In addition to increasing table throughput, we may need to batch update Lambda invocations. The places where this needs to happen are marked with `TODO: Split into pieces before calling?` in bioinformatics.
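One way that splitting could work, sketched as a fan-out over per-user gene chunks (the function name, payload shape, and chunk size are all hypothetical):

```python
import json

def split_import(user_ids, genes, per_invocation=5):
    """Yield one Lambda payload per (user, gene-chunk) pair."""
    for user_id in user_ids:
        for i in range(0, len(genes), per_invocation):
            yield {"userId": user_id, "genes": genes[i:i + per_invocation]}

def invoke_all(function_name, user_ids, genes):
    # boto3 imported lazily so split_import stays testable without AWS
    import boto3
    lam = boto3.client("lambda")
    for payload in split_import(user_ids, genes):
        lam.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # async, so no caller waits on all N users
            Payload=json.dumps(payload),
        )
```

Adding 60 genes for 1,000 users then becomes 12,000 small invocations (60/5 chunks × 1,000 users) instead of one giant import.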
Dynamo auto-scaling may not react quickly enough. According to this article, Dynamo auto-scaling is implemented as a CloudWatch alarm, which can take up to 15 minutes to react. That won't work for us, since we have to contend with short Lambda timeouts on Dynamo writes (300-second maximum).
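If auto-scaling is too slow, the dequeue process could adjust provisioned throughput itself via UpdateTable before and after a large batch. A sketch, with the hour-guard reflecting the downscaling limit above (the helper names and the read-capacity default are made up):

```python
import time

def downscale_allowed(last_downscale_ts, free_decreases_left, now=None):
    """True if a throughput decrease is permitted under Dynamo's daily limits."""
    now = time.time() if now is None else now
    return free_decreases_left > 0 or (now - last_downscale_ts) >= 3600

def set_write_capacity(table_name, wcu, rcu=100):
    # boto3 imported lazily so downscale_allowed stays testable without AWS
    import boto3
    boto3.client("dynamodb").update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": rcu,
            "WriteCapacityUnits": wcu,
        },
    )
```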
A Serverless plugin exists for doing some of this, but it may not create a sufficiently aggressive CloudWatch scaling alarm.
Another article which covers dynamo scaling: https://medium.com/rue-la-la-tech/how-rue-la-la-bulk-loads-into-dynamodb-ad1469af578e