seasketch / geoprocessing

Serverless geoprocessing system
https://seasketch.github.io/geoprocessing
BSD 3-Clause "New" or "Revised" License

Refactor clear commands #320

Closed twelch closed 1 month ago

twelch commented 2 months ago

The gp `clear` commands are for clearing geoprocessing function results cached in the tasks DynamoDB table.

The problem is that it's very slow, often crashes with an out-of-memory error, and, at least for the `clearResults` command, doesn't always clear all of the records.

In aws-sdk v3, to increase performance, DynamoDB's `BatchWriteItem` supports delete requests. We should be able to batch delete up to 25 items at a time (the per-request maximum).

Suggest unit testing this using the local DynamoDB service.

https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/dynamodb/command/BatchWriteItemCommand/ https://stackoverflow.com/a/9159431/4159809
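A minimal sketch of the batching side of this. The `chunk` and `buildDeleteBatch` helpers and the `id`/`service` key attribute names are assumptions for illustration, not the actual gp schema; each built batch would be sent with `client.send(new BatchWriteItemCommand(batch))` from `@aws-sdk/client-dynamodb`, retrying any `UnprocessedItems` the response returns:

```typescript
// Hypothetical task key shape; attribute names are assumptions.
interface TaskKey {
  id: string;
  service: string;
}

// Split an array into groups of `size` (25 = BatchWriteItem's per-call max).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Build the RequestItems payload for one BatchWriteItemCommand call.
function buildDeleteBatch(tableName: string, keys: TaskKey[]) {
  return {
    RequestItems: {
      [tableName]: keys.map((k) => ({
        DeleteRequest: {
          Key: { id: { S: k.id }, service: { S: k.service } },
        },
      })),
    },
  };
}

// 60 keys -> 3 batches of 25 + 25 + 10.
const keys: TaskKey[] = Array.from({ length: 60 }, (_, i) => ({
  id: `task-${i}`,
  service: "mySeaSketchFunction",
}));
const batches = chunk(keys, 25).map((group) => buildDeleteBatch("tasks", group));
console.log(batches.length); // 3
```

Since the batch building is pure, it's easy to cover in a unit test against local DynamoDB separately from the network calls.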

twelch commented 2 months ago

Might also be able to use partiql batch commands - https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/javascriptv3/example_code/dynamodb/actions/partiql/partiql-batch-delete.js
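A sketch of what building the PartiQL variant might look like. `buildPartiqlDeleteBatch` and the `id`/`service` attribute names are assumptions; the resulting object would be sent via `BatchExecuteStatementCommand` from `@aws-sdk/client-dynamodb`, and like `BatchWriteItem` it is limited to 25 statements per request:

```typescript
// Build a Statements array of parameterized PartiQL DELETEs.
// Key attribute names are assumptions for illustration.
function buildPartiqlDeleteBatch(
  tableName: string,
  keys: { id: string; service: string }[]
) {
  return {
    Statements: keys.map((k) => ({
      Statement: `DELETE FROM "${tableName}" WHERE id = ? AND service = ?`,
      Parameters: [{ S: k.id }, { S: k.service }],
    })),
  };
}

const req = buildPartiqlDeleteBatch("tasks", [
  { id: "task-1", service: "mySeaSketchFunction" },
]);
console.log(req.Statements[0].Statement);
```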

twelch commented 2 months ago

Another option might be to re-synth the stack without the table and re-deploy (to delete the table), then re-synth the stack again with the table back (to create it again).

There may also be resources relying on or connected to this table, so CloudFormation wouldn't easily let you just delete it. Manually deleting it would create drift, which may cause an error.

twelch commented 2 months ago

Scan, Query, BatchGetItem. Which to use?

It looks like a scan is required to get all of the items in a table.

A query is faster than a scan, but it requires you to provide a hash key (aka partition key). We don't store multiple items under a single hash key, so this doesn't help us for the use case of deleting all tasks, or deleting all tasks given a service name (which is used as range key, not partition key).

The paginator wrapper is a neat way of querying with paging, but it only works if you have a single partition key.

twelch commented 2 months ago

Since we could have 1,000, even 10,000, task results in a table for a given project, and each db item probably averages 50-100KB, the table could be up to 1GB in size.

Another option to just "delete" all items would be to set a unique prefix on table names at creation (see GeoprocessingStack.ts): `` tableName: `gp-${stack.props.projectName}-tasks` ``

The table prefix could be stored as a CfnOutput value, which can then be read at deploy time using the describe-stacks command, which outputs JSON.

The prefix value can be re-used, unless user indicates at deploy time that they would like to regenerate the tables, clearing all values.

`aws cloudformation describe-stacks --stack-name gp-california-reports-tim --region us-west-1`

The aws-sdk version of this is here - https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/client/cloudformation/command/DescribeStacksCommand/
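A sketch of reading the prefix back out of a DescribeStacks response. The output key `TasksTablePrefix` is a hypothetical name for illustration; the same response shape comes from the CLI (`--output json`) or from `DescribeStacksCommand` in `@aws-sdk/client-cloudformation`:

```typescript
// Minimal shape of the parts of a DescribeStacks response we need.
interface StackOutput {
  OutputKey?: string;
  OutputValue?: string;
}
interface DescribeStacksResult {
  Stacks?: { StackName: string; Outputs?: StackOutput[] }[];
}

// Find a CfnOutput value by key in the first returned stack.
function getStackOutput(
  result: DescribeStacksResult,
  key: string
): string | undefined {
  const outputs = result.Stacks?.[0]?.Outputs ?? [];
  return outputs.find((o) => o.OutputKey === key)?.OutputValue;
}

// Abbreviated sample response; "TasksTablePrefix" is a hypothetical key.
const sample: DescribeStacksResult = {
  Stacks: [
    {
      StackName: "gp-california-reports-tim",
      Outputs: [
        { OutputKey: "TasksTablePrefix", OutputValue: "gp-california-reports-tim" },
      ],
    },
  ],
};
console.log(getStackOutput(sample, "TasksTablePrefix"));
```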

twelch commented 2 months ago

BatchWriteItem info

twelch commented 2 months ago

`paginate*` functions are now offered by the AWS SDK for handling pagination of queries, e.g. `paginateScan`.

A 3rd-party version with parallelization is available - https://github.com/shelfio/dynamodb-parallel-scan
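The `paginate*` helpers return an async iterable of pages. A sketch of the consumption pattern, using a mock in place of the real `paginateScan` from `@aws-sdk/client-dynamodb` (which takes `({ client }, { TableName })`):

```typescript
// Minimal page shape for a Scan of the tasks table.
interface ScanPage {
  Items: { id: { S: string } }[];
  LastEvaluatedKey?: Record<string, unknown>;
}

// Mock async generator standing in for paginateScan: yields two pages.
async function* mockPaginateScan(): AsyncGenerator<ScanPage> {
  yield {
    Items: [{ id: { S: "task-1" } }, { id: { S: "task-2" } }],
    LastEvaluatedKey: {},
  };
  yield { Items: [{ id: { S: "task-3" } }] };
}

// Walk every page with for-await and collect all item ids.
async function collectIds(): Promise<string[]> {
  const ids: string[] = [];
  for await (const page of mockPaginateScan()) {
    for (const item of page.Items) ids.push(item.id.S);
  }
  return ids;
}

collectIds().then((ids) => console.log(ids));
```

The same loop body is where each page's keys would be chunked and fed to a batch delete.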