lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

RFC: S3 concurrent writer via dynamoDB #1183

Open chebbyChefNEQ opened 12 months ago

chebbyChefNEQ commented 12 months ago

Context

S3 does not support transactional operations like `*_if_not_exist`. We would like to support transactional operations on S3 for concurrent writers.

Requirements

Why DynamoDB?

Out of the three major cloud providers, AWS (S3), GCP (GCS), and Azure (Blob Storage), S3 is the only one that does not provide transactional semantics in its APIs. This means it is acceptable to use an AWS-native solution, as we do not need to port this locking mechanism to other cloud providers.

(minio, http, et al. will probably use redis/sql commit mechanism)

Interface

Since this is an S3-only feature, it doesn't make much sense to add more params to ObjectStoreParams or [Read|Write]Params.

The proposed solution is to pass the DynamoDB parameters as part of the S3 URI, with the following syntax:

s3+ddb://<bucket>/<path>?ddbTableName=<ddb_table_name>
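To make the syntax concrete, here is a small sketch of how such a URI could be decomposed with Python's standard library. The helper name `parse_s3_ddb_uri` is hypothetical, not part of the actual Lance API:

```python
from urllib.parse import urlparse, parse_qs

def parse_s3_ddb_uri(uri: str) -> tuple[str, str, str]:
    """Split an s3+ddb:// URI into (bucket, path, ddb_table_name)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3+ddb":
        raise ValueError(f"expected s3+ddb scheme, got {parsed.scheme!r}")
    query = parse_qs(parsed.query)
    tables = query.get("ddbTableName")
    if not tables:
        raise ValueError("missing required ddbTableName query parameter")
    return parsed.netloc, parsed.path.lstrip("/"), tables[0]

bucket, path, table = parse_s3_ddb_uri(
    "s3+ddb://my-bucket/datasets/vectors.lance?ddbTableName=lance-commits"
)
```

Everything before the `?` is interpreted exactly like a normal `s3://` location; only the `ddbTableName` query parameter is consumed by the commit mechanism.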

Problems with dynamodb-lock-rs

The library does not support unexpirable locks. This means the library can only support a synchronous fail-stop model (the lock lease duration effectively converts async consensus into sync consensus, with time as the epoch boundaries).

This can cause data corruption under the pause or network-partition failure models (the lock can expire while the manifest commit is still in progress).

Problem with unexpirable locks

They can crash-deadlock, and such deadlocks are only recoverable by manual intervention.

Crux of the problem

Having two sources of truth makes consensus with a strong consistency requirement impossible. (S3 is the source of truth for the manifest version, while DynamoDB is the source of truth for which writer's manifest is valid.)

Proposed solution

external_store_commit

When committing to S3, we can use the following procedure

For procedure commit(manifest: Manifest) do

  1. Write the manifest we want to commit to S3 under a temporary name, e.g. dataset.lance/_versions/staging/152a8c85-e4e7-4a1f-96d1-1cef1c6dbd9d.manifest
  2. Try to commit this version to DynamoDB with an attribute_not_exists(PK) AND attribute_not_exists(SK) condition, where the committed item would look like
     {
       pk: {table_uri},       <-- String
       sk: {version},         <-- int
       committer: <hostname>, <-- String
       uri: <temporary_uri>   <-- String
     }
  3. If the DynamoDB commit fails, return an error
  4. The writer proceeds to copy the manifest from the temporary path to dataset.lance/_versions/{version}.manifest
  5. Update DynamoDB with dataset.lance/_versions/{version}.manifest
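The steps above can be sketched in Python with in-memory stand-ins for the two stores (the `FakeDynamo` class and plain-dict "S3" below are hypothetical test doubles; the real implementation would use the AWS SDK and DynamoDB's ConditionExpression):

```python
import uuid

class FakeDynamo:
    """In-memory stand-in for DynamoDB with a conditional put on (pk, sk)."""
    def __init__(self):
        self.items: dict[tuple, dict] = {}

    def put_if_not_exists(self, pk, sk, attrs) -> bool:
        # Mirrors: attribute_not_exists(PK) AND attribute_not_exists(SK)
        if (pk, sk) in self.items:
            return False
        self.items[(pk, sk)] = dict(attrs)
        return True

def commit(s3: dict, ddb: FakeDynamo, table_uri: str,
           version: int, manifest: bytes, host: str = "writer-1") -> str:
    # Step 1: stage the manifest under a temporary name.
    staging = f"{table_uri}/_versions/staging/{uuid.uuid4()}.manifest"
    s3[staging] = manifest
    # Steps 2-3: atomically claim the version number; fail if already taken.
    if not ddb.put_if_not_exists(table_uri, version,
                                 {"committer": host, "uri": staging}):
        raise RuntimeError(f"version {version} already committed")
    # Step 4: copy the manifest to its final, well-known path.
    final = f"{table_uri}/_versions/{version}.manifest"
    s3[final] = s3[staging]
    # Step 5: point DynamoDB at the final path.
    ddb.items[(table_uri, version)]["uri"] = final
    return final
```

Two writers racing on the same version number both succeed at step 1, but only one wins the conditional put in step 2; the loser gets an error and can retry with the next version.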

What if step 4 fails?

To address this problem, we should always load the manifest from DynamoDB. When loading the manifest, we can use the following procedure:

For procedure checkout_latest(uri: &str) do

  1. The loader reaches out to DynamoDB first and finds the latest version V of the dataset.
  2. Ensure steps 4 and 5 of the commit procedure have completed; if not, retry steps 4 and 5.

This checkout procedure can guarantee that

  1. When DynamoDB and S3 go out of sync, S3 is at most one version behind
  2. All operations are blocked until the two stores are in sync again

Caveats

This solution changes the source of truth for manifests from s3 to dynamodb. This means

Execution Plan

chebbyChefNEQ commented 12 months ago

cc: @wjones127 @westonpace @eddyxu

wjones127 commented 12 months ago

This looks good. I wish we could didn't have to modify the read path, but there are some edge cases that make it necessary.

mapleFU commented 11 months ago

This sounds interesting!

I'm not sure I fully understand it. Seems DynamoDB is source of truth in system, however, the commit would regard as "success" and can be seen by reader when finally data is "post committed" in s3? And this will not change the process of reader, just make writer need to handle more roundtrip of write?

chebbyChefNEQ commented 11 months ago

@mapleFU Yeah roughly. This is making DynamoDB the source of truth and S3 a replica, but with some constraints on how out-of-sync the two stores can be. On the read path, the reader will try to sync the two stores and refuse to load if the two stores can not be sync'd. As long as the two stores are in sync, users can freely offboard dynamodb.

See #1190 for the commit logic

sandias42 commented 2 months ago

Are there any rough back-of-the-envelopes on how this will work out in terms of cost compared to S3 alone or other workarounds? Another workaround is to switch to AWS EFS, but for analytical workloads reading/ writing the whole dataset frequently the read and write elastic bandwidth charge is horrific, like >50X the cost of S3 so far in my hands (the storage cost is ~7X but this is insignificant compared to the above).

If I understand correctly the proposed fix only stores a subset of what I'd normally see in my lance s3 directory (the manifest) on DynamoDB, rather than a full copy of the data? If anyone wanted to provide guesses on what fraction of the 1) GB storage and 2) Read / Write requests would route to DynamoDB compared to a just-S3 setup I'm happy to share my napkin math on total relative cost as I need this for figuring out viability of lancedb at my company.

sandias42 commented 2 months ago

By the way, if there are a lot of read/write requests going to DynamoDB, but a low total Read/ Write throughput in cumulative GB read/written and medium-ish total GB stored, then a commit mechanism in EFS with S3 as main storage might be another cost effective option. As far as I know, EFS does not charge by the request (although there may be annoying IOPS limits, not sure) and the storage cost is not so bad, so as long as the volume of data transferred to/from EFS is low it could maybe allow more read/ write request per dollar.

chebbyChefNEQ commented 2 months ago

hi @sandias42, with dynamo manifest store, the expected amount of IOs are as follows Read:

Commit:

the total IOPS is num_commit/s * ^^^

w.r.t. to the read write size, we only commit a path like _version/3.manifest-<tmp-id> value to dynamodb. so the throughput util should be very low.