ropensci / gittargets

Data version control for reproducible analysis pipelines in R with {targets}.
https://docs.ropensci.org/gittargets/

Pure AWS S3 backend #5

Closed wlandau closed 1 year ago

wlandau commented 2 years ago

Proposal

Similar to #2, but directly implemented on top of AWS S3 through something like aws.s3, paws, or botor. Use the historical versioning and tagging capabilities of buckets.

wlandau commented 2 years ago

Probably precedes #2. May actually want to look at DVC first. It may already do a lot of the stuff I mention below.

wlandau commented 2 years ago

Setback: an S3 object can have at most 10 tags. That is a problem if a target is part of more than 10 snapshots, which is likely to come up in almost all projects.

wlandau commented 2 years ago

Another idea: the metadata already has hashes, which is half the battle for a key-value store.

Snapshot

  1. Commit _targets/meta/meta to a local git repo. Do not commit _targets/objects.
  2. For each target in _targets/meta/meta, upload the file in _targets/objects to an S3 bucket. In the bucket, the object name should be the hash recorded in _targets/meta/meta. If the object already exists in the bucket, skip the upload.
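
A rough sketch of step 2, written in Python only so it runs without AWS credentials. Everything here is an assumption for illustration: `meta` is a hypothetical dict from target name to recorded hash, a plain dict stands in for the bucket, and SHA-256 stands in for whatever hash the metadata actually records. A real backend would swap the dict for calls to an S3 client.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_hash(path: Path) -> str:
    """Stand-in content hash (SHA-256). The real metadata has its own hashes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_objects(
    meta: Dict[str, str], objects_dir: Path, bucket: Dict[str, bytes]
) -> List[str]:
    """Store each target's file under its recorded hash, skipping known keys.

    `bucket` is a dict standing in for an S3 bucket (hypothetical).
    Returns the names of the targets that were actually uploaded.
    """
    uploaded = []
    for name, recorded_hash in meta.items():
        if recorded_hash in bucket:
            continue  # object already exists in the bucket: skip the upload
        # The sketch trusts the metadata: the key is the recorded hash.
        bucket[recorded_hash] = (objects_dir / name).read_bytes()
        uploaded.append(name)
    return uploaded
```

Because the key is the content hash, a target that is unchanged across many snapshots is stored once, which sidesteps the 10-tag limit mentioned earlier.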

Checkout

  1. Check out the metadata file.
  2. For each target in the metadata, if the hash in _targets/meta/meta disagrees with the actual hash of the file in _targets/objects, look up the recorded hash in the bucket and, if it is found, download that object to _targets/objects/.

Hopefully (2) will be possible without cloning a bunch of infrastructure from targets.
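
Checkout step 2 might look roughly like the following, under the same illustrative assumptions as before: a hypothetical `meta` dict of target name to recorded hash, a plain dict standing in for the bucket, and SHA-256 standing in for the real hash. It only needs a hash function and a key lookup, which suggests little of targets itself would have to be reproduced.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def file_hash(path: Path) -> str:
    """Stand-in content hash (SHA-256). The real metadata has its own hashes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def checkout_objects(
    meta: Dict[str, str], objects_dir: Path, bucket: Dict[str, bytes]
) -> Tuple[List[str], List[str]]:
    """Restore files whose local hash disagrees with the checked-out metadata.

    `bucket` is a dict standing in for an S3 bucket (hypothetical).
    Returns (restored, missing): targets downloaded, and targets whose
    recorded hash could not be found in the bucket.
    """
    objects_dir.mkdir(parents=True, exist_ok=True)
    restored, missing = [], []
    for name, recorded_hash in meta.items():
        local = objects_dir / name
        if local.exists() and file_hash(local) == recorded_hash:
            continue  # local file already matches the metadata
        if recorded_hash in bucket:
            local.write_bytes(bucket[recorded_hash])  # "download"
            restored.append(name)
        else:
            missing.append(name)  # never snapshotted: leave the file alone
    return restored, missing
```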

Status

Git status of _targets/meta/meta, plus a comparison of the hashes recorded in _targets/meta/meta against the files in _targets/objects and the objects in the bucket.
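
The hash-comparison half of the status check could be sketched as below; the Git half is left out. As in the other sketches, `meta`, the set of bucket keys, and the SHA-256 hash are hypothetical stand-ins, not the actual data structures.

```python
import hashlib
from pathlib import Path
from typing import Dict, Set


def file_hash(path: Path) -> str:
    """Stand-in content hash (SHA-256). The real metadata has its own hashes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot_status(
    meta: Dict[str, str], objects_dir: Path, bucket_keys: Set[str]
) -> Dict[str, Dict[str, bool]]:
    """Report, per target, whether the local file and the bucket agree with meta.

    `bucket_keys` is the set of object names in the bucket (hypothetical input;
    a real backend would list or probe the bucket instead).
    """
    status = {}
    for name, recorded_hash in meta.items():
        local = objects_dir / name
        status[name] = {
            # True when the file exists and hashes to the recorded value
            "local_matches": local.exists() and file_hash(local) == recorded_hash,
            # True when an object named after the recorded hash is in the bucket
            "in_bucket": recorded_hash in bucket_keys,
        }
    return status
```

A target with `local_matches` true and `in_bucket` false has not been snapshotted yet; the reverse means a checkout could restore it.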

wlandau commented 2 years ago

Closing in favor of https://github.com/ropensci/targets/issues/711

wlandau commented 2 years ago

Reopening. Relative to native AWS versioning in targets, an AWS gittargets backend would allow less frequent uploads and allow users to opt in later in the project’s life cycle.

wlandau commented 1 year ago

On reflection: if you're already using AWS S3, then https://books.ropensci.org/targets/cloud-storage.html is way better.