scimma / blast

Django web app for the automatic characterization of supernova hosts
MIT License

Design a better version control system for initial data #256

Closed manning-ncsa closed 1 month ago

manning-ncsa commented 1 month ago

Currently, the initial data files required to bootstrap the application are stored in an S3 bucket. We need a better system for version control of this file set. S3 keeps storage costs low but precludes the use of standard solutions like Git. The data set is large enough that making a full copy each time we change a few files is infeasible. The current manual, database-migration-style method is not a good long-term solution: the more files are changed or deleted, the more unnecessary overhead is incurred by downloading obsolete files that are subsequently overwritten or deleted entirely.

manning-ncsa commented 1 month ago

One idea that may meet our success criteria is to use the native bucket versioning supported by the MinIO client and the OpenStack Swift bucket we are currently using on Jetstream2. The example below shows how, once versioning is enabled, a subsequent copy command does not overwrite an object but instead adds another version at the same bucket path.

$ mc version info js-blast/blast-astro-data
js-blast/blast-astro-data is un-versioned

$ mc version info --json js-blast/blast-astro-data
{
 "Op": "info",
 "status": "success",
 "url": "js-blast/blast-astro-data",
 "versioning": {
  "status": "",
  "MFADelete": ""
 }
}

$ mc version enable js-blast/blast-astro-data
js-blast/blast-astro-data versioning is enabled

$ mc version info --json js-blast/blast-astro-data
{
 "Op": "info",
 "status": "success",
 "url": "js-blast/blast-astro-data",
 "versioning": {
  "status": "Enabled",
  "MFADelete": "Disabled"
 }
}

$ mc cp \
    sbi_training_sets/sbi_training_sets/hatp_x_y_global.pkl \
    js-blast/blast-astro-data/v1/data/sbi_training_sets/

$ mc ls --versions --json js-blast/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl
{
 "status": "success",
 "type": "file",
 "lastModified": "2024-09-11T14:44:00.809-05:00",
 "size": 65019321,
 "key": "hatp_x_y_global.pkl",
 "etag": "8abae565ffc324af5810113e6f37c9cd-4",
 "url": "https://js2.jetstream-cloud.org:8001/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl",
 "versionId": "XpAgf4jQ-0fY5wtG6cqSwpltWrHccar",
 "versionOrdinal": 2,
 "storageClass": "STANDARD"
}
{
 "status": "success",
 "type": "file",
 "lastModified": "2024-04-23T08:58:12.313-05:00",
 "size": 65019907,
 "key": "hatp_x_y_global.pkl",
 "etag": "145f0e563928e3979cd7b66b3f7fc1c9-4",
 "url": "https://js2.jetstream-cloud.org:8001/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl",
 "versionId": "null",
 "versionOrdinal": 1,
 "storageClass": "STANDARD"
}
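
As a rough sketch of how the versionId column could be populated, the Python below reads the output of mc ls --versions --json and picks the newest version record. The function names are hypothetical. It assumes, as in the listing above, that the records arrive as back-to-back JSON documents and that a larger versionOrdinal means a newer version.

```python
import json
from typing import Iterator, Optional


def parse_json_stream(text: str) -> Iterator[dict]:
    """Yield each JSON object from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            return
        record, pos = decoder.raw_decode(text, pos)
        yield record


def latest_version(listing_text: str) -> Optional[dict]:
    """Return the record with the highest versionOrdinal, or None if empty.

    Assumption: a higher versionOrdinal means a newer version, which is
    what the sample listing suggests.
    """
    records = list(parse_json_stream(listing_text))
    return max(records, key=lambda r: r["versionOrdinal"], default=None)
```

Fed the two records shown above, latest_version would return the entry whose versionId is XpAgf4jQ-0fY5wtG6cqSwpltWrHccar, which is the value that would go into the manifest.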

To the existing md5sum file manifest, we would add another column for the object versionId. With that manifest, instead of using the mc mirror command in the data init script, we would iterate through the list, comparing each file's md5sum against the etag checksum returned by the mc stat command:

$ mc stat --version-id=XpAgf4jQ-0fY5wtG6cqSwpltWrHccar --json \
    js-blast/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl | jq --raw-output '.etag'

8abae565ffc324af5810113e6f37c9cd-4

If the checksums do not match, the file is overwritten with the target version. In this way, we have robust version control of the initial data files.
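
The loop described above might look roughly like the following Python sketch. Several things here are assumptions rather than part of the proposal: the three-column manifest layout (`<md5sum> <path> <versionId>`, whitespace-separated, so paths containing spaces are not handled), all function names, and the use of mc cp with --version-id to fetch a specific version. The mc stat call mirrors the command shown above. One caveat: an etag ending in "-4" typically indicates a multipart upload on S3-compatible stores, in which case it is not a plain MD5 of the file contents, so the comparison below would report a mismatch for such objects on every run.

```python
import hashlib
import json
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List, NamedTuple

# Remote prefix taken from the examples above.
REMOTE_PREFIX = "js-blast/blast-astro-data/v1/data"


class ManifestEntry(NamedTuple):
    md5sum: str      # existing manifest column (not used in the comparison below)
    path: str        # path relative to the data root
    version_id: str  # proposed new column: bucket versionId to pin


def parse_manifest(text: str) -> List[ManifestEntry]:
    """Parse lines of the assumed form '<md5sum> <path> <versionId>'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        md5sum, path, version_id = line.split()
        entries.append(ManifestEntry(md5sum, path, version_id))
    return entries


def md5_of_file(path: Path) -> str:
    """Hex MD5 digest of a local file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mc_etag(entry: ManifestEntry) -> str:
    """Etag of the pinned version, via the mc stat call shown above."""
    result = subprocess.run(
        ["mc", "stat", f"--version-id={entry.version_id}", "--json",
         f"{REMOTE_PREFIX}/{entry.path}"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["etag"]


def mc_download(entry: ManifestEntry, dest: Path) -> None:
    """Fetch the pinned version (assumes mc cp accepts --version-id)."""
    subprocess.run(
        ["mc", "cp", f"--version-id={entry.version_id}",
         f"{REMOTE_PREFIX}/{entry.path}", str(dest)],
        check=True,
    )


def sync(
    entries: Iterable[ManifestEntry],
    data_root: Path,
    remote_etag: Callable[[ManifestEntry], str] = mc_etag,
    download: Callable[[ManifestEntry, Path], None] = mc_download,
) -> List[str]:
    """Fetch every entry whose local copy is missing or does not match.

    Returns the relative paths that were downloaded.
    """
    fetched = []
    for entry in entries:
        dest = data_root / entry.path
        if dest.is_file() and md5_of_file(dest) == remote_etag(entry):
            continue  # local copy already matches the pinned version
        dest.parent.mkdir(parents=True, exist_ok=True)
        download(entry, dest)
        fetched.append(entry.path)
    return fetched
```

Passing remote_etag and download as parameters keeps the comparison logic testable without a bucket. If multipart etags turn out to be a problem, comparing the local MD5 against the manifest's md5sum column instead is one possible fallback.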

There is an issue related to the removal of files: if a file is removed from the initial data set in a revision, and the presence of the original file could cause a problem, then we may need to include another (version-controlled) manifest declaring "files to remove if they exist".
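
A minimal sketch of how such a removal manifest could be applied, assuming a hypothetical format of one relative path per line, with blank lines and "#" comments ignored. The function name and the guard against paths that point outside the data root are illustrative additions, not part of the proposal.

```python
from pathlib import Path
from typing import Iterable, List


def remove_listed_files(data_root: Path, relative_paths: Iterable[str]) -> List[str]:
    """Delete each listed file under data_root if it exists.

    Returns the relative paths that were actually removed.
    """
    root = data_root.resolve()
    removed = []
    for rel in relative_paths:
        rel = rel.strip()
        if not rel or rel.startswith("#"):
            continue
        target = (root / rel).resolve()
        # Refuse entries that would escape the data root (e.g. '../x').
        if root not in target.parents:
            raise ValueError(f"path escapes data root: {rel}")
        if target.is_file():
            target.unlink()
            removed.append(rel)
    return removed
```

Files that are already absent are skipped silently, so the same removal manifest can be applied repeatedly without error.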