Design a better version control system for initial data

One idea that may meet our success criteria is to use the native bucket versioning supported by both the MinIO client (mc) and the OpenStack Swift bucket we currently use on Jetstream2. The example below shows that, once versioning is enabled, a subsequent copy command does not overwrite an object but instead adds another version at the same bucket path.

$ mc version info js-blast/blast-astro-data
js-blast/blast-astro-data is un-versioned

$ mc version info --json js-blast/blast-astro-data
{
 "Op": "info",
 "status": "success",
 "url": "js-blast/blast-astro-data",
 "versioning": {
  "status": "",
  "MFADelete": ""
 }
}

$ mc version enable js-blast/blast-astro-data
js-blast/blast-astro-data versioning is enabled

$ mc version info --json js-blast/blast-astro-data
{
 "Op": "info",
 "status": "success",
 "url": "js-blast/blast-astro-data",
 "versioning": {
  "status": "Enabled",
  "MFADelete": "Disabled"
 }
}

$ mc cp \
    sbi_training_sets/sbi_training_sets/hatp_x_y_global.pkl \
    js-blast/blast-astro-data/v1/data/sbi_training_sets/

$ mc ls --versions --json js-blast/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl
{
 "status": "success",
 "type": "file",
 "lastModified": "2024-09-11T14:44:00.809-05:00",
 "size": 65019321,
 "key": "hatp_x_y_global.pkl",
 "etag": "8abae565ffc324af5810113e6f37c9cd-4",
 "url": "https://js2.jetstream-cloud.org:8001/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl",
 "versionId": "XpAgf4jQ-0fY5wtG6cqSwpltWrHccar",
 "versionOrdinal": 2,
 "storageClass": "STANDARD"
}
{
 "status": "success",
 "type": "file",
 "lastModified": "2024-04-23T08:58:12.313-05:00",
 "size": 65019907,
 "key": "hatp_x_y_global.pkl",
 "etag": "145f0e563928e3979cd7b66b3f7fc1c9-4",
 "url": "https://js2.jetstream-cloud.org:8001/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl",
 "versionId": "null",
 "versionOrdinal": 1,
 "storageClass": "STANDARD"
}
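
Filling in a versionId column means reading listings like the one above. Here is a minimal Python sketch, assuming the listing has been captured as text and consists of JSON objects written back to back. The function name parse_versions is hypothetical and not part of the existing init script. Note that the object uploaded before versioning was enabled reports the literal string "null" as its versionId, so the sketch keeps that value as-is.

```python
import json


def parse_versions(listing):
    """Map versionOrdinal -> versionId from `mc ls --versions --json` text."""
    decoder = json.JSONDecoder()
    versions = {}
    pos = 0
    while pos < len(listing):
        # Skip whitespace between back-to-back JSON objects.
        if listing[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended.
        record, pos = decoder.raw_decode(listing, pos)
        versions[record["versionOrdinal"]] = record["versionId"]
    return versions
```

The highest ordinal in the returned mapping corresponds to the most recent upload, which is the versionId we would most likely record in the manifest.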

We would add a column for the object versionId to the existing md5sum file manifest. With that manifest, instead of using the mc mirror command in the data init script, we would iterate through the list, comparing each file's md5sum against the etag checksum returned by the mc stat command:

$ mc stat --version-id=XpAgf4jQ-0fY5wtG6cqSwpltWrHccar --json \
    js-blast/blast-astro-data/v1/data/sbi_training_sets/hatp_x_y_global.pkl | jq --raw-output '.etag'

8abae565ffc324af5810113e6f37c9cd-4

If the checksums do not match, the file is overwritten with the target version. This gives us robust version control of the initial data files.
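
The loop described above could look roughly like the Python sketch below. It rests on several assumptions: the manifest layout (whitespace-separated md5sum, versionId, path), all function names, and the reading that "each file's md5sum" means the MD5 of the local copy are illustrative and not taken from the current init script. The mc cp --version-id flag should also be checked against the installed mc release. One caveat worth verifying first: the etag shown above ends in -4, which on S3-compatible stores usually indicates a multipart upload, and such etags are generally not the plain MD5 of the file contents. The checksum functions are therefore passed in as parameters so they can be swapped.

```python
import hashlib
import json
import os
import subprocess


def parse_manifest(text):
    """Parse lines of '<md5sum> <versionId> <path>' (hypothetical layout)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        md5sum, version_id, path = line.split(None, 2)
        entries.append({"md5sum": md5sum, "version_id": version_id, "path": path})
    return entries


def local_md5(path):
    """Hex MD5 of a local file, or None if the file does not exist."""
    digest = hashlib.md5()
    try:
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    except FileNotFoundError:
        return None
    return digest.hexdigest()


def remote_etag(remote_path, version_id):
    """Ask `mc stat` for the etag of one pinned object version."""
    result = subprocess.run(
        ["mc", "stat", f"--version-id={version_id}", "--json", remote_path],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["etag"]


def plan_sync(entries, local_root, remote_root, etag_of=remote_etag, md5_of=local_md5):
    """Return (remote, version_id, local) for files that are missing or differ."""
    stale = []
    for entry in entries:
        local = os.path.join(local_root, entry["path"])
        remote = f"{remote_root.rstrip('/')}/{entry['path']}"
        if md5_of(local) != etag_of(remote, entry["version_id"]):
            stale.append((remote, entry["version_id"], local))
    return stale


def fetch(remote, version_id, local):
    """Overwrite the local file with the pinned version (flag is an assumption)."""
    subprocess.run(["mc", "cp", f"--version-id={version_id}", remote, local], check=True)
```

The init script would call plan_sync with the local data directory and a remote prefix such as js-blast/blast-astro-data/v1/data, then call fetch for each returned tuple. The manifest's md5sum column is parsed but not otherwise used in this sketch.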

There is an open issue with file removal: if a revision removes a file from the initial data set, and the lingering presence of that file could cause a problem, then we may need to include another (version-controlled) manifest declaring "files to remove if they exist".
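
A removal manifest could be handled with something like the Python sketch below. It assumes a plain-text list with one path per line, relative to the local data directory. The format and the function name remove_listed_files are hypothetical.

```python
from pathlib import Path


def remove_listed_files(data_root, removal_manifest):
    """Delete each listed file if it exists; return the entries removed."""
    root = Path(data_root).resolve()
    removed = []
    for line in removal_manifest.splitlines():
        relative = line.strip()
        if not relative or relative.startswith("#"):
            continue
        target = (root / relative).resolve()
        # Refuse entries that would escape the data directory.
        if root not in target.parents:
            raise ValueError(f"path escapes data root: {relative}")
        if target.is_file():
            target.unlink()
            removed.append(relative)
    return removed
```

Entries that are already absent are skipped silently, so the step is safe to rerun on every init.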
