Open maxheld83 opened 3 years ago
for GCS, this may be relevant: https://cloud.google.com/storage/docs/object-versioning
Maybe carry the generation number or the SHA of the ingest code that produced it?
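A rough sketch of what carrying that information could look like, using only the Python standard library. It writes a small sidecar JSON file next to each ingested file. The function names, the field names and the `.provenance.json` suffix are all made up for illustration, and the GCS generation is just passed in as a value rather than fetched from the bucket.

```python
import hashlib
import json
from pathlib import Path


def build_provenance(data_path, ingest_sha, generation=None):
    """Return a provenance record for one ingested file.

    ingest_sha: git commit SHA of the ingest code (supplied by the caller).
    generation: GCS object generation, if known (hypothetical optional field).
    """
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    return {
        "file": Path(data_path).name,
        "sha256": digest,               # content hash of the data itself
        "ingest_code_sha": ingest_sha,  # which code version produced it
        "gcs_generation": generation,   # which stored object version it is
    }


def write_sidecar(data_path, record):
    """Write the record as JSON next to the data file and return its path."""
    sidecar = Path(str(data_path) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

The commit SHA could come from `git rev-parse HEAD` at ingest time. Whether this lives in a sidecar file or in the object's own metadata is an open design choice.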
this is about data versioning in the context of ML, but our problem is essentially the same, even without ML (yet): https://emilygorcenski.com/post/data-versioning/
also this: https://towardsdatascience.com/comparing-data-version-control-tools-2020-c11ef1c80ea7
Seems like we may run into trouble with Git LFS, so maybe DVC.
this builds on #12.
There are two ingredients to get a grip on reproducibility here: