subugoe / leine

Data Pipelines for @subugoe/wag
https://subugoe.github.io/leine
MIT License

document raw data version control best practice #13

Open maxheld83 opened 3 years ago

maxheld83 commented 3 years ago

this builds on #12.

There are two ingredients to get a grip on reproducibility here:

  1. the raw data (that's here)
  2. the queries on that data (that's #12).
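A minimal sketch of how the two ingredients could be tied together, assuming a derived result is fully determined by the raw bytes and the query text. The function names and the sample inputs are hypothetical and not part of leine; this only illustrates recording both hashes next to a result.

```python
import hashlib
import json


def sha256_hex(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def result_fingerprint(raw_data: bytes, query: str) -> dict:
    """Describe a derived result by the two inputs that determine it."""
    return {
        "raw_data_sha256": sha256_hex(raw_data),
        "query_sha256": sha256_hex(query.encode("utf-8")),
    }


# Hypothetical inputs, for illustration only.
raw = b"doi,year\n10.1000/xyz,2020\n"
query = "SELECT year, COUNT(*) FROM raw GROUP BY year"
print(json.dumps(result_fingerprint(raw, query), indent=2))
```

If either the raw data or the query changes, the corresponding hash changes, so a stored result can be checked against the inputs it claims to come from.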

maxheld83 commented 3 years ago

For GCS, this may be relevant: https://cloud.google.com/storage/docs/object-versioning Maybe carry the generation number, or the SHA of the ingest code that produced it?
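One reading of this idea, as a rough sketch: keep a small record per raw object that carries both the GCS generation number and the git SHA of the ingest code. The class, the `ingest_sha` metadata key, and the bucket, object, and SHA values below are all made up for illustration. The only GCS-specific assumption is the `gs://bucket/object#generation` form that gsutil uses for version-specific URIs. The sketch does not call the GCS API.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RawDataPin:
    """Pins one raw object to an exact version and to the code that wrote it."""

    bucket: str
    name: str
    generation: int  # GCS generation number of the object version
    ingest_sha: str  # git commit SHA of the ingest code (assumed convention)

    def versioned_uri(self) -> str:
        """Version-specific URI in gsutil's `#generation` form."""
        return f"gs://{self.bucket}/{self.name}#{self.generation}"

    def custom_metadata(self) -> dict:
        """Key/value pairs that could be stored as custom object metadata."""
        return {"ingest_sha": self.ingest_sha}


# Hypothetical values, for illustration only.
pin = RawDataPin(
    bucket="example-raw-data",
    name="ingest/dump.csv",
    generation=1360887697105000,
    ingest_sha="0123abc",
)
print(pin.versioned_uri())
print(asdict(pin))
```

Storing the SHA as object metadata keeps it next to the data, while the generation number lets a query refer to exactly one version of the object.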

maxheld83 commented 3 years ago

this is about data versioning in the context of ML, but our problem is essentially the same, even without ML (yet): https://emilygorcenski.com/post/data-versioning/

maxheld83 commented 3 years ago

also this: https://towardsdatascience.com/comparing-data-version-control-tools-2020-c11ef1c80ea7

Seems like we may run into trouble with Git LFS, so maybe DVC instead.