subugoe / leine

Data Pipelines for @subugoe/wag
https://subugoe.github.io/leine
MIT License

use DVC to version control big data expensive to wrangle/download #18

Open maxheld83 opened 2 years ago

maxheld83 commented 2 years ago

There are three sources of data diffs to be version controlled, each too big for git:

  1. Changes in large datasets (for example: ISSN/ISSN-L; already covered by #13). These changes can have a big impact on the reproducibility of downstream results. This is separate from a substantive interest in longitudinal data (for example: cr dumps), where the change over time may be interesting in and of itself. For ISSN/ISSN-L, we only care about the current mapping at any given point in time and have no interest in how it changed historically. For cr dumps, we may (e.g. for the development of HOAD) be interested at any one point in time in the changes up to that point.
    (Actually, cr dumps for any given month/year can also change after the fact, so that's a source of diffs, too 😐.)
  2. (git-diffed) changes in how we wrangle data. The resulting objects can be so expensive to compute that just recomputing them from the git history may be too costly; instead, we should version the resulting objects as well.
  3. Versioning of bq queries / tables, which is tracked in #12.

Source 1 may already be well covered by just storing GCS object hashes or similar. Source 2, if needed, should use DVC.
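To make the "just store a hash" option for source 1 concrete, here is a minimal sketch. It assumes the dataset is available as a local file; the function names, the JSON manifest layout, and the choice of SHA-256 are all hypothetical and not something leine currently does. The idea is that the small manifest lives in git while the large file stays out of it.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record(path: Path, manifest: Path) -> None:
    """Store the current hash of `path` in a small JSON manifest (hypothetical layout)."""
    entries = json.loads(manifest.read_text()) if manifest.exists() else {}
    entries[str(path)] = sha256_of(path)
    manifest.write_text(json.dumps(entries, indent=2, sort_keys=True) + "\n")


def has_changed(path: Path, manifest: Path) -> bool:
    """Return True if `path` is unrecorded or differs from its recorded hash."""
    entries = json.loads(manifest.read_text()) if manifest.exists() else {}
    return entries.get(str(path)) != sha256_of(path)
```

This only detects that a dataset changed; it does not keep the old version around, which is the part DVC would add.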

And if we use DVC, we might as well use it for 1, too.