swcarpentry / good-enough-practices-in-scientific-computing

Minimalist alternatives to "best practices" paper
https://swcarpentry.github.io/good-enough-practices-in-scientific-computing/

git-lfs for data #156

Closed. jennybc closed this issue 7 years ago.

jennybc commented 8 years ago

Recording/moving a Twitter conversation here without the 140-character limit.

From Bjørn Fjukstad (github @fjukstad, twitter @fjukstad) via Twitter:

great read, but why not git-lfs for version control of datasets?

I replied with this link: GitHub’s Large File Storage is no panacea for Open Source — quite the opposite.

Bjørn pointed out that article is about storing large files on GitHub.com specifically. It's not necessarily a reason to abandon git-lfs. Recommended this list of alternative implementations and http://www.pachyderm.io.
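For anyone who hasn't tried it, here is a minimal sketch of what pointing git-lfs at a data directory looks like. The `git lfs` commands are the standard ones; the file pattern, paths, and commit message are made up for illustration, and the commands are wrapped in Python only to keep the example self-contained.

```python
# Minimal sketch: route large data files through git-lfs in an existing
# Git repository. The pattern "data/*.csv.gz" is hypothetical; substitute
# whatever your datasets actually look like.
import subprocess

def run(*cmd):
    """Run one git command and fail loudly if it does not succeed."""
    subprocess.run(cmd, check=True)

# One-time setup: install the LFS hooks into this repository.
run("git", "lfs", "install")

# Tell LFS to manage matching files instead of storing them directly in Git.
run("git", "lfs", "track", "data/*.csv.gz")

# The tracking rules live in .gitattributes, which must itself be committed.
run("git", "add", ".gitattributes", "data/")
run("git", "commit", "-m", "Track compressed data files with git-lfs")
```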

My 2 cents: if it's not something all of us authors are using routinely, then it's not right for this particular paper. We even dropped many things we do (version control! tests!), in order to focus on people just entering the on-ramp.

But I like to capture these discussions for ... edification, future articles, whatever. I agree that change tracking for data, at both the pro and amateur levels, is not at all sorted out. Thanks @fjukstad.

karthik commented 8 years ago

Weighing in with minimal context, but I agree completely. It holds other people (usually novices) to an unrealistic standard that we ourselves don't practice. It's good to separate "here are things we do" from "here are things we would like to do someday, if they work and people are willing to use them."

lexnederbragt commented 8 years ago

Here is the 140 character-limited starting point: https://twitter.com/lexnederbragt/timelines/771781985798946816

jennybc commented 8 years ago

As @fjukstad points out on Twitter, this could be added to "what we left out".

fjukstad commented 8 years ago

Hey!

First off: great read, lots of good points to take home!

I agree that version control of datasets (especially intermediate data) isn't really a mainstream thing yet, but I believe it's a step towards reproducible research. For example, if you're analyzing RNA-seq data through a pipeline with multiple stages, keeping the intermediate data (and results) under version control would simplify making your results reproducible, even if you update a tool in the middle of the pipeline. With new tools such as git-lfs and pachyderm, I think it's something for readers to be aware of!
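To make that concrete, here is a rough sketch of committing one intermediate result together with the version of the tool that produced it, assuming the intermediate files are already routed through git-lfs via `.gitattributes`. The file names and tool name are hypothetical placeholders, not part of any real pipeline.

```python
# Rough sketch: after a pipeline stage writes an intermediate file, commit
# that file (tracked by git-lfs) with a message recording which tool and
# version produced it, so a later tool upgrade can be traced and compared.
import subprocess

def commit_intermediate(path: str, tool: str, tool_version: str) -> None:
    """Record one intermediate result so later stages can be traced back to it."""
    subprocess.run(["git", "add", path], check=True)
    message = f"Intermediate {path}: produced by {tool} {tool_version}"
    subprocess.run(["git", "commit", "-m", message], check=True)

# Example usage after a (hypothetical) alignment stage has written its output:
# commit_intermediate("intermediate/sample01.bam", "some-aligner", "2.1.0")
```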

jennybc commented 8 years ago

I have RNA-seq data in Git and on GitHub. Not the raw data, of course, but the data once it was in a differential analysis pipeline. Yes, it's awkward. But the datasets get progressively smaller as you move them through the analysis, so luckily the bits that change the most are also the smallest.

den-run-ai commented 5 years ago

I have not looked at pachyderm too closely, but I listened to a podcast about it, and it has a lot of support for data engineering workflows. Anyway, for machine learning workflows DVC looks more applicable than git-lfs or pachyderm:

https://github.com/iterative/dvc
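As a minimal sketch of what that looks like in practice, DVC's Python API can read one pinned revision of a dataset from a project that tracks its data with DVC. The repository URL, file path, and tag below are placeholders, not a real project.

```python
# Minimal sketch: read a specific, versioned revision of a DVC-tracked file.
# The repo URL, path, and tag are placeholders for illustration only.
import dvc.api

with dvc.api.open(
    "data/expression_matrix.csv",                        # DVC-tracked file
    repo="https://github.com/example/rna-seq-project",   # Git repo using DVC
    rev="v1.0",                                          # tag/commit pinning the data version
) as f:
    header = f.readline()
    print(header)
```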

Disclaimer: I have a bit of bias, because DVC is written in Python and by Russian-speaking developers.