kevin-hanselman / dud

A lightweight CLI tool for versioning data alongside source code and building data pipelines.
https://kevin-hanselman.github.io/dud/
BSD 3-Clause "New" or "Revised" License

Getting started suggests gitignoring symlinks to cached files, why? #207

Closed. veriditin closed this issue 4 months ago.

veriditin commented 4 months ago

Dear developer(s)/Kevin,

We are evaluating tooling to use in our data pipelines, and after a discussion I saw on Hacker News, dud seemed like the tool to try due to its small scope and composability, as opposed to some other well-known tool in this space :)

The getting started guide is very nice; it's impressive how easily everything works, and the decision to delegate the storage syncing to rclone is imo a great one. So thanks a lot :)

I am currently confused by one thing.

In the getting started guide you mention needing to add the files tracked by dud to your .gitignore, if you want to commit your data pipeline to a git repository (obviously, we do).

However, having played around with it, I noticed that the files managed by dud are always in the cache, which is ignored by default, and once the files are dud-committed only hard-links remain in the actual repository. So what are the downsides of committing these hard-links to the git repo?

E.g. the viewer in gitlab handles it quite nicely, suggesting that it is indeed a hardlink to a content-addressed file in the cache:

[screenshot: the GitLab file viewer rendering the committed file as a link whose target is a content-addressed file in the dud cache]

and when trying to use the file in a script (e.g. performing the untarring manually), we can clearly see that some step is still missing:

$ tar -xvf cifar-10-python.tar.gz
tar: cifar-10-python.tar.gz: Cannot open: No such file or directory

which immediately makes you think: Ah! dud fetch/dud pull
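(For context, on a fresh clone the recovery flow would presumably look something like the sketch below; this assumes that dud fetch with no arguments pulls the committed data from the rclone remote into the local cache, at which point the committed link resolves again:)

$ dud fetch                         # pull the data from the remote into the local cache
$ tar -xvf cifar-10-python.tar.gz   # the committed link now resolves to the cached file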

Adding all these files manually to a .gitignore is quite an annoying process, especially once your data pipeline is built up from multiple different data sources that are processed by different stages and produce outputs that are difficult to enumerate completely and thus to ignore.

So, is committing the hardlinks to git completely fine, or am I missing something?

kevin-hanselman commented 4 months ago

Hi, @veriditin, and thanks for the thoughtful post!

You make a good point. Committing links to Git shouldn't inherently cause problems. However, I think the biggest risk is that you could accidentally commit something other than a link to Git. For example, what if you forget to dud commit a large binary, and then you git commit it? Sure, you should ideally notice that pretty quickly, but also maybe not. See also Murphy's law 😃.
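To make that failure mode concrete (a hypothetical session; the exact warning text varies across Git versions): if the data file is listed in a .gitignore, the slip-up is caught immediately, whereas without the ignore rule Git will happily stage the full binary.

$ git add cifar-10-python.tar.gz
The following paths are ignored by one of your .gitignore files:
cifar-10-python.tar.gz
hint: Use -f if you really want to add them.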

Ultimately the decision to gitignore files tracked by Dud is a safeguard against simple mistakes. I still recommend that you do so, but it's your decision to make for your projects.

Regarding the perceived tedium of gitignoring every binary, I would recommend using glob patterns in your .gitignore files, and multiple .gitignore files, to make this a lot easier. For example, if your datasets consist of image files, you might add *.jpg and/or *.png to your project's root .gitignore; from then on, all images will be ignored by Git. For another example, if you have a few heterogeneous datasets in your project, you could put each of them in folders named <dataset_name>/data and ignore them with <dataset_name>/.gitignore files.
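To illustrate (the file extensions and the cifar dataset name are just placeholders for whatever your project actually uses), the two approaches might look like this:

# <project root>/.gitignore: ignore common binary formats everywhere in the repo
*.jpg
*.png
*.tar.gz

# cifar/.gitignore: ignore only this dataset's data folder
data/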

I think this thread is a better fit for a discussion, so I'll be transferring it over there. Thanks again for sharing your thoughts!