duplicate data in directory structure

tilburgsciencehub / deprecated-website

Tilburg Science Hub is an open-source resource repository that supports students and researchers in the social sciences to efficiently manage data- and computation-intensive projects.

http://tilburgsciencehub.com


duplicate data in directory structure #16

Open rgreminger opened 4 years ago

rgreminger commented 4 years ago

The proposed directory structure stores the same data in multiple locations, because each file is duplicated into one or more input folders. The input folders do make the workflow clearer, but this approach could use up a lot of disk space very quickly (unless symbolic links are used, though I doubt those would work across different platforms).
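To make the trade-off concrete, here is a minimal Python sketch of the symbolic-link idea with a fallback to a plain copy on platforms where links cannot be created. The function name `link_or_copy` and any folder names passed to it are hypothetical and not part of the proposed structure.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(source: Path, target: Path) -> str:
    """Make `target` refer to `source`; return "symlink" or "copy"."""
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()  # replace a stale link or an old copy
    try:
        # A symlink takes almost no extra disk space.
        os.symlink(source.resolve(), target)
        return "symlink"
    except (OSError, NotImplementedError):
        # For example Windows without the privilege to create symlinks:
        # fall back to duplicating the file.
        shutil.copy2(source, target)
        return "copy"
```

Note that the link stores an absolute path, so it stops working once the input folder is moved to another machine; in that case only the copy stays portable.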

hannesdatta commented 4 years ago

The key is to keep the different pipeline stages portable, i.e., you can work on the analysis while I have prepped the dataset. I know that the main guy on a project does end up with a lot of duplicate files. I am kind of fine with this because disk space is cheap. If you can find a solution, let me know.

Another issue: for this minimal example, we could host a zip with the raw data on Tilburg Science Hub, as we strictly need to avoid teaching that you can store your data on GitHub. Makes sense? Downloading the data via an R script is platform-independent...
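The download step could look roughly like the sketch below. The comment above suggests an R script; this Python version only illustrates the same idea using the standard library. The function name `fetch_raw_data`, the temporary file name `raw_data.zip`, and whatever URL is passed in are assumptions, since no hosted zip exists yet.

```python
import urllib.request
import zipfile
from pathlib import Path


def fetch_raw_data(url: str, dest_dir: Path) -> list[str]:
    """Download a zip archive from `url` and extract it into `dest_dir`."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / "raw_data.zip"  # hypothetical temporary file name
    urllib.request.urlretrieve(url, str(archive))
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        names = zf.namelist()
    archive.unlink()  # keep only the extracted files
    return names
```

A call such as `fetch_raw_data("https://example.com/raw_data.zip", Path("data"))` (placeholder URL) would then be the first step of the pipeline, so the raw data never has to be committed to the repository.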

rgreminger commented 4 years ago

Portability definitely is a good point. I'll try to implement this in the example some time soon, but one thing is a bit unclear to me from the site (though I might just have missed this part): what is the best approach to keep the input folders up to date with upstream changes? Should this be done by the upstream stage or by the downstream stage?

Good idea regarding the raw data. I'll try adding the zip to the page through a PR, and will update the example.