Closed drlaw1558 closed 6 years ago
An elegant solution that works for everyone may be tough to find here. FWIW, I also use environment variables to point to the data a notebook needs, especially since I'm often dealing with examples that use FITS files, which you don't want stored on GitHub.
When available, one could also provide a link back to the archive holdings, or download commands, depending on where the data live.
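As a rough illustration of the "download commands" idea, a notebook could fetch a file only when it is not already on disk. This is just a sketch: `fetch_if_missing` is a made-up helper name, and the URL and destination path are whatever the notebook author supplies.

```python
import urllib.request
from pathlib import Path


def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists.

    Hypothetical helper for illustration; returns the local Path.
    """
    dest = Path(dest)
    if dest.is_file():
        # Already staged locally, so skip the download.
        return dest
    # Create the target directory on first use.
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest
```

A notebook cell would then call something like `fetch_if_missing(archive_url, "data/example.fits")` before opening the file, so the large file never has to live in the repository.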
Yeah, this is a hard usability problem. One solution I've seen used in the past is Git-LFS, which LSST are using for some of their repos.
It would be good to develop some recommended solutions to this.
Dataversioncontrol (https://dataversioncontrol.com) looks interesting in this regard.
Agreed. Looks very similar to Git-LFS in approach.
@eteq wrote this up, seems generally applicable in this context: https://innerspace.stsci.edu/pages/viewpage.action?pageId=129671315
Another issue that would be useful to have a style guide for: linked data directories.
This is particularly relevant to notebooks, which might operate on a data file that they need to find in order for an end user to run the notebook successfully. I can see two kinds of files: 1) small data files that make sense to keep within the repository itself and can easily be located via an environment variable; 2) large data files that shouldn't be in the repository and should be staged elsewhere.
Being relatively new to git and Python (coming from svn and IDL), I've been inventing my own approaches using environment variables pointing to (1) the local repo checkout directory and (2) a corresponding data directory on central store. This works well enough, but if there is a more elegant solution it would be helpful to describe it here.
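For concreteness, the two-variable approach might look roughly like the sketch below. The variable names `NOTEBOOK_REPO_DIR` and `NOTEBOOK_DATA_DIR`, the `data/` subdirectory, and the helper `find_data_file` are all assumptions made for illustration, not an agreed convention.

```python
import os
from pathlib import Path

# Hypothetical variable names; a style guide would need to settle on real ones.
REPO_ENV = "NOTEBOOK_REPO_DIR"   # local checkout of the repository
DATA_ENV = "NOTEBOOK_DATA_DIR"   # large-file area, e.g. on central store


def find_data_file(name, env=None):
    """Return the first existing path for `name`, checking the repo then the data dir."""
    env = os.environ if env is None else env
    candidates = []
    if REPO_ENV in env:
        # Small files are assumed to live in a `data/` folder inside the checkout.
        candidates.append(Path(env[REPO_ENV]) / "data" / name)
    if DATA_ENV in env:
        candidates.append(Path(env[DATA_ENV]) / name)
    for path in candidates:
        if path.is_file():
            return path
    raise FileNotFoundError(
        f"{name!r} not found; set {REPO_ENV} or {DATA_ENV} to a directory containing it"
    )
```

Raising an error that names the variables gives an end user who has not set them a hint about what to fix, which is most of the usability problem here.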