kousu opened this issue 2 years ago
Plus, something I haven't seen anywhere in the science scene, but that's worth mentioning: using a package manager. E.g. many video games are packaged with a separate `-data` package (e.g. `0ad-data`). This is roughly my plan for SCT: https://github.com/spinalcordtoolbox/data-template/pull/1
In the survey of approaches, I see basically two options:

1. Keep the dataset as a component that is automatically downloaded by a tool: a submodule with git, a package with apt/pip/conda/brew.

```
$ tool install <protocol>://project/with/analysis/scripts
$ run-analysis-script
```

2. Make downloading the data an idempotent one-liner at the top of your script, e.g.

```
# at the top of: run-analysis-script
dataset = downloader.get_data("datasetname", "v1.0.2")
```

```
$ run-analysis-script
```
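As a concrete sketch of option 2 in shell — the `get_data` name, the cache location, and the tarball URL layout are all assumptions here, not an existing tool — the key property is that rerunning the script is harmless because the download is skipped when the versioned copy already exists:

```shell
# Hypothetical idempotent downloader: fetch a versioned dataset archive into
# a per-version cache dir; a second run finds the dir and skips the download.
get_data() {
    name=$1; version=$2; url=$3
    cache="${XDG_CACHE_HOME:-$HOME/.cache}/datasets/$name-$version"
    if [ ! -d "$cache" ]; then
        mkdir -p "$cache"
        # fetch and unpack; remove the dir on failure so a retry starts clean
        curl -fsSL "$url" | tar -xz -C "$cache" || { rm -rf "$cache"; return 1; }
    fi
    printf '%s\n' "$cache"
}
```

Every run prints the same cache path; only the first run touches the network.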
In the second vein, there's this option with git:

```
git clone --depth 1 -b v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name.git
```

This will download the least amount of `data-set-name/` needed: only the parts relevant for version v1.0.2.
This has some problems though:

- `git clone` errors out (fatally, if you're using `set -e`, which you should be) the second time the script is run, because the repo already exists
- if you change the `-b` and rerun it, it will fail to update the dataset

There's a bunch of ways we could think of to fix each of these (checking if the directory exists first and handling it differently if so; `rm -rf`'ing it first, etc).
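The `rm -rf` variant is the bluntest of those fixes: always start from nothing, at the cost of re-downloading the entire dataset on every run. A sketch (the `fresh_clone` name is made up):

```shell
# Naive but idempotent: delete any existing copy, then shallow-clone the
# requested branch/tag fresh. Re-downloads the whole dataset every run.
fresh_clone() {
    branch=$1; repo=$2; dir=$3
    rm -rf "$dir"
    git clone --depth 1 -b "$branch" "$repo" "$dir"
}
```

This is the trade-off the idempotent function below is trying to avoid.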
Here's what I've come up with:

```
git_clone_idempotent_branch() {
    branch=$1; shift
    repo=$1; shift
    dir=$1
    if [ -z "$dir" ]; then
        dir=$(basename "$repo")  # XXX not quite right but close enough
    fi
    (
    set -e
    mkdir -p "$dir" && \
    cd "$dir" && \
    git init >/dev/null && \
    git remote add origin "$repo" 2>/dev/null ;
    ref="$(git ls-remote origin "$branch" | awk '{print $2}')"  # should get either refs/heads/$branch or refs/tags/$branch
    git checkout --orphan _orphan 2>/dev/null  # avoids "fatal: Refusing to fetch into current branch $ref of non-bare repository"
    git fetch -f --depth 1 origin "$ref":"$ref" && \
    git checkout -f "$branch"
    )
}
```
If you do

```
git_clone_idempotent_branch v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name
```

then it behaves like `git clone -b v1.0.2 --depth 1 git@data.neuro.polymtl.ca:datasets/data-set-name`, except that it can handle switching branches, while still only downloading the minimum needed. If you switch branches it will only download the `.git/objects/` that differ between that branch and the previously existing one.
It has at least one bug, which is that it doesn't translate the repo URL into a local dir name the same way that `git` does: it doesn't strip the `.git`, it might be susceptible to directory traversal attacks, etc. So that should be examined. And the `--orphan _orphan` line is janky. But it should work, generally.
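For the dir-name bug specifically, here's a hedged sketch of what the translation could look like (`repo_to_dir` is a hypothetical name, not part of the function above): take the last path component, strip a trailing `.git` the way `git clone` does, and refuse values that could escape the working directory.

```shell
# Hypothetical: map a repo URL to a local directory name roughly the way
# `git clone` would, rejecting values that smell like directory traversal.
repo_to_dir() {
    dir=$(basename "$1" .git)   # last path component, minus any ".git" suffix
    case "$dir" in
        ''|.|..|*/*) echo "refusing suspicious repo name: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "$dir"
}
```

E.g. `repo_to_dir git@data.neuro.polymtl.ca:datasets/data-set-name.git` would print `data-set-name`.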
Just place this at the top of your processing scripts and we won't need to worry about tracking down datasets again.
But the first approach, using `git submodule` or the equivalent, has many advantages:

- datasets can be placed in system-standardized places, e.g. `~/.cache` (which may or may not be desired, depending on how large your partitions are)

But these disadvantages:

- `git submodule` is a UI nightmare, and `pip` and `.deb` aren't much better; `dvc pull` sounds easy for the consumer, but I bet `dvc add --to-remote` is a giant pain, given the experience we had trying to get `git-annex` to work with Amazon S3.
- `git submodule` can record relative URLs, but only if you keep all the sub-repos on a single server. `pip` doesn't build full URLs into its names, but at the cost of making it hard to use any server but pypi.org (it's possible, with `pip -f`, so you can keep a backup if you want... but requires some effort)

After talking to @jcohenadad I realize there's a third approach too.
In summary:

Use a dependency system (e.g. `git submodule`):

Publisher:

```
cd analysis/
git init
git submodule add https://data.neuropoly.org/datasets/sct-testing-large
```

User:

```
git clone --recurse-submodules https://data.neuropoly.org/analysis   # downloads the dataset implicitly
cd analysis
./analysis.py
```
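One thing the publisher steps above gloss over: pinning the submodule to a released dataset version, so the analysis repo records exactly which data it was run against. A sketch — the `pin_dataset` helper is made up, and the tag is whatever the dataset repo publishes:

```shell
# Hypothetical helper: add the dataset repo as a submodule, check out a
# released tag inside it, and commit the pin so users get that exact version.
pin_dataset() {
    repo=$1; tag=$2
    dir=$(basename "$repo" .git)
    git submodule add "$repo" "$dir" &&
    git -C "$dir" checkout "$tag" &&
    git add .gitmodules "$dir" &&
    git commit -m "Pin $dir to dataset version $tag"
}
# e.g.: pin_dataset https://data.neuropoly.org/datasets/sct-testing-large 5.4
```

The commit stores the submodule's exact commit hash, so `git clone --recurse-submodules` reproduces it.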
Embed the data downloading into the processing script:

Publisher writes:

```
#!/usr/bin/env python3
# analysis.py
...
dataset = pooch.retrieve("https://data.neuropoly.org/datasets/sct-testing-large/archive/refs/tags/5.4.zip", known_hash=None)
```

User:

```
git clone https://data.neuropoly.org/analysis
cd analysis
./analysis.py
```
Put a dataset download command in the README:

Publisher writes:

```
# Installation

First, get the dataset by

    curl -LO https://data.neuropoly.org/datasets/sct-testing-large/archive/refs/tags/5.4.zip && unzip 5.4.zip

or

    git clone --depth 1 -b 5.4 https://data.neuropoly.org/datasets/sct-testing-large

Then, install the analysis code by

    git clone https://data.neuropoly.org/analysis && cd analysis && pip install -e .

Then run

    analysis -i ../5.4/ ...
```

User: follows the instructions as written.
An advantage of this method is the human factor: it makes the relationships between the components less abstract, especially for students getting used to the idea of doing reproducible science.
as always, thank you for the thorough investigations @kousu.
i like https://github.com/neuropoly/data-management/issues/136#issuecomment-941575717. I guess it does not solve the issue you raised earlier about the error thrown if the folder already exists, but given this is not all embedded into an opaque script, people will likely realize the cause of the error and manually remove the existing data folder.
https://huggingface.co/ was pointed out to me today. It's a cloudy-startup that does model hosting.
It seems to basically be a git server with git-lfs turned on: the models get stored in LFS to make downloads friendlier (though #68 would have solved that too for them), and there's a button that will run it on an Amazon VPS for you. But there doesn't seem to be any requirement to upload associated pytorch/torchvision/etc preprocessing code or even the original training scripts that produced the trained weights so it's unclear to me how many of the models hosted here are actually self-contained.
Related:
Some ideas in here https://goodresearch.dev/pipelines.html about versioning experiments-as-code.
Its suggestions are basically:

- Use a cloud service:
- As for the offline world, it gestures at:
FWIW, some students are already using wandb.ai for ivadomed-related projects. We even set up some instructions for newcomers in the lab
> FWIW, some students are already using wandb.ai for ivadomed-related projects. We even set up some instructions for newcomers in the lab
Yep! I saw. I first saw it over there, before seeing it on xcorr's page.
You can probably already tell, but I don't think any cloud option is a good idea in the long term. SaaS can only be sustainable if its business strategy is lock-in. Can we export our records from there, self-host them, and republish them somewhere other than WandB? Probably not. Even if we could, all our papers would cite the https://wandb.ai URLs, so in practice it would be extremely difficult to unlock ourselves.
https://github.com/jvns/git-commit-folders would be really really useful to integrate as a standard harness for training scripts; see https://github.com/neuropoly/data-management/issues/68#issuecomment-1858441789
There's about a hundred tools for tracking scientific data in processing pipelines.

- `git submodule`: your analysis/processing script needs to be under git, with its dependencies also tracked under git (and their version)
- `datalad` needs you to have `datalad` installed
- `dvc` has `dvc add [--to-remote]`, which then, I think(?), can be invoked with `dvc repro` (or maybe you need to `dvc pull && dvc repro`?)
- `academictorrents.get()`
- `pooch` is straightforward, you just write `local_file = pooch.retrieve("https://example.com/path/to/dataset.tgz")`; and several common scientific libs have an ad-hoc version of this in their utils folder:
  - `tensorflow.keras.utils.data_utils.get_file` as used in e.g. `tensorflow.keras.datasets.fashion_mnist.load_data`
  - `torch.utils.data.Dataset` as used in e.g. `torchvision.datasets`
  - `torchio.data.dataset` as used in `torchio.datasets`
  - `nltk.downloader`
  - `nilearn.datasets`
  - `mne.datasets`
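The pattern all of these helpers share can be sketched in a few lines of shell — the `get_file` name and cache path are assumptions here; `pooch` additionally verifies a `known_hash`, which is the part the ad-hoc versions usually skip:

```shell
# Sketch of the common download-helper pattern: cache by filename, verify a
# pinned checksum, return the local path. Later calls are cache hits.
get_file() {
    url=$1; sha256=$2
    cache="${XDG_CACHE_HOME:-$HOME/.cache}/get_file"
    f="$cache/$(basename "$url")"
    mkdir -p "$cache"
    [ -f "$f" ] || curl -fsSL -o "$f" "$url" || return 1
    # refuse (and delete) a file that doesn't match the pinned checksum
    echo "$sha256  $f" | sha256sum -c --quiet || { rm -f "$f"; return 1; }
    printf '%s\n' "$f"
}
```

The checksum is what makes the cached result trustworthy across reruns and mirrors.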
Currently in the lab we are mostly not writing down the datasets we're using anywhere, or maybe passing them in as arguments to training scripts, or sometimes writing down paths (e.g. https://github.com/ivadomed/ivadomed/blob/3e3989d408e5cc85cbc50354dc13e1b460648dbc/ivadomed/config/config_vertebral_labeling.json#L8) to data on our internal server. We should settle on an approach that lets us reproduce analyses without anyone needing to go hunting for missing data. And we should do it with `git`, because that's the whole long-term plan: to integrate a way to version our datasets alongside our code.