kousu opened this issue 2 years ago
Plus, something I haven't seen anywhere in the science scene, but that's worth mentioning: using a package manager. E.g. many video games are packaged with a separate `-data` package (e.g. `0ad-data`). This is roughly my plan for SCT: https://github.com/spinalcordtoolbox/data-template/pull/1
In the survey of approaches, I see basically two options:

1. Keep the dataset as a component that is automatically downloaded by a tool: a submodule with git, a package with apt/pip/conda/brew.

```
$ tool install <protocol>://project/with/analysis/scripts
$ run-analysis-script
```

2. Make downloading the data an idempotent one-liner at the top of your script, e.g.

```
# at the top of: run-analysis-script
dataset = downloader.get_data("datasetname", "v1.0.2")
```

```
$ run-analysis-script
```
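As a concrete sketch of option 2 in shell — the `get_data` name, the cache location, and the tarball URL layout are all assumptions here, not an existing tool — the key property is that rerunning the script is harmless because the download is skipped when the versioned copy already exists:

```shell
# Hypothetical idempotent downloader: fetch a versioned dataset archive into
# a per-version cache dir; a second run finds the dir and skips the download.
get_data() {
    name=$1; version=$2; url=$3
    cache="${XDG_CACHE_HOME:-$HOME/.cache}/datasets/$name-$version"
    if [ ! -d "$cache" ]; then
        mkdir -p "$cache"
        # fetch and unpack; remove the dir on failure so a retry starts clean
        curl -fsSL "$url" | tar -xz -C "$cache" || { rm -rf "$cache"; return 1; }
    fi
    printf '%s\n' "$cache"
}
```

Every run prints the same cache path; only the first run touches the network.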
In the second vein, there's this option with git:

```
git clone --depth 1 -b v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name.git
```

This will download the least amount of `data-set-name/` needed: only the parts relevant for version v1.0.2.
This has some problems though:

- `git clone` errors out (fatally, if you're using `set -e`, which you should be) the second time the script is run, because the repo already exists
- if you change the `-b` and rerun it, it will fail to update the dataset

There's a bunch of ways we could think of to fix each of these (checking if the directory exists first and handling it differently if so; `rm -rf`'ing it first, etc).
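The `rm -rf` variant is the bluntest of those fixes: always start from nothing, at the cost of re-downloading the entire dataset on every run. A sketch (the `fresh_clone` name is made up):

```shell
# Naive but idempotent: delete any existing copy, then shallow-clone the
# requested branch/tag fresh. Re-downloads the whole dataset every run.
fresh_clone() {
    branch=$1; repo=$2; dir=$3
    rm -rf "$dir"
    git clone --depth 1 -b "$branch" "$repo" "$dir"
}
```

This is the trade-off the idempotent function below is trying to avoid.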
Here's what I've come up with:

```
git_clone_idempotent_branch() {
    branch=$1; shift
    repo=$1; shift
    dir=$1
    if [ -z "$dir" ]; then
        dir=$(basename "$repo")  # XXX not quite right but close enough
    fi
    (
    set -e
    mkdir -p "$dir" && \
    cd "$dir" && \
    git init >/dev/null && \
    git remote add origin "$repo" 2>/dev/null ;
    ref="$(git ls-remote origin "$branch" | awk '{print $2}')"  # should get either refs/heads/$branch or refs/tags/$branch
    git checkout --orphan _orphan 2>/dev/null  # avoids "fatal: Refusing to fetch into current branch $ref of non-bare repository"
    git fetch -f --depth 1 origin "$ref":"$ref" && \
    git checkout -f "$branch"
    )
}
```
If you do

```
git_clone_idempotent_branch v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name
```

then it behaves like `git clone -b v1.0.2 --depth 1 git@data.neuro.polymtl.ca:datasets/data-set-name`, except that it can handle switching branches, while still only downloading the minimum needed. If you switch branches it will only download the `.git/objects/` that differ between that branch and the previously existing one.
It has at least one bug, which is that it doesn't translate the repo URL into a local dir name the same way that `git` does: it doesn't strip the `.git`, it might be susceptible to directory traversal attacks, etc. So that should be examined. And the `--orphan _orphan` line is janky. But it should work, generally.
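For the dir-name bug specifically, here's a hedged sketch of what the translation could look like (`repo_to_dir` is a hypothetical name, not part of the function above): take the last path component, strip a trailing `.git` the way `git clone` does, and refuse values that could escape the working directory.

```shell
# Hypothetical: map a repo URL to a local directory name roughly the way
# `git clone` would, rejecting values that smell like directory traversal.
repo_to_dir() {
    dir=$(basename "$1" .git)   # last path component, minus any ".git" suffix
    case "$dir" in
        ''|.|..|*/*) echo "refusing suspicious repo name: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "$dir"
}
```

E.g. `repo_to_dir git@data.neuro.polymtl.ca:datasets/data-set-name.git` would print `data-set-name`.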
Just place this at the top of your processing scripts and we won't need to worry about tracking down datasets again.
But the first approach, using `git submodule` or the equivalent, has many advantages:

- datasets can be placed in system-standardized places, e.g. `~/.cache` (which may or may not be desired, depending on how large your partitions are)

But these disadvantages:

- `git submodule` is a UI nightmare, and `pip` and `.deb` aren't much better; `dvc pull` sounds easy for the consumer, but I bet `dvc add --to-remote` is a giant pain, given the experience we had trying to get `git-annex` to work with Amazon S3.
- `git submodule` can record relative URLs, but only if you keep all the sub-repos on a single server. `pip` doesn't build full URLs into its names, but at the cost of making it hard to use any server but pypi.org (it's possible, with `pip -f`, so you can keep a backup if you want... but requires some effort)

After talking to @jcohenadad I realize there's a third approach too.
In summary:

Use a dependency system (e.g. `git submodule`):

Publisher:

```
cd analysis/
git init
git submodule add https://data.neuropoly.org/datasets/sct-testing-large
```

User:

```
git clone --recurse-submodules https://data.neuropoly.org/analysis   # downloads the dataset implicitly
cd analysis
./analysis.py
```
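One thing the publisher steps above gloss over: pinning the submodule to a released dataset version, so the analysis repo records exactly which data it was run against. A sketch — the `pin_dataset` helper is made up, and the tag is whatever the dataset repo publishes:

```shell
# Hypothetical helper: add the dataset repo as a submodule, check out a
# released tag inside it, and commit the pin so users get that exact version.
pin_dataset() {
    repo=$1; tag=$2
    dir=$(basename "$repo" .git)
    git submodule add "$repo" "$dir" &&
    git -C "$dir" checkout "$tag" &&
    git add .gitmodules "$dir" &&
    git commit -m "Pin $dir to dataset version $tag"
}
# e.g.: pin_dataset https://data.neuropoly.org/datasets/sct-testing-large 5.4
```

The commit stores the submodule's exact commit hash, so `git clone --recurse-submodules` reproduces it.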
Embed the data downloading into the processing script:

Publisher writes:

```
#!/usr/bin/env python3
# analysis.py
...
dataset = pooch.retrieve("https://data.neuropoly.org/datasets/sct-testing-large/archive/refs/tags/5.4.zip", known_hash=None)
```

User:

```
git clone https://data.neuropoly.org/analysis
cd analysis
./analysis.py
```
Put a dataset download command in the README:

Publisher writes:

```
# Installation

First, get the dataset by

    curl -LO https://data.neuropoly.org/datasets/sct-testing-large/archive/refs/tags/5.4.zip && unzip 5.4.zip

or

    git clone --depth 1 -b 5.4 https://data.neuropoly.org/datasets/sct-testing-large

Then, install the analysis code by

    git clone https://data.neuropoly.org/analysis && cd analysis && pip install -e .

Then run

    analysis -i ../5.4/ ...
```

User: follows the instructions as written.
An advantage of this method is the human factor: it makes the relationships between the components less abstract, especially for students getting used to the idea of doing reproducible science.
as always, thank you for the thorough investigations @kousu.
i like https://github.com/neuropoly/data-management/issues/136#issuecomment-941575717. I guess it does not solve the issue you raised earlier about the error thrown if the folder already exists, but given this is not all embedded into an opaque script, people will likely realize the cause of the error and manually remove the existing data folder.
https://huggingface.co/ was pointed out to me today. It's a cloudy-startup that does model hosting.
It seems to basically be a git server with git-lfs turned on: the models get stored in LFS to make downloads friendlier (though #68 would have solved that too for them), and there's a button that will run it on an Amazon VPS for you. But there doesn't seem to be any requirement to upload associated pytorch/torchvision/etc preprocessing code or even the original training scripts that produced the trained weights so it's unclear to me how many of the models hosted here are actually self-contained.
Related:
Some ideas in here https://goodresearch.dev/pipelines.html about versioning experiments-as-code.
Its suggestions are basically:

- Use a cloud service:
- As for the offline world, it gestures at:
FWIW, some students are already using wandb.ai for ivadomed-related projects. We even set up some instructions for newcomers in the lab
> FWIW, some students are already using wandb.ai for ivadomed-related projects. We even set up some instructions for newcomers in the lab
Yep! I saw. I first saw it over there, before seeing it on xcorr's page.
You can probably already tell, but I don't think any cloud option is a good idea in the long term. SaaS can only be sustainable if its business strategy is lock-in. Can we export our records from there, self-host them, and republish them somewhere other than WandB? Probably not. Even if we could, all our papers would cite the https://wandb.ai URLs, so in practice it would be extremely difficult to unlock ourselves.
https://github.com/jvns/git-commit-folders would be really really useful to integrate as a standard harness for training scripts; see https://github.com/neuropoly/data-management/issues/68#issuecomment-1858441789
There's about a hundred tools for tracking scientific data in processing pipelines.

- `git submodule`: your analysis/processing script needs to be under git, with its dependencies also tracked under git (and their version)
- `datalad` needs you to have `datalad` installed
- `dvc` has `dvc add [--to-remote]`, which then, I think(?), can be invoked with `dvc repro` (or maybe you need to `dvc pull && dvc repro`?)
- `academictorrents.get()`
- `pooch` is straightforward, you just write `local_file = pooch.retrieve("https://example.com/path/to/dataset.tgz")`; and several common scientific libs have an ad-hoc version of this in their utils folder:
  - `tensorflow.keras.utils.data_utils.get_file` as used in e.g. `tensorflow.keras.datasets.fashion_mnist.load_data`
  - `torch.utils.data.Dataset` as used in e.g. `torchvision.datasets`
  - `torchio.data.dataset` as used in `torchio.datasets`
  - `nltk.downloader`
  - `nilearn.datasets`
  - `mne.datasets`
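The pattern all of these helpers share can be sketched in a few lines of shell — the `get_file` name and cache path are assumptions here; `pooch` additionally verifies a `known_hash`, which is the part the ad-hoc versions usually skip:

```shell
# Sketch of the common download-helper pattern: cache by filename, verify a
# pinned checksum, return the local path. Later calls are cache hits.
get_file() {
    url=$1; sha256=$2
    cache="${XDG_CACHE_HOME:-$HOME/.cache}/get_file"
    f="$cache/$(basename "$url")"
    mkdir -p "$cache"
    [ -f "$f" ] || curl -fsSL -o "$f" "$url" || return 1
    # refuse (and delete) a file that doesn't match the pinned checksum
    echo "$sha256  $f" | sha256sum -c --quiet || { rm -f "$f"; return 1; }
    printf '%s\n' "$f"
}
```

The checksum is what makes the cached result trustworthy across reruns and mirrors.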
Currently in the lab we are mostly not writing down the datasets we're using anywhere, or maybe passing them in as arguments to training scripts, or sometimes writing down paths (e.g. https://github.com/ivadomed/ivadomed/blob/3e3989d408e5cc85cbc50354dc13e1b460648dbc/ivadomed/config/config_vertebral_labeling.json#L8) to data on our internal server. We should settle on an approach that lets us reproduce analyses without anyone needing to go hunting for missing data. And we should do it with `git`, because that's the whole long-term plan: to integrate a way to version our datasets alongside our code.