neuropoly / data-management

Repo that deals with datalad aspects for internal use

Integrating `git` into training scripts #136

Open kousu opened 2 years ago

kousu commented 2 years ago

There are about a hundred tools for tracking scientific data in processing pipelines.

Currently in the lab we are mostly not writing down anywhere which datasets we're using, or maybe passing them in as arguments to training scripts, or sometimes writing down paths (e.g. https://github.com/ivadomed/ivadomed/blob/3e3989d408e5cc85cbc50354dc13e1b460648dbc/ivadomed/config/config_vertebral_labeling.json#L8) to data on our internal server. We should settle on an approach that lets us reproduce analyses without anyone needing to go hunting for missing data. And we should do it with git, because that's the whole long-term plan: to integrate a way to version our datasets alongside our code.

kousu commented 2 years ago

Plus, what I haven't seen anywhere in the science scene, but what's worth mentioning, is using a package manager: e.g. many video games are packaged with a separate -data package (e.g. 0ad-data). This is roughly my plan for SCT: https://github.com/spinalcordtoolbox/data-template/pull/1
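For instance, Debian's 0ad package declares 0ad-data as a hard dependency, so installing the engine pulls in the data automatically (output abridged; treat the exact lines as an assumption):

    $ apt-cache depends 0ad
    0ad
      Depends: 0ad-data
      ...

The analogue for us would be an analysis package that declares its dataset as a dependency, so installing the analysis fetches the data too.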

kousu commented 2 years ago

In the survey of approaches, I see basically two options:

  1. Keep the dataset as a component that is automatically downloaded by a tool: a submodule with git, a package with apt/pip/conda/brew.

    $ tool install <protocol>://project/with/analysis/scripts
    $ run-analysis-script
  2. Make downloading the data an idempotent one-liner at the top of your script, e.g.

    # at the top of: run-analysis-script
    dataset = downloader.get_data("datasetname", "v1.0.2");
    $ run-analysis-script
kousu commented 2 years ago

In the second vein, there's this option with git:

git clone --depth 1 -b v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name.git

This will download the least amount of data-set-name/ needed: only the parts relevant for version 1.0.2.
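For example (hypothetical tag; the one-line log is what any depth-1 clone gives):

    $ git clone --depth 1 -b v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name.git
    $ cd data-set-name && git log --oneline | wc -l   # shallow history: only the tagged commit
    1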

This has some problems though: it's not idempotent (re-running it fails because the directory already exists), and once the clone exists it can't be switched to a different version.

There's a bunch of ways we could think of to fix each of these: checking if the directory exists first and handling it differently if so, rm -rf'ing it first, etc.

Here's what I've come up with:

git_clone_idempotent_branch() {
  branch=$1; shift
  repo=$1; shift
  dir=$1

  if [ -z "$dir" ]; then
    # strip a trailing .git like git-clone does; XXX still not exactly git's
    # URL-to-directory mapping, but close enough
    dir=$(basename "$repo" .git)
  fi

  (
  set -e
  mkdir -p "$dir"
  cd "$dir"
  git init -q
  # 'git remote add' fails on re-runs because origin already exists, so fall
  # back to updating its URL instead
  git remote add origin "$repo" 2>/dev/null || git remote set-url origin "$repo"
  # resolve to a full ref: either refs/heads/$branch or refs/tags/$branch
  ref="$(git ls-remote origin "$branch" | awk '{print $2; exit}')"
  # park HEAD on an unborn branch; avoids "fatal: Refusing to fetch into
  # current branch $ref of non-bare repository"
  git checkout -q --orphan _orphan 2>/dev/null || true
  git fetch -f --depth 1 origin "$ref":"$ref"
  git checkout -qf "$branch"
  )
}

If you do

git_clone_idempotent_branch v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name

Then it behaves like git clone -b v1.0.2 --depth 1 git@data.neuro.polymtl.ca:datasets/data-set-name, except that it can handle switching branches while still only downloading the minimum needed. If you switch branches, it only downloads the .git/objects/ that differ between the new branch and the previously existing one.

It still has at least one rough edge: it doesn't translate the repo URL into a local directory name in exactly the way git does, and it might be susceptible to directory traversal attacks, so that should be examined. And the --orphan _orphan line is janky. But it should work, generally.

Just place this at the top of your processing scripts and we won't need to worry about tracking down datasets again.
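A minimal sketch of that (hypothetical script and paths; assumes the function above is saved as git_clone_idempotent_branch.sh next to the script):

    #!/bin/sh
    # run-analysis-script (hypothetical)
    . "$(dirname "$0")/git_clone_idempotent_branch.sh"
    # fetch exactly the v1.0.2 snapshot of the dataset, idempotently, into data/
    git_clone_idempotent_branch v1.0.2 git@data.neuro.polymtl.ca:datasets/data-set-name data/
    python3 train.py --data data/

Re-running the script re-uses the existing data/ instead of failing or re-downloading.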

kousu commented 2 years ago

But the first approach, using git submodule or the equivalent, has advantages of its own, as well as its own disadvantages.

kousu commented 2 years ago

After talking to @jcohenadad I realize there's a third approach too.

In summary:

  1. Use a dependency system (e.g. git submodule):

    Publisher (see the tag-pinning sketch after this list):

    cd analysis/
    git init
    git submodule add https://data.neuropoly.org/datasets/sct-testing-large

    User:

    git clone --recurse-submodules https://data.neuropoly.org/analysis  # downloads the dataset implicitly
    cd analysis
    ./analysis.py
  2. Embed the data downloading into the processing script:

    Publisher writes:

    #!/usr/bin/env python3
    # analysis.py
    
    import pooch
    ...
    # known_hash is a required argument of pooch.retrieve(); None skips integrity checking.
    # The return value is the local path of the downloaded file (here, a zip).
    dataset = pooch.retrieve("https://data.neuropoly.org/datasets/sct-testing-large/archive/refs/tags/5.4.zip", known_hash=None)

    User:

    git clone https://data.neuropoly.org/analysis
    cd analysis
    ./analysis.py
  3. Put a dataset download command in the README:

    Publisher writes:

    # Installation
    
    First, get the dataset by
    
        curl -LO https://data.neuropoly.org/datasets/sct-testing-large/archive/refs/tags/5.4.zip && unzip 5.4.zip
    
    or
    
        git clone --depth 1 -b 5.4 https://data.neuropoly.org/datasets/sct-testing-large
    
    Then, install the analysis code by
    
        git clone https://data.neuropoly.org/analysis && cd analysis && pip install -e .
    
    Then run
    
        analysis -i ../5.4/ ...

    User: follows the instructions as written.

    An advantage of this method is the human factor: it makes the relationships between the components less abstract, especially for students getting used to the idea of making reproducible science.
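One detail worth spelling out for option 1: `git submodule add` records whichever commit the submodule happens to have checked out, so pinning to a released version is an explicit publisher-side step. A sketch, using the same hypothetical URLs as above:

    cd analysis/
    git submodule add https://data.neuropoly.org/datasets/sct-testing-large
    git -C sct-testing-large checkout 5.4   # pin the submodule to the tag
    git add sct-testing-large               # record the pinned commit in the analysis repo
    git commit -m "Pin sct-testing-large to 5.4"

After that, `git clone --recurse-submodules` of the analysis repo reproduces exactly that dataset version.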

jcohenadad commented 2 years ago

as always, thank you for the thorough investigations @kousu.

i like https://github.com/neuropoly/data-management/issues/136#issuecomment-941575717. I guess it does not solve the issue you raised earlier about the error thrown if the folder already exists, but given this is not all embedded into an opaque script, people will likely realize the cause of the error and manually remove the existing data folder.

kousu commented 2 years ago

https://huggingface.co/ was pointed out to me today. It's a cloud startup that does model hosting.

It seems to basically be a git server with git-lfs turned on: the models get stored in LFS to make downloads friendlier (though #68 would have solved that for them too), and there's a button that will run a model on an Amazon VPS for you. But there doesn't seem to be any requirement to upload the associated pytorch/torchvision/etc preprocessing code, or even the original training scripts that produced the trained weights, so it's unclear to me how many of the models hosted there are actually self-contained.
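Concretely, that means fetching a model is just git plus git-lfs (hypothetical repo path):

    git lfs install    # set up the LFS smudge/clean filters once per machine
    git clone https://huggingface.co/some-org/some-model   # weight files come down via LFS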

Related:

kousu commented 1 year ago

Some ideas in here https://goodresearch.dev/pipelines.html about versioning experiments-as-code.

Its suggestions are basically to use a cloud service (wandb.ai is among the ones it mentions); as for the offline world, it only gestures at alternatives.

jcohenadad commented 1 year ago

FWIW, some students are already using wandb.ai for ivadomed-related projects. We even set up some instructions for newcomers in the lab

kousu commented 1 year ago

> FWIW, some students are already using wandb.ai for ivadomed-related projects. We even set up some instructions for newcomers in the lab

Yep! I saw. I first saw it over there, before seeing it on xcorr's page.

You can probably already tell, but I don't think any cloud option is a good idea in the long term. SaaS can only be sustainable if its business strategy is lock-in. Can we export our records from wandb.ai, self-host them, and republish them elsewhere? Probably not. And even if we could, all our papers would cite the https://wandb.ai URLs, so in practice it would be extremely difficult to unlock ourselves from it.

kousu commented 9 months ago

https://github.com/jvns/git-commit-folders would be really really useful to integrate as a standard harness for training scripts; see https://github.com/neuropoly/data-management/issues/68#issuecomment-1858441789
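To make that concrete, here's a hedged sketch of such a harness; the invocation and mount layout are assumptions (the actual CLI is in the project's README), but the premise of the tool is that commits, branches, and tags all appear as read-only folders:

    # hypothetical harness built on git-commit-folders: mount the dataset repo's
    # history as folders, then train against an immutable snapshot named by tag
    # (exact command and layout are assumptions; see the project README)
    git-commit-folders mnt/ &
    python3 train.py --data mnt/tags/v1.0.2/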