spinalcordtoolbox / PAM50

https://github.com/neuropoly/spinalcordtoolbox
2 stars 1 forks source link

Migrate PAM50 to pip #1

Open kousu opened 3 years ago

kousu commented 3 years ago

Part of https://github.com/neuropoly/spinalcordtoolbox/issues/2669

kousu commented 3 years ago

i am not sure if i understand everything correctly, but i notice file name changed for e.g.:

spinalcordtoolbox/data/PAM50/atlas/PAM50_atlas_00.nii.gz 

so what happens if, e.g. someone has SCT installed under, e.g.: sct-v3.4.4 instead of spinalcordtoolbox. Will it still work? I suppose we are talking about the hierarchy of the python package, and there, it will always be spinalcordtoolbox, right?

Instead of just saying `sct/data/PAM50/atlas/PAM50_atlas_00.nii.gz' we'll use

import spinalcordtoolbox.data.PAM50

...

importlib.resources.path(spinalcordtoolbox.data.PAM50, 'atlas/PAM50_atlas_00.nii.gz').__enter__() # maybe a helper to shorten this...

This also works:

import spinalcordtoolbox.data.PAM50.atlas

...

importlib.resources.path(spinalcordtoolbox.data.PAM50.atlas, 'PAM50_atlas_00.nii.gz').__enter__() # maybe a helper to shorten this...

The advantage of this is:

  1. You get an ImportError if the data isn't installed, which should be a lot clearer where the problem is
  2. pip keeps a cache in ~/.cache/pip of previous versions of packages, so reinstalling doesn't require redownloading. And this is safe because it keys them by hash and version string, so it can't be fooled. If the new version of SCT keeps using the old version of the data, no one needs to redownload it.
    1. and we can exploit this to use CI caching to save a lot of bandwidth on testing
  3. The data is contained under site-packages/ (so, currently, $SCT_DIR/python/envs/venv_sct/lib/python3.6/site-packages/spinalcordtoolbox/data) so it can't get muddled up with a different install.
  4. In principle we can use RECORD to integrity check the datasets -- or even the entire spinalcordtoolbox -- at any time, even after it's installed.
    • This sounds like a good thing to build into sct_check_dependencies

The disadvantage is:

  1. You can't just go look at the data like you can when it's in your working directory, so pip isn't great for courseware like course_beijing or sct-example-data. I don't know what we should do about those. My plan for the next, like, 72 hours is to just leave them alone and keep them in sct_download_data. Maybe that stuff should just be curl | tar? Maybe we can talk about that today?