hyperspy / rosettasciio

Python library for reading and writing scientific data formats
https://hyperspy.org/rosettasciio
GNU General Public License v3.0

Use pooch library to manage test files #14

Closed ericpre closed 1 year ago

ericpre commented 2 years ago

Describe the functionality you would like to see.

Use pooch to manage test files and avoid packaging them, while still being able to run the test suite from a distribution.

Describe the context

https://github.com/hyperspy/hyperspy/issues/2816 https://github.com/hyperspy/rosettasciio/pull/11#issuecomment-1216487452

Additional information

@hakonanes has experience with using it in kikuchipy.

hakonanes commented 2 years ago

We've added a data module to both orix (module) and kikuchipy (module). I mention the former since it's easier to follow the use of pooch due to fewer datasets. See the data module part of the orix contributing guide for details (similar to kikuchipy's).

CSSFrancis commented 1 year ago

I looked into this a little bit this morning.

Some initial thoughts.

  1. We should make another repo (rosettasciio-data) to hold the data. Do we care about the git history for the files? I can just create the new repo and copy the files over.
  2. For the .zspy files we probably need to zip the files otherwise we will end up with a large number of calls to download the data. I'd recommend we don't do that for any other datasets because it will be harder to track when files are added or changed.
  3. @hakonanes is there a reason you have the base URL set to "" rather than "https://github.com/pyxem/orix-data/raw/main/" here?

    My thought was to:

  1. create a new repo and copy the data over
  2. create a new PR testing based on the copied data
  3. delete the existing data from the rosettasciio repo

I've written a script to create a registry of the files, so from this point things should be fairly simple.

hakonanes commented 1 year ago

@hakonanes is there a reason you have the base url set to "" rather than "https://github.com/pyxem/orix-data/raw/main/" here

In orix' case this would indeed simplify the links in the registry. I used pooch the first time in kikuchipy, where files in the registry come from different sources (the package itself, the kikuchipy-data repo and various Zenodo repos), and therefore didn't have any base URL to point to. I then just fitted kikuchipy's use of pooch to orix, hence the use of no base URL.

hakonanes commented 1 year ago

If we intend to make the RosettaSciIO data module or parts of it part of the public API, the docs should take some inspiration from the scipy.datasets module.

hakonanes commented 1 year ago

The SciPy datasets module has a clear_cache() function (not in kikuchipy or orix; we should add it), which is useful because we want to avoid a cache full of the same files in different directories corresponding to different versions of RosettaSciIO.
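A sketch of what such a function could look like for rosettasciio, assuming pooch's default cache location is used; the `cache_dir` argument is a hypothetical override, added here only to make the sketch testable:

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir=None):
    """Delete the download cache, mirroring scipy.datasets.clear_cache().

    By default this would target pooch's cache location for the project,
    i.e. pooch.os_cache("rosettasciio").
    """
    if cache_dir is None:
        import pooch

        cache_dir = pooch.os_cache("rosettasciio")
    cache_dir = Path(cache_dir)
    if cache_dir.exists():
        shutil.rmtree(cache_dir)
```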

CSSFrancis commented 1 year ago

If we intend to make the RosettaSciIO data module or parts of it part of the public API, the docs should take some inspiration from the scipy.datasets module.

I don't think that making the data public is terribly helpful. Most of the data is empty or very sparse/small. If we wanted something like a data module it's probably better to have that in downstream packages.

The SciPy datasets module has a clear_cache() function (not in kikuchipy or orix; we should add it), which is useful because we want to avoid a cache full of the same files in different directories corresponding to different versions of RosettaSciIO.

That is a good idea. I can add that to the list of things to add.

CSSFrancis commented 1 year ago

It might also be worth thinking about how new data is added for new tests and how that works with the CI.

hakonanes commented 1 year ago

I'd go with two separate PRs in the beginning, and then later on see if there is a benefit to automate parts of the process. It might also be that adding a dataset function is not as straight-forward as a 1-1 mapping to a particular file in the data repo, which might not be as easy to automate.

CSSFrancis commented 1 year ago

So for the CI then you would have something like:

  1. Check to see if there is an open PR in rosettasciio-data with the same name as the PR in rosettasciio
  2. If yes, point Pooch to that repo to load the data
  3. If no, point Pooch to the main repo.

It's not the best process but I think it would work. It's going to end up being a little awkward regarding merging. For example, if the PR for rosettasciio-data isn't merged first, then the CI will fail once the rosettasciio PR is merged.

ericpre commented 1 year ago

It might also be worth thinking about how new data is added for new tests and how that works with the CI.

  • Do people submit a PR to the rosettasciio-data repository first, which is then merged before the PR is merged here?
  • Do we create a new branch in the rosettasciio-data repository for each PR? This is something we could handle with a GitHub action or a label.

Yes, this is the important part. :)

Is it necessary to have a separate repository? If not, what is the advantage? Is it to avoid having many versions of the cache when developing rosettasciio?

Can the workflow be simplified to:

  1. upload the data somewhere: it can be a PR to a git repository or any other supported repository
  2. if step 1 involves a PR, get the PR merged
  3. in the PR to rosettasciio which adds the new data, add the test data to the registry with link, sha, etc.

I don't know how pooch works but from what I read this is what I would expect.

CSSFrancis commented 1 year ago

It might also be that adding a dataset function is not as straight-forward as a 1-1 mapping to a particular file in the data repo, which might not be as easy to automate.

So I think that calling something like the fetch method in orix at the top of each of the test files is probably the right thing to do.

path_dict = {path: _fetch(path) for path in tia_files}

# loading the file in some test

def test_hs_load():
    hs.load(path_dict["64x64_TEM_images_acquire.emi"])

CSSFrancis commented 1 year ago

So @hakonanes can probably answer better than I can but I'll give it a shot.

Is it necessary to have a separate repository? If not, what is the advantage? Is it to avoid having many versions of the cache when developing rosettasciio?

The separate repository is just to have a repository of data somewhere. We could also host the data someplace like Zenodo or any other number of places. We can also link to data inside of rosettasciio:

Something like:

import pooch

odie = pooch.create(
    # Use the default cache folder for the operating system
    path=pooch.os_cache("rosettasciio"),
    # Use the raw URL so the file contents (not the GitHub HTML page) are downloaded
    base_url="https://github.com/hyperspy/rosettasciio/raw/main/rsciio/tests/",
    # The registry specifies the files that can be fetched
    registry={'msa_files/example2.msa': '2367c625c23afde0f4419c187e5e6c7cc5b6a1f5b77827ba7a0f47ab4993f20b'},
)

path = odie.fetch('msa_files/example2.msa')  # Retrieve the data if not in cache
print(path)  # e.g. '.../cache/rosettasciio/msa_files/example2.msa'

This will cache 'msa_files/example2.msa' under the '.cache/rosettasciio/' folder.

Pooch will fetch the data from wherever we like, as long as we give it a URL.

I think the intention is to decrease the size of rosettasciio, correct? That requires moving the data. Maybe we should think about whether reducing the size of the test module makes sense. The test datasets shouldn't be downloaded unless you are doing development, so I don't know if there is much reason not to include the test data.

Can the workflow be simplified to:

  1. upload the data somewhere: it can be a PR to a git repository or any other supported repository
  2. if step 1 involves a PR, get the PR merged
  3. in the PR to rosettasciio which adds the new data, add the test data to the registry with link, sha, etc.

I don't know how pooch works but from what I read this is what I would expect.

Yeah, something like that would probably work. More than anything, we probably need a defined workflow for how things should work.

jlaehne commented 1 year ago

I think the intention is to decrease the size of rosettasciio, correct? That requires moving the data.

Well, could we keep the data in the same repository but exclude it from releases?

If we move all data files to one folder, data, that we keep in the main repository, we should be able to use the export-ignore attribute in .gitattributes to avoid shipping that folder with a release. Then pooch can fetch the data from there if needed, but it stays in the same repo so that we avoid complicating the merge procedure (see e.g. https://softwareengineering.stackexchange.com/a/367655)
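For illustration, the .gitattributes entry could look like this (the data path is an assumption, not the current layout):

```text
# .gitattributes
# Exclude the test data folder from `git archive`, which is what GitHub
# uses to build release tarballs.
rsciio/tests/data export-ignore
```

Note that export-ignore only affects archives produced by `git archive`; a wheel or sdist built by the packaging tooling would still need its own include/exclude rules.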

CSSFrancis commented 1 year ago

If we move all data files to one folder, data, that we keep in the main repository, we should be able to use the export-ignore attribute in .gitattributes to avoid shipping that folder with a release. Then pooch can fetch the data from there if needed, but it stays in the same repo so that we avoid complicating the merge procedure (see e.g. https://softwareengineering.stackexchange.com/a/367655)

This seems like a good way to handle things. Just so I understand this: export-ignore only applies to releases etc., and if I were to clone the repo I would still get the test files?

In that case is there much of a reason to have pooch fetch the data? Maybe if we want to make the .data attribute public but I kind of feel like no one really wants to see any of the test data. It would be far more useful to make a curated repo of good datasets and have those available for people to play with.

hakonanes commented 1 year ago

I don't know how pooch work but from what I read this is what I would expect.

@ericpre, yes, Pooch is quite straight-forward to use. One caveat I encountered with kikuchipy was that I wanted to update a file already registered with Pooch. In cases like this it is important that files are served from permanent locations, like a permanent link from GitHub (pointing to a commit, as specified here) or a version of a Zenodo repo. One can then just update the file and point the registry to the new permanent link or Zenodo repo version.

Well, could we keep the data in the same repository but exclude it from releases?

An alternative to git attributes is to specify files to include/exclude in a MANIFEST.in file or similar.

ericpre commented 1 year ago

To be more precise, I think that the use case we are interested in is:

Would it be possible to have a function like this to access the data:

def get_file_path(file_path):
    # `file_path` is the path to the file relative to the test directory
    if os.path.exists(file_path):
        return file_path
    else:
        # download the file to the cache and get the path
        path = ...
        return path

It may need some tweaks to play well with the pooch registry, but if this is possible, then when adding a test file in a PR we will only need to update the pooch registry rather than create an additional PR. In a development install from a local git repository, the file will be available through git and nothing will be downloaded; for a released version, the file will be downloaded and accessed from the cache.
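A sketch of that function with the download step injected as a parameter; the `fetch` hook and the names here are illustrative, not existing rosettasciio API:

```python
import os


def get_file_path(file_path, fetch=None):
    """Return the path to a test file, downloading only when missing.

    `file_path` is relative to the test directory; `fetch` stands in for
    a pooch call such as ``lambda p: registry.fetch(p)``.
    """
    if os.path.exists(file_path):
        # Development install: the file is tracked by git, use it directly.
        return file_path
    # Released version: download into the pooch cache and use that copy.
    return fetch(file_path)
```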

hakonanes commented 1 year ago

Your example snippet is exactly what we do in kikuchipy. We can bypass Pooch, but still have to check whether the hash is correct.

ericpre commented 1 year ago

Thanks @hakonanes for the information. In the case of rosettasciio (test data files), do we need to check the hash? Git will manage the files, and in the development scenario it is actually the opposite: we may want to change a file and run the test suite against the new file. The CI will use the local files after checking out the repository using git, and pooch should only be used in the conda-forge feedstock, to check a user installation, etc.

hakonanes commented 1 year ago

The hash check is mainly a security measure, which I guess can be bypassed in a test where the file origin can be verified otherwise.

CSSFrancis commented 1 year ago

Just so I get this right the idea is to:

  1. Keep all of the current data files stored in the repo.
  2. Add the ability to source files remotely using pooch.
  3. Make sure extra files aren't packaged with releases.

Kind of wondering if there is a good use case for 2 or if we are reaching there. I like the concept of remote data storage, and it is necessary if you are trying to store things over 100 MB. What I don't love is having multiple sources all over, because that seems hard to track.

A couple of things to consider.

  1. On a major release the test datasets are archived, then those files are removed from the repo. We first check to see if there is a local file and, if not, fall back to downloading the archived one. This probably requires some kind of fancy GitHub actions.
  2. Split rosettasciio. We could have separate repos for different file types (STEM, EBSD, spectral, etc.), which could make it easier to create a feedstock for the main rosettasciio repo. It would also allow extensions to be developed.
  3. Keep things as they are but add pooch for those who might need it.
  4. Keep things as they are and force people to use small files / compress or zip files.

I'm more for 3 or 4. It might be worth seeing if zipping the data files makes sense. Most of the files are mostly zeros, so we could probably reduce their size by a factor of 5 at least. Pooch just seems to add lots of complications. It is great for things like adding data for API access, but for testing data it seems to fall flat.

ericpre commented 1 year ago

What I don't love is having multiple sources all over because that seems hard to track.

There will be only one source; the only difference is how it is accessed: either locally (development scenario) or through pooch when it is not included in the package (distribution scenario).

  1. On a major release the test datasets are archived, then those files are removed from the repo. We first check to see if there is a local file and, if not, fall back to downloading the archived one. This probably requires some kind of fancy GitHub actions.

There shouldn't be anything to do for this point, and there is nothing special about the life cycle of the test data files in the git repository. To be clearer, there is an example with the edax test files, which is currently handled manually; pooch will standardise this so that it is easier to extend its usage.

The main point is to reduce the size of the wheels and source dist, and not be limited by them.

I'm more for 3 or 4.

Yes, 3 and 4. :)

It might be worth seeing if zipping the data files makes sense. Most of the files are mostly zeros, so we could probably reduce their size by a factor of 5 at least.

We had a look at this, and it is done for some test files (for example FEI emd), but it isn't convenient in the development scenario... The problem is that it is quite easy to have files in the 50-200 KB range, and it adds up...

Pooch just seems to add lots of complications. It is great for things like adding data for API access, but for testing data it seems to fall flat.

Yes, this is a very valid question. I expect that the initial effort to set it up will also be useful for adding a registry of datasets. In my opinion, we need to be able to test against the files installed in different environments, and at the same time, distributing wheels or conda packages that are 30 MB is not good.

jlaehne commented 1 year ago

Hurray, @ericpre has started implementing this approach: see #123