pyxem / orix

Analysing crystal orientations and symmetry in Python
https://orix.readthedocs.io
GNU General Public License v3.0

AF96 datasets #409

Closed argerlt closed 1 year ago

argerlt commented 1 year ago

Description of the change

Adds 96 EBSD scans taken from AF96 high-strength, low-alloy steels. The files are too large for a free GitHub account, so this PR uses Google Drive and the gdown package instead of pooch and GitHub LFS.

The code works, but there are no tests yet. It also needs some warnings, cleanup, etc., before being ready for approval.

Also, this PR is a 50/50 collaboration between myself and Ashley Lenau.

Progress of the PR

Minimal example of the bug fix or new feature

Example that downloads 20 individual AF96 EBSD scans, each 512 x 512 pixels:

    >>> import numpy as np
    >>> from orix import data
    >>> xmaps = data.af96_martensitic_steels(dataset='small', subset=np.arange(20))
    >>> xmaps

For reviewers

argerlt commented 1 year ago

@hakonanes, I think I need help with this. I believe the checks are failing because I used the module gdown, which the checker doesn't recognize.

As quick background: I converted a series of EBSD scans of AF96 martensitic steel into orix-friendly .h5 files. These datasets have been used in a handful of papers such as this one, and they are simply a useful large dataset to have for verification of machine learning models. However, the data is too large to fit in a free GitHub repo (one dataset is 5 maps of 2 million pixels each, another is 90 512x512 scans), so I used Google Drive to host them instead, which has more lenient download options. Pooch does not work with Google Drive, but gdown does.
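
For illustration, downloading one of the hosted files with gdown looks roughly like this (the file ID below is just a placeholder, not a link to the actual data):

    # Minimal sketch: fetch one hosted scan with gdown. The file ID is a
    # placeholder; gdown handles Google Drive's confirmation step that a plain
    # HTTP request (and hence pooch) does not.
    import gdown

    file_id = "PLACEHOLDER_DRIVE_FILE_ID"
    gdown.download(
        f"https://drive.google.com/uc?id={file_id}",
        "af96_scan_000.h5",
        quiet=False,
    )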

hakonanes commented 1 year ago

which the checker doesn't recognize.

gdown is not available in the test environment. It must be listed as a dependency in the install_requires list in setup.py.
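
For illustration only (the rest of setup.py is abbreviated/assumed here), the change would look something like:

    # Sketch of the relevant part of setup.py; everything except the added
    # "gdown" entry is assumed or abbreviated.
    from setuptools import setup, find_packages

    setup(
        name="orix",
        packages=find_packages(),
        install_requires=[
            # ... existing dependencies ...
            "gdown",
        ],
    )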

As an aside, I consider gdown an OK dependency of orix. It carries an MIT license and all its dependencies are already dependencies of orix or other packages installed with pip install orix[dev].

The intention when we introduced the orix.data module was for it to contain data used in the documentation and/or tests of orix (the module's one-line docstring description reads "Test data"). I know you've talked about contributing (M)ODF functionality and such to orix (great!). Is the intention to use this data in the documentation and tests of that functionality? If so, would it be possible to wait to add these datasets until that happens? If not, I think a viable alternative is to add an "Open datasets" page to the orix docs with links to open datasets and examples showing how to download them. We have pages for this as part of the pyxem (https://pyxem.readthedocs.io/en/stable/open_datasets_workflows.html) and kikuchipy (https://kikuchipy.org/en/stable/user/open_datasets.html) package docs. What do you think?

argerlt commented 1 year ago

Is the intention to use this data in the documentation and tests of this functionality?

No, the intention was just to get the datasets out there, as they are a rather large collection of homogeneous material and therefore handy for testing out ML code for things like prior-austenite reconstruction.

I think a viable alternative is to add an "Open datasets" page to the orix docs with links to open datasets and showing how to download them

I like this idea. I could turn the download function and registry dictionary into its own separate "AF96_Download_example.py" file stored in the orix/docs/ folder, then link to that example code from an "Open_datasets.rst" page. Alternatively, I could host the download example on Google Drive, but I would prefer to keep all the code in orix and only the data on Google Drive.
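
As a rough sketch of what that file could contain (the file names and Drive IDs below are placeholders, not the actual registry):

    # Hypothetical AF96_Download_example.py: a registry mapping file names to
    # Google Drive file IDs plus a small download helper. All IDs are placeholders.
    import os

    import gdown

    AF96_REGISTRY = {
        "af96_small_000.h5": "PLACEHOLDER_ID_0",
        "af96_small_001.h5": "PLACEHOLDER_ID_1",
    }

    def download_af96(filename, output_dir="."):
        """Download one AF96 EBSD scan from Google Drive via gdown."""
        url = f"https://drive.google.com/uc?id={AF96_REGISTRY[filename]}"
        return gdown.download(url, os.path.join(output_dir, filename), quiet=False)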

hakonanes commented 1 year ago

No, intention was just to get the datasets out there

Have you considered uploading them to Zenodo? If you upload them file by file, they can be downloaded file by file as well. (Just don't upload everything as a single .zip file, since it can then only be downloaded as that single .zip file!) A file can then be downloaded like this:

from urllib.request import urlretrieve

# urlretrieve returns a (local_filename, headers) tuple
files = urlretrieve(
    url='https://zenodo.org/record/<record-number>/files/data.zip',
    filename='./downloaded_data.zip'
)
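
As a follow-up sketch, assuming the individual .h5 files are in a format orix.io.load can read (the record number and file name below are placeholders), one scan could then be fetched and loaded directly:

    # Sketch: fetch one individual scan from a Zenodo record and load it with orix.
    from urllib.request import urlretrieve

    from orix import io

    urlretrieve(
        "https://zenodo.org/record/<record-number>/files/af96_scan_000.h5",
        "af96_scan_000.h5",
    )
    xmap = io.load("af96_scan_000.h5")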

I think Zenodo is the best platform for permanent open storage of large datasets.

I could turn the download function and registry dictionary into it's own separate "AF96_Download_example.py" file stored in orix/docs/ folder, then link to the example code in an "Open_datasets.rst" page.

I think this is doable. I'm a little reluctant to link to an untested file, though. We should make sure it doesn't do very fancy parsing so that it works in the future as well.

If you add a file doc/open_datasets.rst in the same vein as kikuchipy's file, I can help update it and review.

Lastly, if we find it useful down the line to use any of these datasets in a tutorial in the docs, we can add them to the data module then!

pc494 commented 1 year ago

Zenodo is almost certainly the correct place for such a dataset, especially as people can then cite the dataset directly.

I'm not massively enthusiastic about automating file downloads in general, unless we have examples/automated processes that depend on doing so.

argerlt commented 1 year ago

I agree with both of you; Zenodo makes more sense, if for no other reason than that it's good to keep methods consistent within a package. Google Drive works, but it's not objectively superior to Zenodo in any way that justifies multiple download methods or using both gdown and pooch.
Unless someone objects, I'm going to make a new draft PR following the suggestions above, so as not to clutter the commit history.

hakonanes commented 1 year ago

Uploading a file to Zenodo is the closest one gets to making it available forever, I think. Zenodo sprang out of CERN, and the storage is hosted by them. Just to cite one of their FAQs from https://help.zenodo.org/:

Is my data safe with you / What will happen to my uploads in the unlikely event that Zenodo has to close?

Yes, your data is stored in CERN Data Center. Both data files and metadata are kept in multiple online and independent replicas. CERN has considerable knowledge and experience in building and operating large scale digital repositories and a commitment to maintain this data centre to collect and store 100s of PBs of LHC data as it grows over the next 20 years. In the highly unlikely event that Zenodo will have to close operations, we guarantee that we will migrate all content to other suitable repositories, and since all uploads have DOIs, all citations and links to Zenodo resources (such as your data) will not be affected.

I plan to make an "Open datasets" page for the orix docs within the next few weeks. I think pointing to your files on Google Drive is the best option here. I'd be happy to review your PR!

argerlt commented 1 year ago

Closing this, then opening up a related issue before making another PR.