pyxem / pyxem-demos

Examples and tutorials of multi-dimensional diffraction microscopy workflows using pyxem.

11 Accelerated orientation mapping with template matching: Where to find the input dataset #83

Closed: uellue closed this issue 1 year ago

uellue commented 1 year ago

Notebook 11 tries to load its input data like this: `experimental_data = hs.load("data/sample_with_g.hspy", lazy=True)`. I couldn't find the data in the repository or in the Google Drive linked in the README, and the notebook itself contains no clues about where to get it.

CSSFrancis commented 1 year ago

@uellue Thanks for the bug report. I realized I haven't been watching the pyxem-demos bug reports very closely, so sorry for the delay.

@din14970 do you have the dataset somewhere that we can link to?

din14970 commented 1 year ago

It's been a long time. I vaguely recall there was a Google Drive or something where the data initially lived; at the time I didn't really know the best practices for managing data.

In any case, I went back through my local archives and found the relevant files. I've uploaded them as a zip attachment here: https://drive.google.com/file/d/1CaWSfwNoupmZTTBm5VTVbTb0uSt_3Phm/view?usp=drivesdk. What are you using now to put the data into the repo, git LFS?

CSSFrancis commented 1 year ago

> It's been a long time. I vaguely recall there was a Google Drive or something where the data initially lived; at the time I didn't really know the best practices for managing data.

@din14970 No worries! I think it got lost at some point. I compressed the data by a fair amount to fit it into the GitHub repo, in the hope that keeping the data in one place will stop things like this from happening.

> In any case, I went back through my local archives and found the relevant files. I've uploaded them as a zip attachment here: https://drive.google.com/file/d/1CaWSfwNoupmZTTBm5VTVbTb0uSt_3Phm/view?usp=drivesdk. What are you using now to put the data into the repo, git LFS?

To be honest, I didn't realize that GitHub had an option for Large File Storage! In any case, as long as the entire repo is under 5 GB and each individual file is under 100 MB, GitHub has no problems and you don't have to deal with LFS, so we try to keep files smaller than 100 MB.

Eventually the plan is to use pooch and maybe have the data available directly through calls in pyxem.

Since @uellue is here, I wonder if it might be possible to make a repository of good compressed 4D STEM datasets that are easily downloadable and available in different formats. For most people these days I think the sweet spot would be around 500 MB compressed. That can easily be around 3 GB uncompressed, so it still feels like a real 4D STEM dataset without taking a long time to download.

Zenodo will host up to 50 GB per record, so we could also keep larger datasets there and just fetch them with pooch.
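
For reference, a pooch-based fetch could look roughly like the sketch below; the Zenodo record URL and sha256 hash are placeholders, not a real pyxem registry entry.

```python
# Minimal pooch sketch: download a demo file once, cache it locally,
# and verify its checksum. The Zenodo record and sha256 hash below are
# placeholders, not a real pyxem registry entry.
import pooch
import hyperspy.api as hs

demo_data = pooch.create(
    path=pooch.os_cache("pyxem-demos"),
    base_url="https://zenodo.org/record/0000000/files/",  # placeholder record
    registry={
        "sample_with_g.hspy": "sha256:" + "0" * 64,  # placeholder hash
    },
)

# fetch() downloads on the first call and reuses the cached copy afterwards.
file_path = demo_data.fetch("sample_with_g.hspy")
experimental_data = hs.load(file_path, lazy=True)
```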

CSSFrancis commented 1 year ago

@din14970 Do you have the dataset that you used for the paper as well? It might be nice to host it on something like the Materials Data Facility. They seem to be the best at handling datasets of whatever size you want, and they also have a Globus endpoint, so it's fairly easy to transfer larger files.

din14970 commented 1 year ago

Yes, they are available on Zenodo, described as well as I could manage. I link to them in the README of the associated notebook repo: https://github.com/din14970/pyxem_template_matching_workflows.

din14970 commented 1 year ago

> we try to keep files smaller than 100 MB.

The example files for notebook 11 are normally smaller than 100 MB; they're a small piece cropped from the larger scan used in the paper.
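
For context, a cropped sub-scan like that can be made in a couple of lines of HyperSpy; this is just a minimal sketch, with made-up file names and index ranges rather than the ones actually used.

```python
# Minimal sketch of cropping a small sub-scan out of a larger 4D STEM
# scan with HyperSpy. File names and index ranges are made up, not the
# ones used for the actual demo file.
import hyperspy.api as hs

full_scan = hs.load("full_scan.hspy", lazy=True)

# .inav slices the navigation (scan) axes only; the diffraction
# (signal) axes are left untouched.
cropped = full_scan.inav[100:200, 100:200]
cropped.save("sample_with_g.hspy", overwrite=True)
```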

CSSFrancis commented 1 year ago

> The example files for notebook 11 are normally smaller than 100 MB; they're a small piece cropped from the larger scan used in the paper.

Yeah, I added it in #85. I also changed the dataset to a zarr file, as I am trying to push people in that direction :)
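
For anyone following along, a minimal sketch of the zarr round trip, assuming HyperSpy's zarr-based .zspy format; the compressor argument is shown only as an illustration, not necessarily what #85 uses.

```python
# Minimal sketch of converting to and loading HyperSpy's zarr-based
# .zspy format. The numcodecs compressor is an illustration of how the
# file can be shrunk for the repo; it is not necessarily what #85 uses.
import hyperspy.api as hs
from numcodecs import Blosc

s = hs.load("data/sample_with_g.hspy", lazy=True)
s.save("data/sample_with_g.zspy", compressor=Blosc(cname="zstd", clevel=5))

# Loading is the same call as for .hspy files, lazily or not.
experimental_data = hs.load("data/sample_with_g.zspy", lazy=True)
```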

CSSFrancis commented 1 year ago

Closed by #85