usnistgov / oar-pdr

The NIST Open Access to Research (OAR) Public Data Repository (PDR) system software

Feature Request: implementation of a pooch downloader for the PDR #350

Open jat255 opened 1 week ago

jat255 commented 1 week ago

Hi @RayPlante, not sure if this is the best place to put this, but I was hoping the team might consider a small effort to work with the pooch project to support programmatic downloading of data from data.nist.gov. If you're not familiar, pooch has become a pretty widely used tool in the scientific Python community for downloading datasets and other web resources, with built-in caching and some other nifty tricks.

The coolest part (to me) is the DOIDownloader class that allows you to say pooch.retrieve("doi:10.6084/m9.figshare.14763051.v1/tiny-data.txt"), and it will parse the DOI and download the underlying data all at once. Currently, there is support in this class for figshare, Zenodo, and Dataverse instances. I think adding support for the NIST PDR could do a lot for interoperability.
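For readers unfamiliar with the call pattern, here is a minimal sketch of fetching a file through a DOI-style URL. `split_doi_url` is a hypothetical helper written for this illustration (it is not part of pooch) and only shows the "DOI plus file name" shape of the string. `fetch_by_doi` wraps `pooch.retrieve`, passing `known_hash=None` to skip checksum verification, which is only reasonable for a quick experiment.

```python
def split_doi_url(url):
    """Split 'doi:<DOI>/<file name>' into (doi, file_name).

    Hypothetical helper for illustration only. It assumes the file name
    is the final '/'-separated component, which may not match how pooch
    itself treats every repository.
    """
    prefix = "doi:"
    if not url.startswith(prefix):
        raise ValueError(f"not a DOI-style URL: {url!r}")
    doi, sep, file_name = url[len(prefix):].rpartition("/")
    if not sep or not doi or not file_name:
        raise ValueError(f"expected 'doi:<DOI>/<file name>', got {url!r}")
    return doi, file_name


def fetch_by_doi(url):
    """Download the file behind a DOI URL, or reuse the cached copy."""
    # Imported lazily so split_doi_url works without pooch installed.
    import pooch

    # known_hash=None skips checksum verification; pin a hash in real use.
    return pooch.retrieve(url=url, known_hash=None)
```

Calling `fetch_by_doi("doi:10.6084/m9.figshare.14763051.v1/tiny-data.txt")` should return the local path of the downloaded file, assuming pooch is installed and the repository is reachable.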

One use case internally: our package ETSpy includes a few datasets for testing and demonstration that are currently distributed with the package (not ideal, as it bloats the package size). The common way of dealing with this is to host the files in a repository somewhere and then use pooch to fetch them on demand and cache them for later use. Most commonly, Zenodo is used for this, but since it's a NIST project, it would be preferred (required?) to host those in the PDR. Being able to easily use pooch for that with a DOI would be great.
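The fetch-on-demand pattern described above could look roughly like the sketch below. It is an assumption-laden illustration, not ETSpy's actual code: `sha256_of` and `make_fetcher` are hypothetical helper names, the cache folder name "etspy" is a guess, and the base URL and file names are left to the caller. The only pooch calls used are `pooch.create` and `pooch.os_cache`.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return 'sha256:<hex digest>' for a file, usable as a registry hash."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def make_fetcher(base_url, registry):
    """Build a pooch fetcher that downloads on first use and caches locally.

    `registry` maps each file name to its expected hash. The cache folder
    name "etspy" is an assumption made for this sketch.
    """
    # Imported lazily so sha256_of works without pooch installed.
    import pooch

    return pooch.create(
        path=pooch.os_cache("etspy"),
        base_url=base_url,
        registry=registry,
    )
```

With a fetcher built this way, `fetcher.fetch("some-file.hdf5")` (a hypothetical file name) would download the file on the first call and return the cached path on later calls, verifying the hash each time.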

Assuming the pooch team is open to it, I may have some cycles to work on this interoperability bit if it's of interest to the team.

jat255 commented 1 week ago

I just went ahead and did it today: https://github.com/fatiando/pooch/pull/442

GRG2 commented 1 week ago

@jat255 this looks interesting, very cool.