ropensci-archive / wishlist

:no_entry: ARCHIVED :no_entry:
https://discuss.ropensci.org/c/wishlist/6

datadoi: simple repository data access #31

Closed noamross closed 2 years ago

noamross commented 6 years ago

I started a stub repo for this: https://github.com/ropenscilabs/doidata. From the README:

Introduction

At rOpenSci and in associated open science groups, we often encourage scientists to deposit data in public repositories that have stable, long-term archival infrastructure and robust metadata. Such repositories include Zenodo, Figshare, Dryad, and a variety of more specific ones. A frequent mode of use is to download files from these repositories, break the link with the original version or metadata, and include some portion or derived form of these data in a new project folder. This leads to fragmentation of data.

One of the reasons for this use mode is that API navigation of these repositories can be daunting or overly complex. On the other hand, R data packages are a popular way to distribute data in a form that is very easy to use, but this approach is R-specific and breaks the connection to archival repositories.

The aim of this package is to simplify the workflow of using archived data to a single line, like so:

my_data <- datadoi::get_data("10.1234/somerecord987/FILENAME.csv")

This would parse the DOI, navigate the repository API (Figshare, Zenodo, Dryad, Open Science Framework, etc.) to find the associated file, and download it. If the repository has metadata describing how the data should be parsed, it will be used. Otherwise it can guess using rio or take an argument to return the information raw or write it to disk.
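As a rough illustration of the first step, here is a minimal sketch of how the identifier might be split into a DOI and a file name, and how the DOI prefix might hint at the hosting repository. The helpers `split_doi_path()` and `guess_repository()` are hypothetical and not part of any existing package. The split assumes the last "/"-separated segment is the file name, which is only a heuristic, because DOI suffixes can themselves contain slashes.

```r
# Hypothetical helper: split "<doi>/<filename>" into its parts.
# Assumes the final "/"-separated segment is the file name; real DOI
# suffixes may themselves contain "/", so this is only a heuristic.
split_doi_path <- function(x) {
  # Drop an optional resolver URL or "doi:" prefix.
  x <- sub("^(https?://(dx\\.)?doi\\.org/|doi:)", "", x, ignore.case = TRUE)
  parts <- strsplit(x, "/", fixed = TRUE)[[1]]
  if (length(parts) < 3 || !grepl("^10\\.[0-9]{4,9}$", parts[1])) {
    stop("Expected something like '10.1234/record/FILENAME.csv'", call. = FALSE)
  }
  n <- length(parts)
  list(
    prefix = parts[1],                          # registrant prefix, e.g. "10.1234"
    doi    = paste(parts[-n], collapse = "/"),  # everything except the file name
    file   = parts[n]
  )
}

# Illustrative lookup from DOI registrant prefix to repository.
repo_prefixes <- c(
  "10.5281"  = "zenodo",
  "10.6084"  = "figshare",
  "10.5061"  = "dryad",
  "10.17605" = "osf"
)

# Hypothetical helper: returns "unknown" for prefixes not in the table.
guess_repository <- function(prefix) {
  repo <- unname(repo_prefixes[prefix])
  if (is.na(repo)) "unknown" else repo
}

ref <- split_doi_path("10.1234/somerecord987/FILENAME.csv")
guess_repository(ref$prefix)
```

A prefix table like this would only be a first guess. A real implementation would more likely resolve the DOI and inspect the landing page or registry metadata before choosing a repository client.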

Some notes:

  • Non-archival/DOI-granting sources (GitHub, data.world) could be supported, but these would be secondary, as the goal would be to encourage use of archival repositories
  • Versioning would be handled on the repository side, though get_data() could take a version= argument for those repos that have versions but not versioned DOIs (e.g., Figshare)
  • Download caching would be optional
  • GitHub-linked Zenodo repositories unfortunately don't store files individually, but as a single ZIP of the GitHub release. The package should detect this, then download, (cache), unzip, and retrieve individual files automatically.
  • There might be an option or function to return a citation, too, though mostly the idea here is that by keeping the DOI in the code you maintain a conceptual link to the original record.
  • This probably requires up-to-date packages for the key repositories (Figshare, Zenodo, OSF, Dryad, DataONE), though quick-and-dirty methods might be doable for some repositories rather than waiting for full-blown API clients.
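
The caching and ZIP-handling notes above could be sketched roughly as follows, using only base R. All three function names are hypothetical, as is the `"datadoi"` cache directory name. The sketch assumes a release archive nests its files under a single top-level folder, so the requested file is matched on its trailing path.

```r
# Hypothetical helper: find the one archive entry that corresponds to `file`.
# Release archives usually nest everything under a top-level folder, so
# match on the trailing path rather than the full entry name.
match_archive_entry <- function(entries, file) {
  hits <- entries[entries == file | endsWith(entries, paste0("/", file))]
  if (length(hits) != 1) {
    stop("Expected exactly one match for '", file, "', found ",
         length(hits), call. = FALSE)
  }
  hits
}

# Hypothetical helper: call fetch(dest) only when `dest` is not already on disk.
cached_fetch <- function(dest, fetch) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    fetch(dest)
  }
  dest
}

# Sketch of the full path: download the ZIP once, extract one file once.
# Not run here because it needs network access and a real archive URL.
get_from_release_zip <- function(zip_url, file,
                                 cache_dir = tools::R_user_dir("datadoi", "cache")) {
  # Make the URL's last segment safe to use as a file name.
  zip_name <- gsub("[^A-Za-z0-9._-]", "_", basename(zip_url))
  zip_path <- cached_fetch(
    file.path(cache_dir, zip_name),
    function(dest) utils::download.file(zip_url, dest, mode = "wb")
  )
  entry <- match_archive_entry(utils::unzip(zip_path, list = TRUE)$Name, file)
  exdir <- paste0(zip_path, "-contents")
  cached_fetch(
    file.path(exdir, entry),
    function(dest) utils::unzip(zip_path, files = entry, exdir = exdir)
  )
}

entries <- c("repo-1.0/", "repo-1.0/README.md", "repo-1.0/data/FILENAME.csv")
match_archive_entry(entries, "FILENAME.csv")
```

Passing the download step in as a function keeps the caching logic independent of any particular repository client, which fits the idea of making caching optional.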

rgangu commented 6 years ago

any progress on this?

sckott commented 6 years ago

@rgangu there is a repo started, did you check it out yet? there's not much there yet, but there are some discussions happening if you want to dive into those: https://github.com/ropenscilabs/doidata/issues

maelle commented 2 years ago

Thank you!

Note that future ideas should go to our wishlist forum category.