ropensci / EDIutils

An API Client for the Environmental Data Initiative Repository
https://docs.ropensci.org/EDIutils/

Enhance input to api_get_provenance_metadata to accept URLs and DOIs #17

Open srearl opened 4 years ago

srearl commented 4 years ago

api_get_provenance_metadata is a fantastic resource, but I ran into a case where I needed provenance information and had only the DOI and/or URL of the dataset rather than the project identifier (e.g., knb-lter-xxx.x.x). Below is an R-based MRE, using a dataset from BNZ, that I used to address this task. It seems the utility of api_get_provenance_metadata would increase if it natively accepted a dataset DOI or URL in addition to the project identifier.

MRE (in R):

library(rvest)
library(EDIutils)
library(EML)
library(dplyr)
library(stringr)

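# DOI of the dataset whose provenance is needed
url <- "https://doi.org/10.6073/pasta/31b32868ddbb099c4b5480fb00eb2481"

# the DOI resolves to the data package landing page
landingPage <- read_html(url)

# pull the text of the landing page elements that include the package ID
pageSubset <- landingPage %>%
  html_nodes(".no-list-style") %>%
  html_text()

# keep the leading token of the first entry that mentions "knb-lter-"
packageId <- str_extract(grep("knb-lter-", pageSubset, value = TRUE)[[1]], "^\\S*")

# fetch the provenance metadata and drop the `@context` and `@type` fields
packageProv <- emld::as_emld(EDIutils::api_get_provenance_metadata(packageId))
packageProv$`@context` <- NULL
packageProv$`@type` <- NULL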

# desired output
packageProv

clnsmth commented 4 years ago

Thanks for this suggestion @srearl! I agree that DOIs and URLs may be more common to users but I'm a little wary of adding (and maintaining) support for DOI and URL inputs to this function because:

  1. It creates a precedent for extending support to all other API functions.
  2. URLs (if you mean data package URLs) may change and break workflows.
  3. The package ID is conspicuously listed on the data package landing page.

Can you tell me more about your use case and why package IDs may be challenging for users?

srearl commented 4 years ago

Hi Colin,

  1. I do not know this package very well but can sympathize with this point.
  2. True and definitely a point to consider but they change very rarely (I think rarely, anyway). Perhaps a compromise here, if of interest to explore this further, would be to support DOIs but not URLs. In my case (below), I was provided mostly DOIs.
  3. Indeed. However, the reason that this became an issue for me is that I was provided ~30 DOIs (and a few URLs) - too many to be practical to visit each landing page and harvest the package ID. The MRE that I provided was pulled from a script that I used to loop over the list.
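
For context, the loop over the list can be sketched roughly as follows. This is a minimal sketch, not the original script: `scrape_package_id()` and `resolve_package_ids()` are hypothetical names introduced here, the scraping step simply repeats the MRE above (so it assumes the same landing page layout and a "knb-lter-" scope), and any DOI that fails to resolve is returned as NA rather than stopping the loop.

```r
# Hypothetical helper wrapping the MRE: scrape the package ID from a landing page.
# Assumes the rvest and stringr packages are installed and the page layout of the MRE.
scrape_package_id <- function(url) {
  page_text <- rvest::html_text(
    rvest::html_nodes(rvest::read_html(url), ".no-list-style")
  )
  stringr::str_extract(grep("knb-lter-", page_text, value = TRUE)[[1]], "^\\S*")
}

# Hypothetical helper: apply `resolver` to each identifier, returning a named
# character vector with NA for any identifier that could not be resolved.
resolve_package_ids <- function(ids, resolver = scrape_package_id) {
  vapply(
    ids,
    function(id) {
      tryCatch(
        as.character(resolver(id))[1],      # keep a single value per identifier
        error = function(e) NA_character_   # a failed lookup should not stop the loop
      )
    },
    character(1),
    USE.NAMES = TRUE
  )
}
```

With a character vector `dois`, `resolve_package_ids(dois)` would give one package ID (or NA) per DOI, each of which could then be passed to api_get_provenance_metadata.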

clnsmth commented 4 years ago

Agreed @srearl, manually parsing that list would be onerous. I'm moving this into the queue with the caveat that it should be implemented for all EDI API functions in this package.

clnsmth commented 2 years ago

The least intrusive implementation here might be a mapping function that takes one of a package ID, a DOI, or a URL and returns the other two IDs, which can be passed to downstream functions.
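
A first step for such a mapping function would be working out which kind of identifier it was given. The sketch below is only an illustration of that step, not part of EDIutils: `classify_identifier()` is a hypothetical name, and the patterns assume that package IDs look like scope.identifier.revision (e.g., knb-lter-xxx.x.x), that DOIs start with a "10." registrant prefix (optionally behind "doi:" or a doi.org resolver address), and that anything else starting with http(s) is a URL.

```r
# Hypothetical helper (not an EDIutils function): classify a single identifier
# as "doi", "package_id", or "url"; returns NA if none of the patterns match.
classify_identifier <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  x <- trimws(x)

  # Strip an optional "doi:" or doi.org resolver prefix before testing for a DOI
  bare <- sub("^(doi:|https?://(dx\\.)?doi\\.org/)", "", x, ignore.case = TRUE)

  if (grepl("^10\\.[0-9]{4,9}/[^[:space:]]+$", bare)) {
    "doi"
  } else if (grepl("^[A-Za-z][A-Za-z0-9-]*\\.[0-9]+\\.[0-9]+$", x)) {
    # Assumed package ID shape: scope.identifier.revision
    "package_id"
  } else if (grepl("^https?://", x, ignore.case = TRUE)) {
    "url"
  } else {
    NA_character_
  }
}
```

A mapping function could dispatch on this result and look up the remaining two identifiers. The DOI is tested first so that a doi.org address is treated as a DOI rather than as a generic URL.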