ropensci / EML

Ecological Metadata Language interface for R: synthesis and integration of heterogenous data
https://docs.ropensci.org/EML

Function to convert DataCite metadata to EML: good fit for this package? #341

Open peterdesmet opened 2 years ago

peterdesmet commented 2 years ago

👋 I have written a function that converts DataCite metadata to EML, with the DOI as the parameter. I'm planning to use this for at least 2 packages and was wondering if this function would be a good fit to be added to the EML package?

Context

I'm in the process of publishing bird tracking datasets, which I have already published on Zenodo, to GBIF as well, to open them up to a wider audience. One of the steps in the process is converting the dataset metadata to EML, which can then be uploaded to a GBIF IPT for publication. I don't want to do this manually, which is why I wrote a function. To keep it generic rather than Zenodo-specific, I'm pulling the metadata from the DataCite.org API (rather than the Zenodo API), which is where research repositories push their metadata when they mint a DOI through DataCite.
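As a rough illustration of the lookup step, here is a minimal sketch of how a DOI given either as a URL or as a bare identifier could be turned into a DataCite REST API request. The helpers `normalise_doi()` and `datacite_api_url()` are hypothetical names invented for this sketch and are not part of movepub or EML; the DOI pattern check is a simplification, and the commented-out fetch assumes the jsonlite package and network access.

```r
# Hypothetical helper: strip an optional resolver prefix from a DOI.
normalise_doi <- function(doi) {
  doi <- sub(
    "^(https?://(dx\\.)?doi\\.org/|doi:)", "", trimws(doi),
    ignore.case = TRUE
  )
  # Simplified sanity check, not a full DOI grammar.
  if (!grepl("^10\\.[0-9]{4,9}/.+$", doi)) {
    stop("Not a valid-looking DOI: ", doi)
  }
  doi
}

# Hypothetical helper: build the DataCite REST API URL for one DOI.
datacite_api_url <- function(doi) {
  paste0("https://api.datacite.org/dois/", normalise_doi(doi))
}

# Both input forms resolve to the same request URL.
url_from_link <- datacite_api_url("https://doi.org/10.5281/zenodo.5879096")
url_from_bare <- datacite_api_url("10.5281/zenodo.5879096")

# Fetching the record would then look roughly like this (not run here):
# metadata <- jsonlite::fromJSON(url_from_bare, simplifyVector = FALSE)$data$attributes
```

Keeping the normalisation separate from the request makes the DOI handling testable without network access.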

Functionality

library(movepub)
doi <- "https://doi.org/10.5281/zenodo.5879096" # Also works as "10.5281/zenodo.5879096"
datacite_to_eml(doi)
#> $dataset
#> $dataset$title
#> [1] "O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium)"
#> 
#> $dataset$abstract
#> $dataset$abstract$para
#> [1] "<![CDATA[<em>O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium)</em> is a bird tracking dataset published by the Research Institute for Nature and Forest (INBO). It contains animal tracking data collected by the LifeWatch GPS tracking network for large birds (http://lifewatch.be/en/gps-tracking-network-large-birds) for the project/study <strong>O_WESTERSCHELDE</strong>, using trackers developed by the University of Amsterdam Bird Tracking System (UvA-BiTS, http://www.uva-bits.nl). The study has been operational since 2018. In total 13 individuals of Eurasian oystercatchers (<em>Haematopus ostralegus</em>) have been tagged in their breeding area in East Flanders (Belgium), west of the river Scheldt, mainly to study their habitat use on mudflats of the Western Scheldt (the Netherlands). Data are uploaded from the UvA-BiTS database to Movebank and from there archived on Zenodo (see https://github.com/inbo/bird-tracking). No new data are expected. Data in this package are exported from Movebank study <em>O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium)</em> (Movebank Study ID 1099562810), which can be viewed at https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study1099562810. Fields in the data follow the Movebank Attribute Dictionary (http://vocab.nerc.ac.uk/collection/MVB) and are described in <code>datapackage.json</code>. <strong>Files</strong> Files are structured as a Frictionless Data Package. You can access all data in R via <code>https://zenodo.org/record/5879096/files/datapackage.json</code> using frictionless. <strong>datapackage.json</strong>: technical description of the data files. <strong>O_WESTERSCHELDE-reference-data.csv</strong>: reference data about the animals, tags and deployments. <strong>O_WESTERSCHELDE-gps-yyyy.csv.gz</strong>: GPS data recorded by the tags, grouped by year. 
<strong>O_WESTERSCHELDE-acceleration-yyyy.csv.gz</strong>: acceleration data recorded by the tags, grouped by year.]]>"
#> [2] "This dataset was collected using infrastructure provided by INBO and funded by Research Foundation - Flanders (FWO) as part of the Belgian contribution to LifeWatch. Additional funding was provided by the Sovon Dutch Centre for Field Ornithology."
#> 
#> 
#> $dataset$contact
#> list()
#> 
#> $dataset$creator
#> $dataset$creator[[1]]
#> $dataset$creator[[1]]$individualName
#> $dataset$creator[[1]]$individualName$givenName
#> [1] "Geert"
#> 
#> $dataset$creator[[1]]$individualName$surName
#> [1] "Spanoghe"
#> 
#> 
#> 
#> $dataset$creator[[2]]
#> $dataset$creator[[2]]$individualName
#> $dataset$creator[[2]]$individualName$givenName
#> [1] "Peter"
#> 
#> $dataset$creator[[2]]$individualName$surName
#> [1] "Desmet"
#> 
#> 
#> $dataset$creator[[2]]$userId
#> $dataset$creator[[2]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[2]]$userId[[2]]
#> [1] "0000-0002-8442-8025"
#> 
#> 
#> 
#> $dataset$creator[[3]]
#> $dataset$creator[[3]]$individualName
#> $dataset$creator[[3]]$individualName$givenName
#> [1] "Tanja"
#> 
#> $dataset$creator[[3]]$individualName$surName
#> [1] "Milotic"
#> 
#> 
#> $dataset$creator[[3]]$userId
#> $dataset$creator[[3]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[3]]$userId[[2]]
#> [1] "0000-0002-3129-6196"
#> 
#> 
#> 
#> $dataset$creator[[4]]
#> $dataset$creator[[4]]$individualName
#> $dataset$creator[[4]]$individualName$givenName
#> [1] "Gunther"
#> 
#> $dataset$creator[[4]]$individualName$surName
#> [1] "Van Ryckegem"
#> 
#> 
#> $dataset$creator[[4]]$userId
#> $dataset$creator[[4]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[4]]$userId[[2]]
#> [1] "0000-0001-8788-0001"
#> 
#> 
#> 
#> $dataset$creator[[5]]
#> $dataset$creator[[5]]$individualName
#> $dataset$creator[[5]]$individualName$givenName
#> [1] "Joost"
#> 
#> $dataset$creator[[5]]$individualName$surName
#> [1] "Vanoverbeke"
#> 
#> 
#> $dataset$creator[[5]]$userId
#> $dataset$creator[[5]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[5]]$userId[[2]]
#> [1] "0000-0002-3893-9529"
#> 
#> 
#> 
#> $dataset$creator[[6]]
#> $dataset$creator[[6]]$individualName
#> $dataset$creator[[6]]$individualName$givenName
#> [1] "Bruno J."
#> 
#> $dataset$creator[[6]]$individualName$surName
#> [1] "Ens"
#> 
#> 
#> $dataset$creator[[6]]$userId
#> $dataset$creator[[6]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[6]]$userId[[2]]
#> [1] "0000-0002-4659-4807"
#> 
#> 
#> 
#> $dataset$creator[[7]]
#> $dataset$creator[[7]]$individualName
#> $dataset$creator[[7]]$individualName$givenName
#> [1] "Willem"
#> 
#> $dataset$creator[[7]]$individualName$surName
#> [1] "Bouten"
#> 
#> 
#> $dataset$creator[[7]]$userId
#> $dataset$creator[[7]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[7]]$userId[[2]]
#> [1] "0000-0002-5250-8872"
#> 
#> 
#> 
#> 
#> $dataset$metadataProvider
#> list()
#> 
#> $dataset$keywordSet
#> $dataset$keywordSet[[1]]
#> $dataset$keywordSet[[1]]$keywordThesaurus
#> [1] "n/a"
#> 
#> $dataset$keywordSet[[1]]$keyword
#>  [1] "animal movement"  "animal tracking"  "gps tracking"     "accelerometer"   
#>  [5] "altitude"         "temperature"      "biologging"       "birds"           
#>  [9] "LifeWatch"        "UvA-BiTS"         "Movebank"         "frictionlessdata"
#> 
#> 
#> 
#> $dataset$pubDate
#> [1] "2022-01-19"
#> 
#> $dataset$intellectualRights
#> [1] "cc0-1.0"
#> 
#> $dataset$alternateIdentifier
#> [1] "https://doi.org/10.5281/zenodo.5879096"

Created on 2022-05-03 by the reprex package (v2.0.1)
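To make the shape of the `creator` entries in the output above more concrete, the following sketch builds the same nested list structure from mock input in base R. It is not the movepub implementation: `creator_to_eml()` is a hypothetical helper, the input is hand-written to resemble the `creators` array of DataCite JSON (`givenName`, `familyName`, `nameIdentifiers`), and only ORCID identifiers are handled.

```r
# Mock input resembling two entries of a DataCite `creators` array.
creators <- list(
  list(givenName = "Geert", familyName = "Spanoghe", nameIdentifiers = list()),
  list(
    givenName = "Peter", familyName = "Desmet",
    nameIdentifiers = list(list(
      nameIdentifier = "https://orcid.org/0000-0002-8442-8025",
      nameIdentifierScheme = "ORCID"
    ))
  )
)

# Hypothetical helper: convert one DataCite-style creator to an EML-style list.
creator_to_eml <- function(creator) {
  out <- list(
    individualName = list(
      givenName = creator$givenName,
      surName = creator$familyName
    )
  )
  # Keep only ORCID identifiers; other schemes are ignored in this sketch.
  is_orcid <- vapply(
    creator$nameIdentifiers,
    function(id) identical(id$nameIdentifierScheme, "ORCID"),
    logical(1L)
  )
  orcids <- creator$nameIdentifiers[is_orcid]
  if (length(orcids) > 0) {
    # Named `directory` plus an unnamed value, as in the printed output.
    out$userId <- list(
      directory = "http://orcid.org/",
      sub("^https?://orcid\\.org/", "", orcids[[1]]$nameIdentifier)
    )
  }
  out
}

eml_creators <- lapply(creators, creator_to_eml)
str(eml_creators)
```

Creators without an ORCID simply get no `userId` element, which matches the first creator in the printed result.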

peterdesmet commented 2 years ago

The suggested datacite_to_eml() function is now documented at https://inbo.github.io/movepub/reference/datacite_to_eml.html. I think the EML package would be a better home for it.

mbjones commented 2 years ago

As another convenience method to template an EML record from existing metadata, this seems useful to me and I would support its inclusion. @cboettig would you have any objections? If not, maybe @peterdesmet could submit a PR?

cboettig commented 2 years ago

:+1: yeah seems like this would be helpful! PRs welcome!

peterdesmet commented 2 years ago

Cool, I'll see when I have some time for that. The function relies quite a lot on the purrr package. Is it fine if this is added as a dependency?

mbjones commented 2 years ago

For the packages I maintain, I try to keep dependencies to a minimum, especially for large packages or packages that entrain a complex ecosystem, as they usually cause maintenance headaches down the road. We spend a fair number of cycles just treading water on package dependencies, trying to keep packages on CRAN. Backwards-incompatible changes, or a package being supplanted by a "newer" version (as is common for RStudio packages), have caused a lot of churn for us. That said, if you really need it, then that is what they are there for. But keep in mind that each dependency is a potential future upgrade problem.

peterdesmet commented 2 years ago

I share your sentiment. I'll see if I can replace my three uses of purrr::map_chr() and two uses of purrr::map() with a base R alternative, if it remains readable (code at https://github.com/inbo/movepub/blob/main/R/datacite_to_eml.R#L26-L49).

cboettig commented 2 years ago

map is essentially lapply, and map_chr is essentially vapply with a template type. e.g.

keywords <- purrr::map_chr(metadata$subjects, "subject")
## is the same as
keywords <- vapply(metadata$subjects, `[[`, character(1L), "subject")

(Yes, [[ is the familiar subsetting function; recall that in R everything is a function.) (Not tested.)
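The equivalence described above can be tried on mock data without purrr installed. This is only a sketch: the `metadata` list is a hand-written stand-in for the `subjects` part of DataCite metadata, and the `incomplete` list is a made-up example showing that vapply() fails loudly when an element lacks the requested field.

```r
# Mock stand-in for the `subjects` part of DataCite metadata.
metadata <- list(
  subjects = list(
    list(subject = "animal movement"),
    list(subject = "animal tracking"),
    list(subject = "gps tracking")
  )
)

# Base R counterpart of purrr::map_chr(metadata$subjects, "subject"):
# `[[` is called on each element with "subject" as the index, and
# character(1L) demands exactly one string per element.
keywords <- vapply(metadata$subjects, `[[`, character(1L), "subject")

# Base R counterpart of purrr::map(): lapply() always returns a list.
keyword_list <- lapply(metadata$subjects, `[[`, "subject")

# vapply() errors when an element lacks the field, because `[[` then
# returns NULL rather than a length-one character vector.
incomplete <- list(list(subject = "birds"), list(scheme = "n/a"))
result <- tryCatch(
  vapply(incomplete, `[[`, character(1L), "subject"),
  error = function(e) "error"
)
```

The strict template is the main reason to prefer vapply() over sapply() here: the result type does not silently change when the input is empty or malformed.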

That said, purrr is a light dependency compared to some things EML already pulls in....