ropensci / unconf18

http://unconf18.ropensci.org/

Taking some pain out of finding/linking to unique IDs? #52

Open magpiedin opened 6 years ago

magpiedin commented 6 years ago

A wish/need/dread for data standards came up in issue 41, and brought a few ideas to mind:

Any thoughts on a helper/gentle-reminder app or lesson for suggesting linkable values contained within datasets or papers -- for instance, by indexing what types of fields/records exist in a given dataset, and suggesting relevant packages from CRAN or ropensci that could retrieve identifiers?

I realize I'm glossing over some major obstacles to actually linking data (e.g., cleaning free text values & resolving entities is enough of a mountain; plan-ahead is better than fix-it-in-post when possible), so I'm all ears if this could use more [or a different] focus. Or if something magical already exists along these lines. ...Or if there's a good/sustainable alternative to developing tools/packages that rely on multiple API wrappers?...

noamross commented 6 years ago

This is a hard problem! I have an as-yet unfunded proposal to develop a system that tries to use text-recognition ML to identify fields in a dataset and link them to appropriate ontologies -- for instance, recognizing which columns are species, which are publications, and such. I believe that @amoeba has worked on something similar for DataONE.

Ease of use is definitely one of the big challenges. I could see a function like find_ids() that you could run on a document; it would return items that might have IDs (author names, species names, publications) using a text-recognition model (or maybe pre-trained services like monkeylearn), perhaps with boilerplate code for searching them via those packages?
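
A very rough sketch of what that interface could look like (find_ids(), the regex "recognition" step, and the suggested packages are all placeholders standing in for a real model, not an existing implementation):

# hypothetical sketch of find_ids(): scan a plain-text document for strings
# that look like known identifier types and suggest a package to resolve each
find_ids <- function(path) {
  text <- paste(readLines(path, warn = FALSE), collapse = " ")
  # stand-in "recognition" step: regex matches instead of a trained model
  found <- list(
    doi   = regmatches(text, gregexpr("10\\.\\d{4,9}/\\S+", text))[[1]],
    orcid = regmatches(text, gregexpr("\\d{4}-\\d{4}-\\d{4}-\\d{3}[0-9X]", text))[[1]]
  )
  found <- Filter(length, found)
  # packages that could look up / resolve each identifier type
  pkgs <- c(doi = "rcrossref", orcid = "rorcid")
  lapply(names(found), function(type) {
    list(type = type, matches = found[[type]], suggested_pkg = pkgs[[type]])
  })
}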

Running on CSVs or similar datasets would probably be a bit harder because the off-the-shelf tools aren't as developed. DataONE has a set of curated, annotated datasets for training models and was working on this, but I'm not sure of the status of that.

cboettig commented 6 years ago

Great suggestion. Advice and simple, performant tools just to find the identifiers would be really cool -- too often, tools assume the user already has a pretty good grasp of the landscape and knows what they want.

An important piece of this puzzle, I think, is tools that can deliver immediate value to the researcher implementing them, or at least a clear value proposition for why to use identifiers. The lesson idea sounds like an interesting way to go; it could illustrate both how to do a task like adding taxon IDs and demonstrate how that makes your life easier (say, in merging two datasets with differing taxonomy)?
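
For example, the "add taxon IDs, then merge" step of such a lesson might look roughly like this (a sketch assuming the taxize package and its get_gbifid() helper; the data and column names are made up):

library(taxize)  # assumes taxize is installed and an internet connection

# two toy datasets that refer to taxa by free-text names
traits <- data.frame(name = c("Puma concolor", "Ursus americanus"),
                     mass_kg = c(62, 120), stringsAsFactors = FALSE)
counts <- data.frame(name = c("Puma concolor", "Ursus americanus"),
                     n = c(3, 7), stringsAsFactors = FALSE)

# resolve the free-text names to GBIF taxon keys once, up front
traits$taxon_id <- as.character(get_gbifid(traits$name, ask = FALSE))
counts$taxon_id <- as.character(get_gbifid(counts$name, ask = FALSE))

# later merges key on the stable identifier rather than on name spelling
merge(traits, counts[, c("taxon_id", "n")], by = "taxon_id")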

cboettig commented 6 years ago

(On darwin core, @sckott also has WIP https://github.com/ropenscilabs/taxadc)

sckott commented 6 years ago

thanks @cboettig - i was just reading this issue.

wrt taxadc, that's just for taxonomy -- to try to make it easy for users to convert https://github.com/ropensci/taxa classes to DWC-compliant form, and then serialize those objects to XML/JSON/JSON-LD/etc.

I agree some kind of tool that scans text for entities that might have unique IDs would be great. On the taxonomy front, there is, I think, a Global Names project tool that can identify taxonomic names in text. But I don't know of tools for other entities that may have identifiers.
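
For the taxonomy piece, something along these lines can already be done through taxize's wrapper around the Global Names Recognition and Discovery (GNRD) service -- an untested sketch, with the return structure from memory:

library(taxize)

txt <- "We sampled Quercus alba and Peromyscus leucopus at three sites."
# scrapenames() wraps the Global Names Recognition and Discovery (GNRD) service
out <- scrapenames(text = txt)
out$data  # the taxonomic names found in the text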

We have at least some tools for identifiers for taxonomy and publications. Are there any major missing things that have identifiers that we don't yet have tools for?

magpiedin commented 6 years ago

Thanks for the brainstorming! I'll keep my fingers crossed for the ML proposals, and I'm all for something along the lines of a "how to add/merge taxon IDs" lesson if it can help elucidate some steps in the meantime (+ complement other things in the works here/locally/globally).

@sckott -- two things with IDs but no tools of their own (as far as I know?) are:

1. Institutional identifiers -- GRBio/GRSciColl has a registry of 'Cool' HTTP URIs
2. Multimedia identifiers -- "dcterms:identifier" in a few different standards, including Audubon Core

That said, both of those might be a little shaky to develop anything around currently...

...I think I'm forgetting something, but I'll stop there...!

sckott commented 6 years ago

Related to institutional identifiers, is the Organization Identifier Working Group relevant? https://orcid.org/content/organization-identifier-working-group and https://www.crossref.org/blog/organization-identifier-working-group-update/

For Institutions, GRBio ... supposedly getting a makeover/takeover from GBIF in the near future...

Interesting -- I would like to learn more about this.

Right, I've heard about IIIF; it seems great.

the current GBIF API can return media type, but I don't think it can return media identifiers

here's some GBIF media data, what would the media identifiers be?

curl 'http://api.gbif.org/v1/occurrence/search?taxonKey=1' | jq '.results[].media'
[
  {
    "type": "StillImage",
    "format": "image/jpeg",
    "identifier": "https://static.inaturalist.org/photos/12648072/original.jpg?1514760468",
    "references": "https://www.inaturalist.org/photos/12648072",
    "created": "2017-12-31T22:46:43.000+0000",
    "creator": "John Flower",
    "publisher": "iNaturalist",
    "license": "http://creativecommons.org/publicdomain/zero/1.0/",
    "rightsHolder": "John Flower"
  },
  {
    "type": "StillImage",
    "format": "image/jpeg",
    "identifier": "https://static.inaturalist.org/photos/12648077/original.jpg?1514760475",
    "references": "https://www.inaturalist.org/photos/12648077",
    "created": "2017-12-31T22:46:43.000+0000",
    "creator": "John Flower",
    "publisher": "iNaturalist",
    "license": "http://creativecommons.org/publicdomain/zero/1.0/",
    "rightsHolder": "John Flower"
  }
]
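
In R, roughly the same query could be made with rgbif (a sketch; the exact shape of the returned media element is from memory):

library(rgbif)

# roughly the same query as the curl call above
res <- occ_search(taxonKey = 1, limit = 20)
res$media  # per-occurrence media records (type, format, identifier, license, ...)
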
cboettig commented 6 years ago

Media identifier... i.e., a URI corresponding to the format (MIME type)? Maybe the Wikidata identifier is a reasonable choice? https://www.wikidata.org/wiki/Q2195

Funder IDs from FundRef are another obvious one, e.g. https://github.com/ropensci/codemetar/blob/master/inst/extdata/funderNames.csv
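
Those can already be searched from R via rcrossref (a sketch; the limit argument and the returned column names are from memory):

library(rcrossref)

# search Crossref's funder registry (FundRef) for matching funder records
res <- cr_funders(query = "National Science Foundation", limit = 5)
res$data[, c("id", "name")]  # FundRef ids plus canonical names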

magpiedin commented 6 years ago

Nice! Looks like those are indeed multimedia identifiers in the GBIF data -- staring us in the face at the "identifier" key :)

"identifier": "https://static.inaturalist.org/photos/12648077/original.jpg?1514760475",

(And my understanding is that those media identifiers are supposed to be unique and follow a URI structure, but aren't always resolvable URLs -- at least on GBIF, and generally in the Audubon Core dcterms:identifier field.)

I hadn't been thinking of media format type, but cheers to Wikidata as a reasonable choice, especially as a starting point for less common media formats or things without a main/direct repository to pull from. ("JHOVE" & "PRONOM" might relate here, but I'm out of my depth with those)

Good thinking on FundRef, too. It sounds like ORCID organization IDs might overlap with FundRef and GRBio/SciColl, but the record data that could be pulled from each might be useful in different situations? (If that's not inviting disambiguation problems -- and I don't think it would be?)

amoeba commented 6 years ago

I believe that @amoeba has worked on something similar for DataONE.

More or less, yeah! I've been generally working in this area for a few years along with many other collaborators. I feel like there are two issues that come up a lot:

  1. Scientists don't know/want to make use of identifiers in their work
  2. Scientists who (magically) do want to use identifiers might not be able to find an appropriate one / can't easily expand the existing identifier space for their needs

I think (1) is a much larger attack surface at this point and I'd love to brainstorm ideas. This feels somewhat related to https://github.com/ropensci/unconf18/issues/64 in that part of the research compendia review process might involve annotating code and data with appropriate identifiers (Hey, you might want to put your ORCID over here / annotate this column of this data file with this identifier).

sckott commented 6 years ago

Notes from a chat w/ @magpiedin and @sckott:

new data

a possible approach: create plugins/adapters for each type of data

#' @param x (character) column
inspector_images <- function(x) {
  # only act on columns that look like image file names/paths
  if (!any(grepl("\\.jpg|\\.png", x))) return(NULL)
  # look for an identifier for each image:
  # if found, create a URI; if not found, give NA_character_
  rep(NA_character_, length(x))  # stub: real lookup logic would go here
}

then each plugin (like above) could be run over each data.frame input

plugins <- list(inspector_images, inspector_taxa)  # a list of plugins (inspector_taxa would be written along the same lines)

#' @export
#' @param x a data.frame
#' @param plugins a list of inspector functions, each run over every column
inspect_df <- function(x, plugins) {
  # iterate over the data.frame: run each plugin on each column
  lapply(x, function(col) lapply(plugins, function(p) p(col)))
}
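
A quick hypothetical usage example, with a toy data.frame and only the image inspector defined:

# hypothetical usage: run the inspectors over a toy data.frame
df <- data.frame(img = c("a.jpg", "b.png"),
                 sp = c("Puma concolor", "Ursus americanus"),
                 stringsAsFactors = FALSE)
inspect_df(df, plugins = list(inspector_images))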