sfb1451 / tabby-utils

External queries for DOIs #2

Open mslw opened 1 year ago

mslw commented 1 year ago

Note: this is good to have eventually, but not a priority.

By requiring some information to be provided in a standardized way in the tabby files submitted by our collaborators, we are able to resolve additional details from that information by querying a respective lookup service. For example, "NCBITaxon_9238" can be looked up in OLS to get its name (Pygoscelis adeliae) and common name (Adelie penguin) that would go into the catalog page.
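As a rough illustration of what "standardized" buys us, here is a minimal sketch of turning such a term into the identifiers a lookup service typically accepts. The function name `expand_short_form` is hypothetical and is not taken from queries.py; the IRI follows the usual OBO PURL convention, and which of these forms a given OLS endpoint expects is an assumption to check against the OLS documentation.

```python
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def expand_short_form(short_form: str) -> dict:
    """Split a term like 'NCBITaxon_9238' into commonly used identifier forms.

    Hypothetical helper: it only reshapes the string and performs no network
    request, so the actual lookup call is left to the caller.
    """
    prefix, sep, local_id = short_form.partition("_")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a PREFIX_ID style term: {short_form!r}")
    return {
        "ontology": prefix.lower(),            # e.g. 'ncbitaxon'
        "obo_id": f"{prefix}:{local_id}",      # e.g. 'NCBITaxon:9238'
        "iri": OBO_PURL_BASE + short_form,     # OBO PURL convention
    }
```

For "NCBITaxon_9238" this gives the OBO id "NCBITaxon:9238" and the matching PURL, either of which could then be passed on to the lookup.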

The tools for that are in queries.py.

One particular case is a publication DOI (note: queries.py contains some related but unused code from my initial attempt). There are two major problems with that, both resolvable:

Query - for what (and when)

The catalog wants publications to be broken up into title, authors, year, publication outlet, and DOI. For our tabby spec, we asked for a DOI and a citation text (the DOI is optional, but when a DOI is given the citation text becomes optional). The citation can be free-form and cannot easily be broken down into parts, so we currently dump it all into the catalog's "title" field for a reasonable-looking presentation. Certainly, we should do a DOI lookup if the citation text is not given (this didn't happen in the submitted files). Two questions remain, though. Should we still do a DOI lookup if a citation is given (to get the author list etc.)? And should we do a title lookup (to get the DOI, authors etc.) if a DOI is not given (been there, done that when scraping the SFB website for publications)?
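To make the cases concrete, here is a minimal sketch of the decision as a pure function. The name `plan_lookup` and both keyword flags are hypothetical; the flags stand in for the two open questions and default to "no", so only the uncontroversial case (DOI given, no citation) triggers a lookup.

```python
from typing import Optional


def plan_lookup(
    doi: Optional[str],
    citation: Optional[str],
    *,
    enrich_when_cited: bool = False,  # open question 1: DOI lookup despite citation
    search_by_title: bool = False,    # open question 2: title lookup without DOI
) -> str:
    """Return which lookup to perform: 'doi', 'title', or 'none'."""
    has_doi = bool(doi and doi.strip())
    has_citation = bool(citation and citation.strip())

    if has_doi and not has_citation:
        # Nothing else to show in the catalog, so a DOI lookup is needed.
        return "doi"
    if has_doi and has_citation:
        # Citation text can be displayed as is; lookup only adds structure.
        return "doi" if enrich_when_cited else "none"
    if has_citation and search_by_title:
        # No DOI: try to find the record from the free-form citation.
        return "title"
    return "none"
```

With the defaults, `plan_lookup("10.1371/journal.pone.0090081", None)` returns "doi", while a record that has only a citation text returns "none" until `search_by_title` is switched on.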

Which API for DOI

I thought that for resolving DOIs we could rely on Crossref's DOI-to-metadata query, but it apparently only resolves Crossref-issued DOIs (which covers most of the publications I know of, but not e.g. Zenodo). The issuing agency can be checked with Crossref:

❱ curl https://api.crossref.org/works/10.1371/journal.pone.0090081/agency
{"status":"ok", ..., "message":{"DOI":"10.1371\/journal.pone.0090081","agency":{"id":"crossref","label":"Crossref"}}}
❱ curl https://api.crossref.org/works/10.14454/FXWS-0523/agency
{"status":"ok", ... ,"message":{"DOI":"10.14454\/fxws-0523","agency":{"id":"datacite","label":"DataCite"}}}       

but the actual metadata will only be returned for the former:

❱ curl https://api.crossref.org/works/10.1371/journal.pone.0090081
{"status":"ok", ...}
❱ curl https://api.crossref.org/works/10.14454/FXWS-0523
Resource not found.

DataCite also has its own API, but its response format may differ somewhat from Crossref's, which would add processing work.
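One way to combine the two steps is to ask Crossref for the agency first and then pick the metadata endpoint accordingly. The sketch below only builds URLs and parses the agency response shown above; it makes no requests itself. The function names are hypothetical, and the DataCite endpoint (`https://api.datacite.org/dois/<doi>`) is my assumption about where the equivalent record lives, to be checked against the DataCite API documentation.

```python
import json
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"
DATACITE_DOIS = "https://api.datacite.org/dois/"  # assumed DataCite endpoint


def agency_url(doi: str) -> str:
    """URL of the Crossref agency check for a DOI (as in the curl calls above)."""
    return CROSSREF_WORKS + quote(doi, safe="/") + "/agency"


def agency_id(response_text: str) -> str:
    """Extract the agency id ('crossref', 'datacite', ...) from the JSON reply."""
    return json.loads(response_text)["message"]["agency"]["id"]


def metadata_url(doi: str, agency: str) -> str:
    """Pick the metadata endpoint that matches the issuing agency."""
    if agency == "crossref":
        return CROSSREF_WORKS + quote(doi, safe="/")
    if agency == "datacite":
        return DATACITE_DOIS + quote(doi, safe="/")
    # Other agencies exist; they are deliberately not handled in this sketch.
    raise ValueError(f"no metadata endpoint configured for agency {agency!r}")
```

The responses from the two endpoints would still need separate processing to extract title, authors, year, and outlet for the catalog.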