information-artifact-ontology / ontology-metadata

OBO Metadata Ontology
Creative Commons Zero v1.0 Universal

Unify the representation of dois? #59

Open matentzn opened 3 years ago

matentzn commented 3 years ago

As part of a big OBO ID cleanup, I noticed the following variation of DOIs across OBO Foundry ontologies (all resolve):

According to Wikipedia:

Another approach, which avoids typing or cutting-and-pasting into a resolver is to include the DOI in a document as a URL which uses the resolver as an HTTP proxy, such as https://doi.org/ (preferred) or http://dx.doi.org/, both of which support HTTPS.

https://doi.org would be preferred. Shall we make it an official recommendation to use that one, to make integration of provenance data easier?
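
To make the recommendation concrete, here is a minimal sketch of normalizing DOI spellings to the https://doi.org form. The spellings it accepts (`doi:`/`DOI:` prefixes and the two resolver URLs) are assumptions about what occurs in practice, and `normalize_doi` is an illustrative name, not part of any existing OBO tool.

```python
import re

# Assumed spellings: "doi:" / "DOI:" prefixes and the doi.org / dx.doi.org
# resolver URLs over http or https.
_DOI_LEAD = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value: str) -> str:
    """Return the https://doi.org/ form of a DOI written in any assumed spelling."""
    bare = _DOI_LEAD.sub("", value.strip())
    # All DOI names start with the directory indicator "10."
    if not bare.startswith("10."):
        raise ValueError(f"does not look like a DOI: {value!r}")
    return "https://doi.org/" + bare
```

For example, `normalize_doi("http://dx.doi.org/10.4103/0971-9261.109351")` and `normalize_doi("DOI:10.4103/0971-9261.109351")` both return `https://doi.org/10.4103/0971-9261.109351`.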

dosumis commented 2 years ago

@alanruttenberg wrote:

Somewhere, I'm sure, there's a file which says what those prefixes mean, but I don't know what it is because it isn't cited in the OWL file.

I strongly agree that this should be fixed. We badly need to move away from the bespoke YAML file into which this info has been curated for years. Whatever form this takes, it should be cited in the OWL file.

I think the key question here is whether we can keep a system that allows dual usage of DOI (and many other) xrefs as a source of IDs for database/API lookups and for resolvable URLs.

The dual use of identifiers for database/API lookups and as components of resolvable URLs is critical for almost everything we build. Publication identifiers are a good example. For any resource I'm building, I can't know in advance what identifiers for publications I'm going to get - DOI, PMID, resource specific ID (e.g. FlyBase:FBrf), and if I want to provide biblio data, associated metadata or curation, or track updates, I need an ID I can use for a database or API lookup. In each case, I also want to be able to roll a resolvable URL.

Let's look at what would happen if we used URLs for DOIs. Say I have:

https://doi.org/10.4103/0971-9261.109351

And I want to pull out the DOI to query EuroPMC: https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=10.4103%2F0971-9261.109351&resultType=lite&cursorMark=*&pageSize=25&fromSearchPost=false

I now need specific code to separate out the identifier component for DOIs. I can't even reliably split on / or # to get a short form - something which would work in many other cases. I also need code to recognise that the URL is a DOI and so must be treated differently from other xref values (PMIDs, FlyBase:FBrfs). If some of these values remain as strings, I need code that recognises http: and https: as special prefixes that map to protocols rather than database IDs / IRIs.

OTOH - doi:10.4103/0971-9261.109351 + a reliable prefix map lets me cover both use cases painlessly. So, I think the question is - can we achieve this dual usage with some more standard RDF approach using CURIEs/prefixes rather than strings?
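
A minimal sketch of that dual usage, assuming a small hand-written prefix map. `PREFIX_MAP`, the PubMed URL pattern in it, and all three function names are illustrative assumptions, not taken from any OBO tool or registry.

```python
from urllib.parse import quote

# Hypothetical prefix map; a real one would be curated and shared.
PREFIX_MAP = {
    "doi": "https://doi.org/",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/",  # assumed URL pattern
}


def split_xref(xref: str) -> tuple[str, str]:
    """Split 'prefix:local_id' on the first colon only."""
    prefix, sep, local_id = xref.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a prefixed identifier: {xref!r}")
    return prefix.lower(), local_id


def to_url(xref: str) -> str:
    """Use case 1: roll a resolvable URL from the prefix map."""
    prefix, local_id = split_xref(xref)
    return PREFIX_MAP[prefix] + local_id


def to_query_value(xref: str) -> str:
    """Use case 2: the bare identifier, percent-encoded for an API query string."""
    _, local_id = split_xref(xref)
    return quote(local_id, safe="")
```

With this, `to_url("doi:10.4103/0971-9261.109351")` gives the https://doi.org URL, and `to_query_value` on the same input gives `10.4103%2F0971-9261.109351`, the value used in the EuroPMC query above. Note that feeding a full URL to `split_xref` would yield `https` as the "prefix", which is exactly the special-casing problem described above.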

In other words, can we make something like this work:

@matentzn wrote:

This would be best:

@import prefixes from https://github.com/biopragmatics/bioregistry/blob/main/exports/contexts/bioregistry.context.jsonld
@base <http://www.w3.org/2002/07/owl#> .

[ rdf:type owl:Ontology ] .

BFO:0000050 rdf:type owl:ObjectProperty .
BFO:0000050 oboInOwl:hasDbXref FB:001 .

Haha but unfortunately I don't think this can happen.

dosumis commented 2 years ago

More generally - it is essential to avoid solutions that break our ability to interface with the non-semantics world.

matentzn commented 2 years ago

I think we all mostly agree on what would be best in general (using doi:123 with a prefix map), but this thread mixes up a lot of issues.

Let's play a bit with my suggestion above to restrict our attention to DOIs for now, and ship duplicated content:

hasDbXref "DOI:10.4103/0971-9261.109351"^^xsd:string
skos:related <https://doi.org/10.4103/0971-9261.109351>

As a second step we need to curate our own prescriptive prefix map that we ship around, and see what we can do about "sharing" it - right now I am not sure that is possible (but let's make a new ticket for that, this one is too long).

alanruttenberg commented 2 years ago

There's no such thing as a CURIE string. There's a CURIE, which this is not. Let's call this something else.

matentzn commented 2 years ago

Does making a distinction between CURIE and "Compact Unique Database Identifier" really help all that much in this discussion? This is the nature of this unnamed thing:

  1. It's a string that is split into a prefix and an identifier part
  2. The prefix identifies a resource (such as a database, e.g. FlyBase)
  3. The identifier corresponds to an identifier used by that source (and is unique in the context of that source)

What would you call it? CURDI? Compact Unique Reference to a Database Identifier?

alanruttenberg commented 2 years ago

The word CURIE is defined by a standard. We aren't using it in the way it is defined. It may not obviously help this discussion, but it's bad practice and diminishes respect for standards. It isn't unique. It's unambiguous as the value of that property if enforced. With luck it's unambiguous in biology papers. Outside biomedicine all bets are off.

The sad thing is that there's no reason for it NOT to be unique. Make it a PURL (purl.obolibrary.org/db/XXX:YYYY) and it is. Manage change in the same way we manage our term IRIs. Abbreviate it as a CURIE in user interfaces. Done. Then we're really using unique ids and not mucking around with ad-hoc mechanisms for which there are perfectly well-defined standards.

matentzn commented 2 years ago

I hear you @alanruttenberg, let's call them "database cross-references" moving forward. I am trying to improve the situation with these by:

  1. Stopping people from referring to other ontologies, in particular OBO ontologies, as "database cross-references"
  2. Getting the community to at least document a prefix map for database cross-references so I have any hope of understanding what they mean.

Hope that makes sense. (This is also getting further and further off the topic of this issue.)

alanruttenberg commented 2 years ago

Don't get me wrong, it's great that you are aiming to improve this. I'm looking at this as an opportunity to see how close to "right" we can get. What's wrong with the PURL suggestion? I don't think I've suggested it in the previous thread. It seems to satisfy the need to easily work with them and redirect them, while also giving us a proper IRI. From a software point of view, it's just a longer prefix.

bpeters42 commented 2 years ago

Let me reiterate to see if I understand that recommendation, Alan. We would coin PURLs for e.g. 'DOI' that make e.g. http://purl.obolibrary.org/db/DOI:10.1093/database/baaa016 resolve to https://doi.org/10.1093/database/baaa016, and we can then deal centrally with changes in where exactly they should redirect to. And for user-facing systems, we present that as DOI:10.1093/database/baaa016.

So this would essentially serve as the 'prefix map' that Nico was advocating for.
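
A rough sketch of how such a PURL layer could behave from a client's point of view. The `/db/` namespace is only the proposal made in this thread, not an existing service, and the `REDIRECTS` table and function names are hypothetical stand-ins for a centrally managed PURL configuration.

```python
# Proposed in this thread; not an existing PURL namespace.
DB_PURL_BASE = "http://purl.obolibrary.org/db/"

# Hypothetical central redirect table, one entry per database prefix.
REDIRECTS = {
    "DOI": "https://doi.org/",
}


def purl_to_display(purl: str) -> str:
    """Abbreviate a db PURL for user interfaces, e.g. 'DOI:10.1093/...'."""
    if not purl.startswith(DB_PURL_BASE):
        raise ValueError(f"not a db PURL: {purl!r}")
    return purl[len(DB_PURL_BASE):]


def resolve(purl: str) -> str:
    """Return the URL the PURL would redirect to under the table above."""
    prefix, _, local_id = purl_to_display(purl).partition(":")
    return REDIRECTS[prefix] + local_id
```

Changing where DOIs resolve would then mean editing one entry in the redirect table, while the identifiers stored in ontology files stay the same.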

I am a little unclear whether this means we would use db/DOI:XXX vs. obo/GO:XXX to distinguish ontology vs. 'database' sets of prefixes. Also, I am wondering how we will decide where identifiers such as PMID:XXX should resolve to: the HTML website? The record retrieved via the NCBI eutils (which tragically is currently out of sync with the PubMed website)?

Also, we would greatly expand the scope of the PURL system; James always reminds me that these things come at a cost, even for a simple server doing nothing but redirects. Here the costs could very well be that database maintainers who are not part of the OBO community can be trusted even less than the OBO community itself, and we may end up with plenty of dead redirects. But all of those are issues we will have with any system?



matentzn commented 2 years ago

In the hope I am not parroting myself too much, I would recommend shelving this discussion for the time being. There are too many discussions happening here, too many unspoken issues about existing tools and assumptions. It is premature to coordinate anything at the moment. For my work (not OBO), I will define a simple prefix map just for the interpretation of cross-reference prefixes and link it from the ontology using an AP (#93).

alanruttenberg commented 2 years ago

With apologies to @matentzn I'll answer @bpeters42 .

For DOIs specifically, I have suggested we use their own documented server address. They are a relatively special case as it is likely they will be long-lived. But, we could go through a PURL if we had doubts about longevity.

For databases, that is the idea. The /db vs /obo split separates the ontology PURLs from the DB PURLs; that is just to be protective and avoid unintended consequences.

I assume that the PURLs would redirect to some provider. @dosumis suggested that these need to be remapped sometimes, and this would be done via PURL config. For resources that do not have a web-accessible page per record, all the PURLs would be redirected to a page that explains what the ID means and how to go about finding information about it, sometimes pointing to a website, other times to papers. If we have prefixes for which a user has no way to get information about records at all, I'd suggest they don't belong in our files - they are just noise.

Presentation is however a tool wants to present it. It can do something ad-hoc, or always present as or have label annotations. Presentation is an orthogonal issue to identification.

As far as flaky database providers go, monitoring can be automated, with automatic failover to the descriptive page I suggest above. But I presume, from the conversation, that people are using these things and so there is incentive to keep the PURL config up to date. The biggest issue with flaky providers is, to me, less about a particular outage and more about whether a database provider is known to be flaky in general. If that's the case, I question the utility of the identifiers in the first place. If a user has a problem finding info about a DB identifier, that's a problem. But, as you say, this will be a problem with any system. At least with the PURLs we can redirect to an informative page and stick with a tried and true way of identifying things instead of inventing a new way, e.g. having a side file that has to be downloaded to interpret prefixes.
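
The failover idea can be sketched in a few lines. The health check is passed in as a function, so nothing is assumed here about how monitoring would actually be done; `choose_target` and the example.org URLs in the usage note are purely illustrative.

```python
from typing import Callable


def choose_target(
    provider_url: str,
    info_page_url: str,
    is_up: Callable[[str], bool],
) -> str:
    """Pick the redirect target: the provider if healthy, else the descriptive page."""
    return provider_url if is_up(provider_url) else info_page_url
```

A monitoring job could call this periodically with a real availability check and rewrite the PURL config with the result, e.g. `choose_target("https://db.example.org/record/1", "https://info.example.org/db", check)` where `check` is whatever probe the maintainers trust.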

Yes, there is a cost to any infrastructure. The cost needs to be weighed against utility. To my mind, having every user interested in these xrefs know about and then mess with an ad-hoc mapping of some sort is much more costly in aggregate than running the server, viewed in the context of the overall effort you are all putting into automation.