information-artifact-ontology / ontology-metadata

OBO Metadata Ontology
Creative Commons Zero v1.0 Universal

Unify the representation of dois? #59

Open matentzn opened 3 years ago

matentzn commented 3 years ago

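As part of a big OBO ID cleanup, I noticed the following variation of DOIs across OBO Foundry ontologies (all resolve):

* https://dx.doi.org/10.4103/0971-9261.109351
* http://www.doi.org/10.4103/0971-9261.109351
* http://dx.doi.org/10.4103/0971-9261.109351
* http://doi.org/10.4103/0971-9261.109351
* https://doi.org/10.4103/0971-9261.109351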

According to wikipedia:

Another approach, which avoids typing or cutting-and-pasting into a resolver is to include the DOI in a document as a URL which uses the resolver as an HTTP proxy, such as https://doi.org/ (preferred)[28] or http://dx.doi.org/, both of which support HTTPS

https://doi.org would be preferred; Shall we make it an official recommendation to use that one to make integration of provenance data easier?
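If the recommendation is adopted, existing values could be brought into line mechanically. The following is only a minimal sketch, assuming that the proxy variants reported in this thread (`doi.org`, `dx.doi.org`, `www.doi.org`, over HTTP or HTTPS) are the only ones in use; the function name `normalize_doi_url` is hypothetical and not part of any OBO tool.

```python
import re

# Proxy prefixes reported in this thread; matching is case-insensitive.
# Assumption: no other proxy hosts occur in the data.
_PROXY = re.compile(r"^https?://(?:www\.|dx\.)?doi\.org/", re.IGNORECASE)

CANONICAL_PREFIX = "https://doi.org/"


def normalize_doi_url(url: str) -> str:
    """Rewrite a known DOI proxy URL to the https://doi.org/ form.

    Strings that do not start with a recognised proxy prefix are
    returned unchanged.
    """
    match = _PROXY.match(url)
    if match is None:
        return url
    return CANONICAL_PREFIX + url[match.end():]


VARIANTS = [
    "https://dx.doi.org/10.4103/0971-9261.109351",
    "http://www.doi.org/10.4103/0971-9261.109351",
    "http://dx.doi.org/10.4103/0971-9261.109351",
    "http://doi.org/10.4103/0971-9261.109351",
    "https://doi.org/10.4103/0971-9261.109351",
]

# All five variants collapse to a single string.
print({normalize_doi_url(v) for v in VARIANTS})
```

After normalization the five strings compare equal, which is the property that makes provenance data easier to integrate.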

bpeters42 commented 3 years ago

I like that a lot. It actually also very nicely separates the central registry (doi.org) from the specific identifier.

On Wed, Dec 16, 2020 at 6:56 AM Nico Matentzoglu notifications@github.com wrote:

As part of a big OBO ID cleanup https://github.com/orgs/OBOFoundry/project/5, I noticed the following variation of dois across OBO foundry ontologies (all resolve):

* https://dx.doi.org/10.4103/0971-9261.109351
* http://www.doi.org/10.4103/0971-9261.109351
* http://dx.doi.org/10.4103/0971-9261.109351
* http://doi.org/10.4103/0971-9261.109351
* https://doi.org/10.4103/0971-9261.109351

According to wikipedia https://en.wikipedia.org/wiki/Digital_object_identifier:

Another approach, which avoids typing or cutting-and-pasting into a resolver is to include the DOI in a document as a URL which uses the resolver as an HTTP proxy, such as https://doi.org/ (preferred)[28] or http://dx.doi.org/, both of which support HTTPS

https://doi.org would be preferred; Shall we make it an official recommendation to use that one to make integration of provenance data easier?


hlapp commented 3 years ago

I think the only two viable contenders are to follow the preferred representation given by the DOI federation, or to use the DOI "naked", i.e., without the HTTP/HTTPS proxy, possibly using a doi: prefix.

Unfortunately, neither option is perfect. The first would follow DOI recommendation and is naturally resolvable (LD compliant out of the box). However, the DOI recommendation has changed over time (no proxy, then dx.doi.org, then http://doi.org, then https), making one wonder how it might be changed again, and leading to multiple incarnations of the same identifier in representations that are not equal as strings. The second option is immune to future representation recommendation changes, but possibly invents a protocol prefix (if using doi:) and whether using a prefix or not requires special knowledge on the client side for identifier resolution, which is an LD no-no.

So on balance, I lean fairly strongly to option one. Especially as it seems questionable how the representation recommendation might be changed still, given that the https://doi.org proxy prefix seems as minimal as one could reasonably get. Also, LD compliance ranks high for me, and in fact it has (more recently at least) for the DOI federation, too.

cmungall commented 3 years ago

Note another solution is to use identifiers.org or nt2.net https://registry.identifiers.org/registry/doi

But on balance I agree with Hilmar's analysis

note also that many ontologies include DOIs as annotation axiom values in string CURIE form. GO uses DOI as a prefix (just as we used "PMID:" etc). But I think this should be treated as an orthogonal concern.

graybeal commented 3 years ago

nt2.net doesn't resolve…

hlapp commented 3 years ago

> nt2.net doesn't resolve…

Typo. The correct one is n2t.net.

paolaroncaglia commented 2 years ago

@bpeters42 @hlapp @cmungall @graybeal Hi, reviving this thread please as I noticed that prefixed DOIs, i.e. DOI:xxx and doi:xxx, do not currently resolve in Protege. Upon discussion with @gouttegd and @matentzn , we'd suggest that if anyone wants to make an argument for these, we'd need to open a ticket in the Protege tracker very quickly please (before the next release is out). Thank you! Paola Roncaglia (ontology developer for Uberon, CL and EFO)

gouttegd commented 2 years ago

I’d vote for doi:xx.yyyy/zzzzzz. This is basically the recommendation from the DOI handbook

And if we rely on our tools (Protégé and the like) to automatically use a resolution service (such as dx.doi.org), I’d even like that resolution service to be configurable (e.g. having an option “Name of the server for the resolution of DOI (default: dx.doi.org)”).

That being said, if we do choose to bake the resolution service in the identifiers, then we should at least always use the same form so that identifiers can always be compared as opaque strings, in which case I’d vote for https://doi.org/.

(Not http://doi.org/ or http://dx.doi.org/. TLS everywhere, please.)
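The idea of storing the bare identifier and only prepending a (configurable) resolution service when needed could look roughly like this. It is a sketch under assumptions: the function name `doi_to_url` and its `resolver` parameter are made up for illustration, and no percent-encoding of unusual characters in DOI names is attempted.

```python
DEFAULT_RESOLVER = "https://doi.org/"  # TLS-enabled proxy, as argued above


def doi_to_url(identifier: str, resolver: str = DEFAULT_RESOLVER) -> str:
    """Turn a stored 'doi:xx.yyyy/zzzzzz' identifier into a resolvable URL.

    The resolver is only prepended here, at resolution time; the stored
    identifier itself never contains it.
    """
    prefix, sep, name = identifier.partition(":")
    if not sep or prefix.lower() != "doi" or not name:
        raise ValueError(f"not a doi: identifier: {identifier!r}")
    return resolver.rstrip("/") + "/" + name


print(doi_to_url("doi:10.1098/rspb.2013.3120"))
print(doi_to_url("doi:10.1098/rspb.2013.3120", resolver="https://dx.doi.org"))
```

With this split, two stored identifiers can be compared as opaque strings regardless of which resolver a given tool is configured to use.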

matentzn commented 2 years ago

See also

cthoyt commented 2 years ago

See also

* [Update DOI entry biopragmatics/bioregistry#288](https://github.com/biopragmatics/bioregistry/pull/288)

* [Use preferred DOI uri_format biopragmatics/bioregistry#316](https://github.com/biopragmatics/bioregistry/pull/316)

Summary of discussion on the Bioregistry:

The DOI triple store uses the http://dx.doi.org/ but the DOI resolution factsheet specifies that https://doi.org/DOI is the preferred format:

Users may resolve DOI names that are structured to use the DOI system Proxy Server (https://doi.org (preferred)). The resolution of the DOI name in this case depends on the use of URL syntax: the example DOI name doi:10.10.123/456 would be resolved from the address: "https://doi.org/10.123/456". Any standard browser encountering a DOI name in this form will be able to resolve it. The proxy service (both doi.org and the earlier but no longer preferred dx.doi.org) is accessible over IPv6, and supports DNSSEC. The proxy servers respond to HTTPS (preferred) as well as HTTP requests.

gouttegd commented 2 years ago

@cthoyt As I understand it, https://doi.org/xx.yyyy/zzzzzzz is the preferred format for resolution, not for storage.

I am happy to show DOIs under the form https://doi.org/xx.yyyy/zzzzzz, in contexts where users could reasonably expect resolvable URIs (though as I said I’d like the resolution service to be configurable). But I do think we should store them as doi:xx.yyyy/zzzzzz, with the resolution server only being prepended when resolution is necessary.

cthoyt commented 2 years ago

@gouttegd since within an OWL file or other semantic web context, you can have the prefix doi correspond to https://doi.org/, I think we are on the same page :)

matentzn commented 2 years ago

@cthoyt and @gouttegd are on a similar page but I think not the same:

@cthoyt correct me if I am wrong, I think you are talking about RDF prefixes. So declaring prefixes up top and then seeing the prefix syntax in the RDF file.

@gouttegd correct me if I am wrong, I think you are talking about representing dois as "doi:xx.yyyy/zzzzzz"^^xsd:string so that protege users can simply curate the curie string rather than the full URI.

Needless to say, having to spell this difference out is madness in and of itself. Obviously what we want is the first, but since we can't have that (without getting someone to fix our tools), the question is whether we have to use the second.

cthoyt commented 2 years ago

I would strongly oppose "doi:xx.yyyy/zzzzzz"^^xsd:string as it is effectively representing a URI as a string, and since we're in the semantic web context, I would much rather represent a URI with a URI

gouttegd commented 2 years ago

Looking at what is done in Uberon at the moment, I see a lot of things like that:

<oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">https://doi.org/10.1098/rspb.2013.3120</oboInOwl:hasDbXref>

To me this is the worst of both worlds: the cross-reference looks like a URL, but is actually a string, and a string in which we mix the actual identifier (the DOI) and the resolution service…
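Cross-references of this kind (URL-looking values stored as string literals) can be found mechanically. Below is a rough sketch using only the Python standard library; the function name `url_like_string_xrefs` and the sample document (including the `http://example.org/term1` subject) are invented for illustration, and a real QC check would more likely be written in SPARQL or with an RDF library.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OIO_NS = "http://www.geneontology.org/formats/oboInOwl#"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def url_like_string_xrefs(rdfxml: str) -> List[str]:
    """Return hasDbXref values that are literals but look like HTTP(S) URLs."""
    root = ET.fromstring(rdfxml)
    hits = []
    for elem in root.iter(f"{{{OIO_NS}}}hasDbXref"):
        # An element with rdf:resource points at an IRI, not a literal.
        if elem.get(f"{{{RDF_NS}}}resource") is not None:
            continue
        datatype = elem.get(f"{{{RDF_NS}}}datatype")
        text = (elem.text or "").strip()
        if datatype in (None, XSD_STRING) and text.lower().startswith(("http://", "https://")):
            hits.append(text)
    return hits


# Hypothetical sample: one string-typed URL, one real IRI, one CURIE string.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <rdf:Description rdf:about="http://example.org/term1">
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">https://doi.org/10.1098/rspb.2013.3120</oboInOwl:hasDbXref>
    <oboInOwl:hasDbXref rdf:resource="https://doi.org/10.1098/rspb.2013.3120"/>
    <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FBbt:01234567</oboInOwl:hasDbXref>
  </rdf:Description>
</rdf:RDF>"""

print(url_like_string_xrefs(SAMPLE))
```

Only the first cross-reference in the sample is reported; the `rdf:resource` form and the plain CURIE string are left alone.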

matentzn commented 2 years ago

I tend to agree @cthoyt, but this concern can be split into a curation and a release problem. We can have the curators write the dois as strings, and then introduce a post processing step that turns them to IRIs. That's been done for years, just not.. consistently enough.

matentzn commented 2 years ago

@gouttegd 100% that is the worst - that needs to go.

gouttegd commented 2 years ago

> I would strongly oppose "doi:xx.yyyy/zzzzzz"^^xsd:string as it is effectively representing a URI as a string

Why not "doi:xx.yyyy/zzzzzz"^^xsd:anyURI?

After all doi:xx.yyyy/zzzzzz is a URI. It is fully compliant with the generic URI syntax set forth in RFC 3986 (with doi as the scheme, and xx.yyyy/zzzzzz as a rootless path component). And doi is registered (though only “provisionally“, for now) as a valid URI scheme in the IANA registry.
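The scheme/path split described here can be observed with a generic URI parser. A small illustration with Python's `urllib.parse` follows; it only shows how the generic syntax is split, and does not validate the string against RFC 3986 or check the IANA registry.

```python
from urllib.parse import urlsplit

parts = urlsplit("doi:10.1098/rspb.2013.3120")

# 'doi' is parsed as the scheme; there is no authority component,
# and the DOI name ends up as a rootless path.
print(parts.scheme)
print(repr(parts.netloc))
print(parts.path)
```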

matentzn commented 2 years ago

@gouttegd this is not enough if you want to integrate with the rest of the world, e.g. Wikidata - you need to be able to make statements about a DOI, like when it was created and who it was created by; xsd:anyURI doesn't allow for that!

gouttegd commented 2 years ago

@matentzn Sorry, I don’t follow you here. What does this need to “make statements about a doi” have to do with how we store and/or show DOIs?

I don’t see how any of the ways we can use to represent DOIs (be it https://doi.org/xxx, doi:xxx, or whatever) would allow that…

matentzn commented 2 years ago

Basically, making it possible to have a triple like:

<https://doi.org/xx.yyyy/zzzzzzz> dc:creator orcid:123 .

EDIT: this statement could come, for example, from wikidata.

You can't have a literal, even if it is an xsd:anyURI, in the position of the subject of a triple!

gouttegd commented 2 years ago

But if xsd:anyURI is not suitable, then xsd:string isn’t either, right?

So what’s the solution then? Should we represent DOIs as first-class objects?

matentzn commented 2 years ago

Exactly - this is what @cthoyt suggests and I think so too. But it does not answer the question of how to store the DOIs in curation.. Here I kinda tend to using DOI + xsd:anyURI plus SPARQL update command for expansion to entity, but I am not sure if I.. underestimate the overhead this introduces to the community. Also, it makes QC a bit harder. I would want some more feedback on this from @balhoff @cmungall @jamesaoverton about what they think..

gouttegd commented 2 years ago

So if I understand correctly, ultimately you’d want something like that:

<oboInOwl:hasDbXref rdf:resource="https://doi.org/10.1098/rspb.2013.3120" />

matentzn commented 2 years ago

Yeah correct.

gouttegd commented 2 years ago

OK, but then I am still annoyed by the fact that we take something that is already a URI (doi:10.1098/rspb.2013.3120) and we make it into another URI that mixes the identifier and the resolution service…

Ideally, I’d rather use the identifier “as is“: rdf:resource="doi:10.1098/rspb.2013.3120" and let the tooling (Protégé, ROBOT, whatever) do the resolution.

matentzn commented 2 years ago

You and a million other people alike share that annoyance :)

graybeal commented 2 years ago

I don't know if it needs to be said, but there are 4 upvotes for the original proposal (https://doi.org/DOI), including mine, and I haven't seen anything to change my mind. I would apologize for DOI not having gotten this right 10 or 15 years ago, if that helps... (sorry, too snarky! very glad they have come around!!)

dosumis commented 2 years ago

My (possibly dumb) perspective: When building ontology-driven resources (e.g. VFB; OLS), it's much easier to deal with database_cross_references if the values are all CURIEs, as long as I have a context mapping I can use to expand them -> URLs to drive linkouts.

Mixing in http: and https: means I need special code for those cases. It also opens the door to allowing in random, potentially unstable URLs (how will I tell the difference between these and stable ones like those for DOIs? More special code?).

Mixing in patterns like this would be a burden on curators and also need special code for checking and using. I guess it might be more sustainable than plain https...

So <oboInOwl:hasDbXref "doi:10.1098/rspb.2013.3120"> is very much preferable.

Also note - if these are added by hand by ontology editors, I don't want to have to rely on them getting the typing right in order to be able to reliably process downstream.

matentzn commented 2 years ago

VOTE 2: Curate DOIs as curies vs IRI

As @graybeal points out we have all but decided to use IRI syntax for release files. This vote is not really OBO business at all, but a show of hands may be useful:

Should we implement a pipeline that allows curators to curate DOIs as CURIE strings and later expand them to IRI in the release?

@graybeal sorry about mixing two issue here in one, should have been clearer.

The remaining question is the one @dosumis is raising in his issue:

  1. 👍 We curate curie strings (a very complex concept for non-OBO people) in xrefs and postprocess to expand to the DOI IRIs in the release (burden on pipeline developers)
  2. 🎉 We curate IRIs from the start (burden on curators)

IMO both are fine, I am 49/51 preferring 2, but I am happy to support @dosumis and @gouttegd in their quest for 1.

dosumis commented 2 years ago

(BTW - I disagree with the division of burden. A mix of IRIs and Curies (which are still the majority of xref values) is a burden on developers, it's pretty easy for editors.)

alanruttenberg commented 2 years ago

Looking back I see I started a response over a year ago supporting Hilmar's, but forgot to send it. I still support that option. So I'll put in another vote to use the IRI form https://doi.org/...

While it is true that the recommended proxy server has changed over time I don't see this as a problem. All the forms still resolve. I think we can be confident that they will continue to be resolved.

From the handbook:

3.8.1. Resolving DOIs using the Proxy Server System

The DOI system uses the Handle System® to manage digital objects (see the DOI Factsheet "DOI System and the Handle System"). At the infrastructure level, DOI names are handles.

The DOI system Proxy Server is basically a web server that knows how to talk to the Handle System, and at this writing, most DOI® names found on the web are embedded in URLs that use the proxy server for DOI name resolution.

Display of DOIs is a separate matter. If the desire is to display the IRIs as "DOI:..." use a rdfs:label annotation.

I think the two things that should be important for us are

1) We all use the same form
2) That form is resolvable without fuss

Handles are resolvable but not without fuss as the handle protocol, unlike the HTTP protocol, is not supported by many clients. I just put DOI:10.1098/rspb.2013.3120 into Firefox and into Chrome. Neither resolved the IRI and instead did a search. Moreover, when I went to the page with the hit the DOI was displayed as https://doi.org/10.1098/rspb.2013.3120

matentzn commented 2 years ago

@alanruttenberg Yeah, I think we all but made that choice. What remains though is the question of whether ontology curators, when using Protege, should record IRIs, or whether they can be left to record DOI: compact ids, which we then unfold into the IRI form https://doi.org/... using SPARQL. Here is where we have quite a bit of disagreement.

gouttegd commented 2 years ago

As another detail to keep in mind, please note a limitation of the OBO format:

If the curator enters doi:xx.yyyy/zzzzzz as an IRI in Protégé (which in my opinion would be The Right Thing™️ to do), this produces the expected output in OWL Functional Syntax (Annotation(oboInOwl:hasDbXref <doi:xx.yyyy/zzzzzz>)) and OWL RDF/XML syntax (<oboInOwl:hasDbXref rdf:resource="doi:xx.yyyy/zzzzzz"/>).

However, in OBO Format this is simply translated as xref: doi:xx.yyyy/zzzzzz or [doi:xx.yyyy/zzzzzz] (depending on what the cross-reference is applied to). Because there’s no typing in the OBO Format, and because the obo2owl library presumably does not recognise doi as a valid URI scheme, this is interpreted as a literal string, not an IRI.

(Adding that to my mental list of “reasons why the OBO format should die”.)

matentzn commented 2 years ago

I don't think <doi:123/123> is on the table.. not even sure this is a valid IRI. only doi:123/123 is! Even if the first is permissible in RDF, it is highly idiosyncratic - IRIs must be interpretable as URLs, otherwise we have to rethink too much of our technology stack.

gouttegd commented 2 years ago

> I don't think <doi:123/123> is on the table.. not even sure this is a valid IRI

Per RFCs 3986 and 3987, it is.

> only doi:123/123 is!

I guess you meant https://doi.org/123/123 here, otherwise there’s something I don’t understand.

Anyway: Leaving aside the question of whether we should treat DOIs as IRIs (which they formally are, though nobody treats them as such), shall cross-references in general be represented as:

The first option seems to be the preferred option, because 1) it is immediately resolvable (if we restrict ourselves to HTTP IRIs); 2) it allows to “make statements about the IRI“.

The fourth option seems to be the worst, because 1) it is not resolvable until the prefix has been expanded; 2) it does not allow to “make statements about the IRI“.

As far as I can tell, at the moment we are overwhelmingly using the fourth option, even for non-DOI cross-references.

The solution proposed here for DOIs (and which could be generalised for all cross-references), if I understand correctly, would be to use the 4th form (CURIEs as literal strings) when editing, and then to transform them into the 1st form (real IRIs) when building the release artefacts.

But that solution does not solve the problem that, according to a message above, we want to be able to “make statements about an IRI“ – how could we do that if, at edit time, the IRI is only represented as a string?

matentzn commented 2 years ago

Your summary is exactly correct. So the last question you ask is:

> But that solution does not solve the problem that, according to a message above, we want to be able to “make statements about an IRI“ – how could we do that if, at edit time, the IRI is only represented as a string?

And the answer: you cannot! The assumption is that while we are editing an ontology, we do not want to capture metadata about a DOI. Which may be wrong, but, for now, that is the normal practice.

gouttegd commented 2 years ago

> we do not want to capture metadata about a DOI. Which may be wrong

Is it?

Now that I think of it, why would we want to “make statements about a DOI”? I think this would be akin to importing a term from a foreign ontology and then adding our own statements about that term – something that I thought was frowned upon.

matentzn commented 2 years ago

It's certainly not common practice. Metadata about a DOI should be obtained from the DOI metadata provider.

gouttegd commented 2 years ago

OK, then I would support treating DOIs the same way as we currently treat other cross-references:

1) curate as literal CURIE strings (PREFIX:zzzzzz, which for DOIs would mean doi:zzzzzz);
2) optionally (we currently don’t do that, but we could) expand all CURIEs in cross-references upon release.

“Expanding” meaning that, e.g.,

Sure, "doi:10.1234/abcdef"^^xsd:string is not resolvable at edit-time (e.g. in Protégé), but then neither is "FBbt:01234567"^^xsd:string and until now we had no problem with that.

I would not object to curate as IRIs directly, but if we do then I strongly believe we should do so for all cross-references and not only for DOIs. That is, if we ask curators to curate a DOI xref as https://doi.org/10.1234/abcdef (as IRI, not as string), then we should also ask them to curate a FBbt xref as http://purl.obolibrary.org/obo/FBbt_01234567 and not as FBbt:01234567.
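The optional expansion step on release could be sketched as follows. This is an illustration only: the `PREFIX_MAP` contents are assumptions built from the two examples used in this comment, and `expand_xrefs` is a hypothetical helper, not an existing ROBOT or ODK command.

```python
from typing import Dict, List, Tuple

# Assumed prefix map, built from the two examples in this comment.
PREFIX_MAP: Dict[str, str] = {
    "doi": "https://doi.org/",
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
}


def expand_xrefs(
    xrefs: List[str], prefix_map: Dict[str, str] = PREFIX_MAP
) -> Tuple[List[str], List[str]]:
    """Split CURIE strings into (expanded IRIs, values left untouched).

    Values whose prefix is not in the map stay as they were curated,
    so that a release never silently invents an IRI.
    """
    expanded, untouched = [], []
    for xref in xrefs:
        prefix, sep, local_id = xref.partition(":")
        if sep and prefix in prefix_map:
            expanded.append(prefix_map[prefix] + local_id)
        else:
            untouched.append(xref)
    return expanded, untouched


iris, leftovers = expand_xrefs(["doi:10.1234/abcdef", "FBbt:01234567", "XYZ:42"])
print(iris)
print(leftovers)
```

Treating every prefix through the same map is what keeps DOIs consistent with the other cross-references, rather than special-casing them.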

matentzn commented 2 years ago

Yes, what you are describing is pretty much my suggestion as well. However, we are in the unfortunate situation that a lot of tooling that is built for using OBO ontologies consumes them in our beloved OBO format - that won't change (I repeat - no matter how hard we hope - it won't change). So changing all xrefs to follow the logic proposed here by you (in the last comment) will basically break dozens of tools that have learned to handle xrefs with reference to OBO terms using the CURIE syntax. It is all bad.

But if I may say so - even a small step forward (or backwards, anywhere really) in OBO borders on the impossible. There is always someone "strongly against X" - this is just the nature of open source, and bottom up organisations. Pushing through a change that essentially turns dbxrefs into range IRI is going to cause a lot of resistance. Alternatively, I would suggest quite the opposite:

Trust me, if we open this issue now to general handling of hasDbXref range we will have a mutiny.

gouttegd commented 2 years ago

OK, so no expansion then (I did say “optionally”).

Then I am even more strongly in favour of curating DOIs as "doi:10.1234/abcdef"^^xsd:string – that is, not prepending any resolution service. We do not prepend purl.obolibrary.org to xrefs pointing to other OBO ontologies, why should we prepend doi.org to xrefs pointing to DOIs? Let’s just accept that xrefs are opaque strings that require consumers of the ontology to know how to deal with them. At least we’d be consistent.

In the meantime, we can try to slowly move away from OBO-style xrefs where we can (e.g. for new terms) in favour of annotations with a more defined semantic, where we wouldn’t be so constrained by existing technical debt and could possibly use real IRIs because nobody would be expecting anything anyway.

matentzn commented 2 years ago

tldr; summary of thread

Alright, makes sense. Let's summarise this thread for all tldr readers:

gouttegd commented 2 years ago

Just: where we expand, could we expand DOIs to TLS-enabled URLs please? https://doi.org/ instead of http://doi.org/. Every time we use a plain HTTP URL, somewhere in the world a cryptographer dies.

(And I’d love to be able to do the same for PURLs, but one mutiny at a time, I guess. :D )

(It really is a pity that the HTTP protocol makes the use of encryption visible in the scheme even though it should be a purely technical detail…)

matentzn commented 2 years ago

Oops typo. Sorry @gouttegd of course - see first vote. Changed comment above.

dosumis commented 2 years ago

> we just use IRI syntax from the start

And then PubMed, or FlyBase or whoever changes their URL pattern and we have tens of thousands of outdated IRIs to update, or, more likely, no update and lots of 404s.

dosumis commented 2 years ago

Another example: VFB uses xrefs for external DB identifiers. The xref system allows us to roll these into URLs to some resources, but also use them for database queries. FlyBase Identifiers work in the same way. We can use a FlyBase pub identifier for SQL queries of the FlyBase DB and to roll a link to the relevant FlyBase reports page. Am I supposed to parse IRIs instead?

Am I missing something?
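For what it is worth, getting a database identifier back out of an IRI is mechanical as long as a prefix map is available. A minimal sketch follows, assuming a hypothetical map with the `doi` and `FBbt` entries used elsewhere in this thread; `contract_iri` is an invented name.

```python
from typing import Dict, Optional, Tuple

# Assumed prefix map; a real one would come from a shared context file.
PREFIXES: Dict[str, str] = {
    "doi": "https://doi.org/",
    "FBbt": "http://purl.obolibrary.org/obo/FBbt_",
}


def contract_iri(iri: str, prefix_map: Dict[str, str] = PREFIXES) -> Optional[Tuple[str, str]]:
    """Return (prefix, local id) for an IRI, or None if no namespace matches.

    The longest matching namespace wins, in case one namespace is a
    prefix of another.
    """
    best = None
    for prefix, namespace in prefix_map.items():
        if iri.startswith(namespace) and (best is None or len(namespace) > len(best[1])):
            best = (prefix, namespace)
    if best is None:
        return None
    prefix, namespace = best
    return prefix, iri[len(namespace):]


print(contract_iri("https://doi.org/10.1234/abcdef"))
print(contract_iri("http://purl.obolibrary.org/obo/FBbt_01234567"))
```

The local id recovered this way can then be used for database queries exactly as a CURIE-derived one would be; whether that extra step is acceptable is the point under debate.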

matentzn commented 2 years ago

Valid point! That's why we act on a case-by-case basis for making this decision. But DOIs and ORCIDs, the subject of this issue here, are stable enough for the plan proposed here. Remember that RDF/OWL also have a prefix system; so if we do it properly, and URLs change, we only need to change the prefix declaration itself. You can envision a technical solution for updating the RDF prefixes in a more dynamic fashion from a centralised context, and then you just need to update that context.

And remember - no one here (as far as I understand) is against CURIE:syntax. Just against "CURIE:syntax"^^xsd:string.

This is ok:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix BFO: <http://purl.obolibrary.org/obo/BFO_> .
@prefix FB: <http://purl.obolibrary.org/obo/FB_> .
@prefix oboInOwl: <http://www.geneontology.org/formats/oboInOwl#> .
@base <http://www.w3.org/2002/07/owl#> .

[ rdf:type owl:Ontology ] .

BFO:0000050 rdf:type owl:ObjectProperty .
BFO:0000050 oboInOwl:hasDbXref FB:001 .

Your suggestion is basically this:

BFO:0000050 oboInOwl:hasDbXref "FB:001"^^xsd:string .

With some magic tool client side knowing where to obtain the OBO context and unfolding the CURIE on usage. This is also not great from a user's point of view. Don't think FB for FlyBase. Think hundreds of ontologies with thousands of prefixes - how will you ever manage this madness?

This would be best:

@import prefixes from https://github.com/biopragmatics/bioregistry/blob/main/exports/contexts/bioregistry.context.jsonld
@base <http://www.w3.org/2002/07/owl#> .

[ rdf:type owl:Ontology ] .

BFO:0000050 rdf:type owl:ObjectProperty .
BFO:0000050 oboInOwl:hasDbXref FB:001 .

Haha but unfortunately I don't think this can happen.

matentzn commented 2 years ago

To echo your point: where a hasDbXref is literally an internal database id without any hope to resolve usefully on the web, we could probably just leave it be. No unfolding. But it is annoying that people need to look up what database the id is for, and to make sure that is clear, we need to write some very specific validation code.

alanruttenberg commented 2 years ago

There's no reason OBO annotators can't enter this as doi:xxx. This can be translated into an IRI at OWL translation time. The dbxrefs are not fun because they aren't unique identifiers and can't be easily dereferenced. In some cases this is unavoidable because it really is a database that is being referred to, and no mechanism has been provided to access individual records on the web. But it's a loss when the dbxref is to something that does have a legitimate IRI and it's much more useful to have the IRI than the unresolvable xref. Even in the cases where it's a database, there's not enough information for a proper citation. Somewhere, I'm sure, there's a file which says what those prefixes mean, but I don't know what it is because it isn't cited in the OWL file.

alanruttenberg commented 2 years ago

@dosumis

> it's much easier to deal with database_cross_references if the values are all CURIEs, as long as I have a context mapping I can use to expand them -> URLs to drive linkouts.

I think you are misunderstanding CURIEs. They are not meant as a mechanism to enable run-time expansion. They are abbreviations for specific IRIs and are, in any case, not intended for use in RDF/XML. In RDF/XML, abbreviations can be defined using namespaces or entities.