ropensci / RNeXML

Implementing semantically rich NeXML I/O in R
https://docs.ropensci.org/RNeXML

NCBI URIs #225

Closed cboettig closed 5 years ago

cboettig commented 5 years ago

@hlapp I'm struggling with the preferred way to represent NCBI ids as URIs. Which do you prefer as the base URI?

Something else entirely? Maybe a non-URL URI?

In general I feel I don't have a good solution here. Thoughts / advice greatly appreciated.

hlapp commented 5 years ago

I would argue these are three different things, and I'll add a fourth:

hlapp commented 5 years ago

So to actually try to answer your question:

[what is] the preferred way to represent NCBI ids as URIs.

I would argue it's the canonical URIs for taxa in the NCBI taxonomy. So here this would be http://ncbi.nlm.nih.gov/taxonomy/56308.
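As a rough illustration of that recommendation, here is a minimal sketch that turns a numeric NCBI taxonomy ID into a URI under the base suggested above. The function name `ncbi_taxon_uri` and the constant `NCBI_TAXONOMY_BASE` are hypothetical, made up for this example; they are not part of RNeXML or any NCBI API.

```python
# Base URI taken from the comment above; treating it as canonical is that
# comment's recommendation, not something this code can verify.
NCBI_TAXONOMY_BASE = "http://ncbi.nlm.nih.gov/taxonomy/"


def ncbi_taxon_uri(taxon_id):
    """Build a URI for an NCBI taxonomy ID given as an int or a digit string."""
    text = str(taxon_id).strip()
    # Assumption: NCBI taxonomy IDs are plain ASCII digit strings.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a numeric NCBI taxonomy ID: {taxon_id!r}")
    return NCBI_TAXONOMY_BASE + text


print(ncbi_taxon_uri(56308))
```

The check on the ID is only there so that stray text does not silently end up inside an identifier.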

cboettig commented 5 years ago

Excellent, thanks, this is just what I needed. Just to understand better, what makes http://ncbi.nlm.nih.gov/taxonomy/56308 the canonical identifier vs https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=56308 being just a web application? I don't spot anything on the webpage of the former that suggests it is particularly canonical (though it's certainly a cleaner URL than the latter), arguably a purl.org URL might be more stable?

Related: here are the base URIs I have for turning GBIF and ITIS IDs into URIs; I'm not sure whether either of them is the appropriate 'canonical' ID either:

And I haven't found any prefix for resolving Catalogue Of Life ids into a URI...

hlapp commented 5 years ago

Just to understand better, what makes http://ncbi.nlm.nih.gov/taxonomy/56308 the canonical identifier vs https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=56308 being just a web application? I don't spot anything on the webpage of the former that suggests it is particularly canonical

IMHO it's not necessarily about the content of a web page but about what is in the URI. In the former, it is the organization maintaining the database, the name of the database itself, and the ID of the record within that database. That seems like a reasonably minimal set of information to create a globally unique identifier that is also resolvable over HTTP, and thus leaves little room for other URIs that would vie for the same "canonical" designation with seemingly the same strength of justification.

In the latter, there is also the name of a web application, and the technology that the web application uses to receive and execute a user request. Although in the case of NCBI this application, and its use of this technology, has stayed around for a long time, that fact is an exception rather than the rule, eroding trust in its longevity. Also, there should be no expectation that this must remain the one and only web application browsing the NCBI taxonomy database, including at NCBI, and if there were another one, we would have multiple application-specific URIs with no clear criteria as to which one should serve as the canonical one.
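To make the comparison concrete, the following sketch splits both URIs into their parts with Python's standard `urllib.parse`. The helper name `describe` is hypothetical; the only point is to show which pieces of information each URI carries.

```python
from urllib.parse import parse_qs, urlsplit


def describe(uri):
    """Split a URI into host, non-empty path segments, and query parameters."""
    parts = urlsplit(uri)
    return {
        "host": parts.netloc,
        "path_segments": [seg for seg in parts.path.split("/") if seg],
        "query": parse_qs(parts.query),
    }


canonical = describe("http://ncbi.nlm.nih.gov/taxonomy/56308")
application = describe(
    "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=56308"
)

# Organization, database name, record ID, and nothing else.
print(canonical)
# Additionally names the browser application and its CGI script,
# and moves the record ID into a query parameter.
print(application)
```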

arguably a purl.org URL might be more stable?

Well, that goes back to the age-old observation that stability of identifiers (in the meaning of longevity and continued resolution to appropriate content, however that may be defined) in the end is not a technical but a social problem. (And incidentally, the purl.org mechanism has demonstrated this multiple times, with periods of major downtime, and far too little work on technical modernization. There's a reason all OBO ontologies use purl.obolibrary.org as a layer of abstraction.) That's one of the reasons why DOIs cost money.

cboettig commented 5 years ago

@hlapp yup, this makes sense to me, though merely inspecting the URL semantics as being appropriately minimal still feels like a less than ideal basis on which to establish a canonical URI. (e.g. compare to the ITIS base URI above, which is the only way I know to turn the ITIS number into a URI, but clearly looks fragile by these criteria). Certainly the enforceable social contract provided by DOI (i.e. not just charging for the DOI, but only allowing a limited number of certified organizations that maintain their redirects to issue DOIs) feels more reliable.

With things like the ITIS URI, and even the multiple possible NCBI URIs, it seems plausible that I might be better off with non-resolving URIs, or maybe just sticking with prefixes like NCBI:9606 (though of course sticking with prefixes instead of URIs seems arse-backwards, at least from an XML-namespace mentality).
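One way to reconcile the two options is to store compact prefixed IDs and expand them to URIs only when needed. The sketch below assumes a prefix map with a single entry, pairing the `NCBI` prefix with the base URI suggested earlier in this thread; `PREFIXES`, `expand_curie`, and `contract_uri` are hypothetical names, and no bases for other databases are assumed.

```python
# Hypothetical prefix map: only the NCBI base discussed in this thread.
PREFIXES = {"NCBI": "http://ncbi.nlm.nih.gov/taxonomy/"}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a prefixed ID such as 'NCBI:9606' into a full URI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"expected 'PREFIX:ID', got {curie!r}")
    if prefix not in prefixes:
        raise KeyError(f"no base URI registered for prefix {prefix!r}")
    return prefixes[prefix] + local_id


def contract_uri(uri, prefixes=PREFIXES):
    """Shorten a URI back to 'PREFIX:ID'; return it unchanged if no base matches."""
    for prefix, base in prefixes.items():
        if uri.startswith(base):
            return f"{prefix}:{uri[len(base):]}"
    return uri


uri = expand_curie("NCBI:9606")
print(uri)
print(contract_uri(uri))
```

If the preferred base URI changes later, only the prefix map needs updating, while the stored prefixed IDs stay the same.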