Canonical citation identifiers/URLs

strogonoff commented 2 years ago

Originally, we used source dataset references as canonical identifiers. Those are effectively filenames.

This was suboptimal for several reasons: those filenames aren’t themselves canonical; they aren’t guaranteed to be universally unique (only unique within a dataset); a single citation could be composed of data from multiple datasets; and, generally, these references are implementation details/artifacts of citation sourcing logic and can change (external tools that collect citation data aren’t concerned with consistency of URLs, nor should they be).

Now, we’re switching to document identifiers.

However, the problem is that one citation/bibliographic item can have multiple identifiers: one of type DOI, one of type ISBN, and even multiple identifiers of the same type formatted slightly differently (e.g., IETF: https://github.com/ietf-ribose/relaton-data-rfcsubseries/issues/4).

One implication is that, if we depend on document identifiers, we can have multiple (sometimes nearly identical) URLs leading to the same citation (/types/IETF/RFC+3972/, /types/IETF/RFC3972/ and so on). This at least makes BibXML service a somewhat badly behaved Web citizen.
We could treat the first identifier as canonical and always redirect to it, but it’s unclear whether the Relaton model gives any significance to the identifiers’ order of appearance (if it’s rooted in XML, it probably doesn’t, because ordering doesn’t matter there). And if the order changes due to external tool implementation details, canonical URLs will break.
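
To illustrate the ordering concern, here is a minimal Python sketch. The `DocID` type and both helper functions are hypothetical and not part of Relaton or the BibXML service, and the identifier values are only examples. It compares "take the first identifier" with a choice that ignores list order.

```python
from typing import NamedTuple, Sequence


class DocID(NamedTuple):
    """Hypothetical stand-in for a Relaton document identifier."""
    type: str
    id: str


def first_listed(docids: Sequence[DocID]) -> DocID:
    # Depends on the order the sourcing tool happened to emit.
    return docids[0]


def order_independent(docids: Sequence[DocID]) -> DocID:
    # Tuples compare field by field, so min() ignores list order.
    return min(docids)


today = [DocID("IETF", "RFC 3972"), DocID("DOI", "10.17487/RFC3972")]
tomorrow = list(reversed(today))  # same identifiers, different order

print(first_listed(today) == first_listed(tomorrow))
print(order_independent(today) == order_independent(tomorrow))
```

Sorting removes the dependence on order, but the chosen identifier would still change if an identifier is added, removed, or reformatted upstream, so it does not make the URL stable by itself.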

One way would be to come up with our own identifier. It could be derived from citation data in some way (but that makes it liable to break if citation data changes), or it could be truly random, like a UUID, with indexing logic ensuring it stays the same. I’m looking into this option.
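
A rough sketch of the trade-off between the two options, in Python. The `derived_id` and `random_id` names, the hashing scheme, and the sample item are assumptions made for illustration only; nothing here reflects how the service actually indexes data.

```python
import hashlib
import json
import uuid


def derived_id(item: dict) -> str:
    # Hash of the item's data: reproducible, but tied to the data itself.
    payload = json.dumps(item, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def random_id() -> str:
    # Independent of the data; must be stored so it can be reused later.
    return str(uuid.uuid4())


item = {"title": "Example title", "docid": [{"type": "IETF", "id": "RFC 3972"}]}
corrected = dict(item, title="Example title (corrected)")

# Any change in the citation data yields a different derived identifier.
print(derived_id(item) == derived_id(corrected))
```

The derived identifier needs no storage but changes with the data, while the random one never changes as long as the indexing logic keeps track of which item it was assigned to.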

strogonoff commented 2 years ago

What I’m going to do for now is avoid canonical URLs of the form /type/IETF/RFCXXXX in both API and GUI, and instead provide an endpoint à la /get-citation-by-docid?type=IETF&id=RFCXXXX, which would at least resolve the issue of a citation being accessible from multiple similar URLs.
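
For illustration, a client could build such a query URL as sketched below. The base URL is a placeholder and the helper name is made up; only the endpoint path and the `type`/`id` parameters come from the comment above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def citation_query_url(doc_type: str, doc_id: str,
                       base: str = "https://bibxml.example") -> str:
    # The identifier travels as a query value, not as a path segment.
    query = urlencode({"type": doc_type, "id": doc_id})
    return f"{base}/get-citation-by-docid?{query}"


url = citation_query_url("IETF", "RFC 3972")
print(url)

# Decoding the query recovers the values exactly as they were supplied.
print(parse_qs(urlsplit(url).query))
```

With this shape, differently spelled identifiers become different search inputs to one endpoint and no longer look like distinct resources.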

However, it’d still be useful to provide a single canonical URL for a citation in HTML; that remains unsolved for now.

ronaldtse commented 2 years ago

> One implication is that, if we depend on document identifiers, we can have multiple (sometimes nearly identical) URLs leading to the same citation (/types/IETF/RFC+3972/, /types/IETF/RFC3972/ and so on). This at least makes BibXML service a somewhat badly behaved Web citizen.

It is actually common for some documents to have multiple canonical identifiers.

An identical document can have multiple document identifiers. This is the guidance from ISO 690.

So this solution is much better than the "single canonical" one.

strogonoff commented 2 years ago

I think we’re talking about different layers of abstraction.

As a web resource, we provide some entities (citation or bibitem, doesn’t matter in this context), and we want to have a 1:1 mapping between any entity (our subresource) and its canonical URL. (We allow discovering that entity by various means, but only one URL should be considered canonical.)

On the subject domain layer, our entities do have identifiers, but there are multiple of them with equal power (there’s no “main one”). Some identifiers have scopes (which suggests they are not global). Moreover, I believe entity identifiers can change on a whim, as we can’t rely on a single standard (like NIST PubID) adopted across the industry. The closest we get is a DOI, but not every citation has a DOI.

Back to the web resource layer: if we use these subject domain identifiers, the same logical entity would appear as multiple distinct subresources and, worse, tomorrow it may appear as different subresources than today if sourcing logic changes. So they don’t quite satisfy the web resource requirement of a unique canonical identifier per subresource.

Not that it’s a big issue as far as immediate requirements go, but I’m sure this will have negative implications later on (broken external links, etc.).

ronaldtse commented 2 years ago

A few of the topics mentioned here relate to identifiers.

Let's first adopt the terminology from ISO 690 where a "citation" is a mention of an "information resource", which has metadata described in a "bibliographic item", and that the data in the "bibliographic item" can generate a rendering for a "citation".

The "web resource layer" mentioned is about a PURL (Persistent URL). A DOI is often the solution, but only a partial one. A DOI points to something, and that something can change. For example, NIST DOIs point to the latest revision of a given document. The target can be updated with every new edition, i.e., it is no longer persistent.

The "subject matter layer" mentioned is about an identifier for the "information resource", aka the "publication". All identifiers issued by organizations are naturally scoped (to the organization). NIST PubID, technically, is scoped to within NIST. In practice, their identifiers are likely to be unique because no one else will call their documents "NIST blah". A DOI is guaranteed to be globally unique. If you have the time to read through our latest blog post on NIST PubID, you can see that NIST PubID has a Machine-Readable (MR) form that is intended to generate a DOI identifier (that can be converted back to a human-readable one).

I really like the idea of unique PubIDs and would advocate every SDO develop their own PubID scheme. But we can't force anyone to do that. I think it would be possible to convince many to do so, but that's still lots of work (unless we provide an easy way to adopt), and a long way off.

If we can synchronise the PubID and the DOI as done in NIST PubID, would you be happier?

strogonoff commented 2 years ago

> If we can synchronise the PubID and the DOI as done in NIST PubID, would you be happier?

I’ll start from the end: there is no such thing as “PubID” if you go to https://github.com/relaton/relaton-models/blob/main/grammars/biblio.rnc and Ctrl+F, and I first heard about PubID as a concept in the Relaton domain (not NIST PubID) only a couple of days ago in an internal discussion.

> Let's first adopt the terminology from ISO 690 where a "citation" is a mention of an "information resource", which has metadata described in a "bibliographic item", and that the data in the "bibliographic item" can generate a rendering for a "citation".

I’m on board with that, and I’m already in the process of updating the BibXML service source/docs to use “bibliographic item” instead of “citation”.

> For example, NIST DOIs point to the latest revision of a given document. The target can be updated with every new edition, i.e., it is no longer persistent.

I wouldn’t require a resource identified by a URL to be immutable, just that it remains logically the same thing after such changes and that the (canonical) URL persists across changes.

From my understanding:

- Our key entity is a bibliographic item, described per the Relaton model.
- Experience shows that, in web service architecture, ensuring each entity has a unique persistent identifier is a good way to avoid certain issues in the long term.
- A bibliographic item has some document identifiers in its subject domain: a DOI, an ISBN, some publisher-assigned identifier, etc. (the matter of document identifier types in Relaton is discussed elsewhere).
- However, if the BibXML service can’t reliably derive from the item’s subject domain data (BibliographicItem schema) a single canonical identifier that will be stable regardless of any organizational changes/politics/etc., it should assign each item an impartial random identifier (e.g., UUID) that holds these properties within the BibXML service itself (i.e., not part of the BibliographicItem schema).
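
The service-internal identifier described above could be sketched as follows. `ItemRegistry` and its in-memory dictionary are hypothetical; a real index would have to persist the mapping, and the identifier values are only examples.

```python
import uuid
from typing import Dict, Iterable, Tuple

DocIDKey = Tuple[str, str]  # (type, id), as found in the sourced item


class ItemRegistry:
    """Maps every known docid of an item to one service-assigned UUID."""

    def __init__(self) -> None:
        self._by_docid: Dict[DocIDKey, str] = {}

    def resolve(self, docids: Iterable[DocIDKey]) -> str:
        docids = list(docids)
        known = {self._by_docid[d] for d in docids if d in self._by_docid}
        if len(known) > 1:
            raise ValueError("docids already belong to different items")
        # Reuse the existing UUID if any docid is known, else mint a new one.
        item_id = known.pop() if known else str(uuid.uuid4())
        for d in docids:
            self._by_docid[d] = item_id
        return item_id


registry = ItemRegistry()
first = registry.resolve([("IETF", "RFC 3972"), ("DOI", "10.17487/RFC3972")])
# Reindexing later: order changed and one docid is formatted differently.
second = registry.resolve([("DOI", "10.17487/RFC3972"), ("IETF", "RFC3972")])
print(first == second)
```

As long as at least one identifier carries over between indexing runs, the item keeps its UUID. If all of its identifiers changed at once, the item would receive a new UUID, which is a limitation of this sketch.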

I don’t want to make it a bigger deal than it’s worth; it could be an implementation detail. I just want to confirm that we really can’t derive an identifier.

strogonoff commented 2 years ago

This will remain open for now. For reasons described above (possible lack of a singular canonical identifier), the service doesn’t offer reliable canonical URLs for bibliographic items, purposefully supporting querying data by GET query parameters only.

ronaldtse commented 2 years ago

> the service doesn’t offer reliable canonical URLs for bibliographic items

For the stable PubIDs, we are able to offer canonical URLs, right?

strogonoff commented 2 years ago

> > the service doesn’t offer reliable canonical URLs for bibliographic items
>
> For the stable PubIDs, we are able to offer canonical URLs, right?

Well, we don’t currently have PubIDs covering all standards. We have primary docids, but they can change.

What I think could work great is properly standardized, strongly normalized URNs, which I recall should be part of the PubID initiative. But we don’t have them yet either. (And I don’t think the service should pretend to offer canonical URLs while they can change depending on source data and index state.)

rjsparks commented 2 years ago

It's not clear if this is a discussion of a potential future enhancement or a bug that needs to be addressed - please help clarify/classify?

ronaldtse commented 2 years ago

This was a proposal, and can be considered a future enhancement.

This proposal only works for tightly structured PubIDs like "RFCnnnn" or "ISO nnnn"; it does not work for identifiers that cannot be normalized, like those from IEEE.
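
As a rough illustration of why tight structure matters, here is a Python sketch. The patterns, the target spelling ("RFC 3972" rather than "RFC3972"), and the `normalize` function are assumptions; real PubID schemes, and real ISO identifiers with parts and years, are more involved.

```python
import re
from typing import Optional

# Hypothetical patterns for identifiers of the form "<prefix> <number>".
_PATTERNS = {
    "IETF": (re.compile(r"^RFC[\s+]*0*(\d+)$", re.IGNORECASE), "RFC {}"),
    "ISO": (re.compile(r"^ISO[\s+]*0*(\d+)$", re.IGNORECASE), "ISO {}"),
}


def normalize(doc_type: str, raw: str) -> Optional[str]:
    """Return one normalized spelling, or None if no rule applies."""
    entry = _PATTERNS.get(doc_type)
    if entry is None:
        return None  # no known structure to normalize against
    pattern, template = entry
    match = pattern.match(raw.strip())
    return template.format(match.group(1)) if match else None


for raw in ("RFC+3972", "RFC3972", "rfc 3972"):
    print(normalize("IETF", raw))
print(normalize("IEEE", "IEEE 802.11"))
```

Identifier families without such a rule fall through to `None`, which is the case where a canonical URL cannot be derived from the identifier alone.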

ietf-tools / bibxml-service

Canonical citation identifiers/URLs #66