w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)
https://w3c.github.io/rdf-canon/spec/
Other
13 stars 8 forks source link

[Editorial] Discrepancy in the way we refer to Blank Node IDs #46

Closed iherman closed 1 year ago

iherman commented 1 year ago

There is a slight discrepancy between the RDF terminology and the terminology used in our specification document for blank node identifiers.

The description of the algorithm, all the examples, and probably all the implementations, systematically use the format _:xyz for blank node identifiers. This is also significant, e.g., in the Hash Related Blank Node algorithm that uses the blank node id-s as part of a path that gets hashed at a later step.

This is not in line with the RDF Concepts and Abstract Syntax spec, that speaks of abstract blank node id-s in general. The usage of the _: is only a serialization artifact for Turtle, N-Triple, or JSON-LD. Note that, in RDF/XML, blank nodes are identified using the rdf:nodeID attribute, whose value is an arbitrary string; no _: there.

The reason why this matters is because RDF implementations, I presume, usually rely on the RDF spec and do not use _:, except for parsing and serialization to Turtle, N-Triple, or JSON-LD. E.g., in rdf-js a blank node is created through DataFactory.BlankNode('abc') and, more importantly for our use, when the id of a blank node is retrieved via bnode.value(), the identifier being returned is without the _:. The same happens for Python's RDFLib. This type of discrepancy can be a source of implementation bugs that can be difficult to trace.

I presume that, at this moment, it is too late to change this, because we do not want to make existing implementations fail on something like that. But it may be worth adding a note somewhere to warn the users about this potential issue and also make it clear in the implementation (e.g., in step 4 of the Hash Related Blank Node algorithm) that the _: must be presented in the constructed string being hashed. This type of warning might be helpful for implementers...

cc @pchampin @gkellogg @davidlehn @dlongley

dlongley commented 1 year ago

I presume that, at this moment, it is too late to change this, because we do not want to make existing implementations fail on something like that. But it may be worth adding a note somewhere to warn the users about this potential issue and also make it clear in the implementation (e.g., in step 4 of the Hash Related Blank Node algorithm) that the _: must be presented in the constructed string being hashed. This type of warning might be helpful for implementers...

+1 to this.

gkellogg commented 1 year ago

See https://github.com/w3c/rdf-canon/pull/45#discussion_r1037414861. I think we can change to remove the dependency on blank node labels/identifiers on the input, and simply talk about the blank nodes themselves. When talking about canonical identifiers, which are used through the algorithms, and in the context of the "normalized dataset" (which we may want to rename), I think we can do what we want, as long as we're clear about.

iherman commented 1 year ago

I think we can change to remove the dependency on blank node labels/identifiers on the input, and simply talk about the blank nodes themselves. When talking about canonical identifiers, which are used through the algorithms, and in the context of the "normalized dataset" (which we may want to rename), I think we can do what we want, as long as we're clear about.

I believe that will change the result of the canonicalization, because (currently) the _: string is part of data that is hashed. This means that the results of this algorithm will change compared to the one that has already been deployed (although, I believe, the changes will only affect the n-degree hash, so the effects may not be too serious in practice). Although I would personally prefer to make the algorithm "clean" (ie, without the _: all around), I am not sure we want to knowingly introduce a backward incompatibility...

gkellogg commented 1 year ago

We don't hash input blank node identifiers, only temporary or canonical identifiers. This shouldn't affect the results in any way, just as if the blank node identifiers on the input were randomized, or entirely absent. In the abstract syntax, there are no identifiers anyway.

TallTed commented 1 year ago

[@iherman] backward incompatibility

Would this (possible) backward incompatibility be with anything at CR/PR/REC stage, or just with earlier draft efforts?

Presuming it's drafts, I don't think we should be overly concerned with any backward incompatibility, especially when the change results in something significantly "cleaner".

iherman commented 1 year ago

We don't hash input blank node identifiers, only temporary or canonical identifiers. This shouldn't affect the results in any way, just as if the blank node identifiers on the input were randomized, or entirely absent. In the abstract syntax, there are no identifiers anyway.

We may want to check the code overall; hopefully you are right. But we do have text like these in the text:

If the blank node's existing blank node identifier matches the reference blank node identifier then use the blank node identifier :a, otherwise, use the blank node identifier :z.

This is misleading and may lead to bugs (eg, the issue identifier algorithm might return _: which may affect the algorithms' results in terms of the calculated hash...

gkellogg commented 1 year ago

@iherman wrote:

We may want to check the code overall; hopefully you are right. But we do have text like these in the text:

If the blank node's existing blank node identifier matches the reference blank node identifier then use the blank node identifier :a, otherwise, use the blank node identifier :z.

These are temporary identifiers used in Hash First Degree Quads.

We will certainly need to account for the leading _: in identifiers, but that's entirely an internal algorithmic matter, and can be better described in the text.

@TallTed wrote:

Would this (possible) backward incompatibility be with anything at CR/PR/REC stage, or just with earlier draft efforts?

Presuming it's drafts, I don't think we should be overly concerned with any backward incompatibility, especially when the change results in something significantly "cleaner".

There are no existing official specs, but stable implementations that are in the field, and preserving compatibility with them, but not changing the output of the algorithms is a consideration. In any case, I don't see this as requiring any such change (not that something else might come up which would require a URDNA2022).

iherman commented 1 year ago

For the records: I went through my code with a magnifying glass, and removed all traces of _: and possible conversion to and from a _:xyz to xyz (the parsing from turtle is an external library, so is quad conversion, so a black box to me). The results of the two examples in the spec are identical, including all the internal logging points I have with the hash values. Ie, my worries were unfounded; the issue is entirely editorial in the text itself.

gkellogg commented 1 year ago

I asked GPT-3 the difference between blank node identifiers and labels. I don’t think it’s right, but it is a novel interpretation.

"What is the difference between a blank node identifier and a blank node label?"


"A blank node identifier is a unique identifier assigned to a blank node in a graph. It is a random string of characters that is used to identify the blank node within the graph. A blank node label is a human-readable name assigned to a blank node in a graph. It is used to make the graph easier to read and understand."

iherman commented 1 year ago

I could actually live with separation of terms along these lines. For blank node identifiers, the RDF11 Concepts and Abstract Syntax says (in a note):

Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. […] Blank node identifiers are not part of the RDF abstract syntax, but are entirely dependent on the concrete syntax or implementation. […]

The RDF11 Semantics document is silent on the subject, and just says

Blank nodes are treated as simply indicating the existence of a thing, without using an IRI to identify any particular thing.

That means it is perfectly fine if we use the term "labels" for the thing we use in the algorithm.