w3c / rch-wg-charter

Charter proposal for an “RDF Dataset Canonicalization and Hash Working Group”
https://w3c.github.io/rch-wg-charter/

bad terminology in explainer document #99

Closed pfps closed 2 years ago

pfps commented 2 years ago

RDF blank nodes do not have ids. The explainer document indicates that they do. This mismatch needs to be addressed. The mismatch affects much of the quasi-formal stuff in the explainer document. It looks to me as if the working group will have to develop new terminology to meet the needs of RDF canonicalization. This new terminology should not affect existing RDF terminology (i.e., don't add blank node labels to RDF graphs, instead canonicalize using a pair of an RDF graph/dataset and a mapping from its blank nodes to labels).

As this work has security implications the underlying definitions need to be hydrogen-tight (i.e., air-tight but for hydrogen, which is harder to contain).

aidhog commented 2 years ago

IIRC, David Booth raised a similar issue a while back, and we made some changes.

I think the only mention of ids in the context of blank nodes in the explainer document is here:

without depending on the particular set of blank node identifiers used in the original serialization of the input RDF Dataset

While blank nodes do not have identifiers nor labels nor any syntactic form in the abstract RDF data model, I guess they can have "identifiers" in a serialization of an RDF dataset (such as N-Quads). Arguably we're okay since we say that we don't depend on them, but maybe there's cleaner language possible; something like:

without depending on how blank nodes are serialized in the original syntax of the input RDF Dataset

Another option is just to drop the whole "without ..." part, leaving:

Such a canonicalization function can be implemented, in practice, as a procedure that deterministically labels all blank nodes of an RDF Dataset in a one-to-one manner.

In case the issue is more about the language of "labelling" a blank node, we could go one step further and write about producing a labelling of blank nodes; something like:

Such a canonicalization function can be implemented, in practice, as a procedure that outputs a deterministic one-to-one labelling of all blank nodes of an RDF Dataset.

Though I would see that as somehow equivalent and maybe a bit more verbose?
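To make "a deterministic one-to-one labelling of all blank nodes" concrete, here is a minimal brute-force sketch. It is not the algorithm the Working Group would standardize; it simply tries every bijection from blank nodes to fresh labels and keeps the one whose sorted triples compare smallest. The `_:x` convention for blank nodes, the `_:c0`, `_:c1`, ... output labels, and the function names are all assumptions made for illustration, and the approach is only feasible for a handful of blank nodes.

```python
from itertools import permutations


def is_blank(term):
    # Assumption for this sketch: blank nodes are written "_:x", as in N-Triples.
    return term.startswith("_:")


def canonical_labelling(triples):
    """Return (mapping, canonical_triples) for a small collection of triples.

    Brute force: exponential in the number of blank nodes.
    """
    triples = set(triples)  # a graph is a set, so input order is discarded
    bnodes = sorted({t for triple in triples for t in triple if is_blank(t)})
    fresh = [f"_:c{i}" for i in range(len(bnodes))]
    best = None
    for perm in permutations(bnodes):
        mapping = dict(zip(perm, fresh))  # one-to-one by construction
        relabelled = tuple(sorted(
            tuple(mapping.get(t, t) for t in triple) for triple in triples
        ))
        # Keep the labelling whose relabelled triples are lexicographically smallest.
        if best is None or relabelled < best[1]:
            best = (mapping, relabelled)
    return best


# Two inputs that differ only in blank node labels and triple order.
g1 = [("_:a", "ex:p", "_:b"), ("_:b", "ex:p", "ex:o")]
g2 = [("_:y", "ex:p", "ex:o"), ("_:x", "ex:p", "_:y")]
same = canonical_labelling(g1)[1] == canonical_labelling(g2)[1]
```

Because the minimum is taken over all bijections, the canonical triples do not depend on the labels or the order used in the input; the returned mapping is the "pair of a graph and a mapping from its blank nodes to labels" discussed above.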

pfps commented 2 years ago

I would just yank the sentence.

pfps commented 2 years ago

As well, suppose that the function depended on the order of triples in the input document. Is that allowable?

iherman commented 2 years ago

I think the only mention of ids in the context of blank nodes in the explainer document is here:

without depending on the particular set of blank node identifiers used in the original serialization of the input RDF Dataset

While blank nodes do not have identifiers nor labels nor any syntactic form in the abstract RDF data model, I guess they can have "identifiers" in a serialization of an RDF dataset (such as N-Quads).

The Turtle Recommendation systematically uses the term "blank node label". So does the N-Triples Recommendation. The JSON-LD specification uses the term "blank node identifier" and so does RDF/XML. Finally, and somewhat more surprisingly, the RDF Semantics specification also uses the term "blank node identifier".

All this is to say that there is no consistency even among the formal RDF-related specifications. And although I do agree that, when it comes to the formal standard to be defined by the WG, an airtight terminology is necessary, let us not forget that this is an explainer text, whose goal is to explain the underlying concepts to non-RDF experts, primarily AC reps, to help them cast their votes. If we are not careful, we may make the problem area harder to understand, even if the description becomes more precise. In this context, it does not seem to be so problematic to use the term "identifier", or perhaps "label".

pchampin commented 2 years ago

As well, suppose that the function depended on the order of triples in the input document. Is that allowable?

Since the function is defined on an RDF Dataset (abstract syntax), for which the order is not significant, I think it should be clear that it would not be allowed. Granted, it would be even clearer if we spelled it out. Do you think we need to?
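As a small illustration of why order cannot matter once the input is an abstract RDF graph or dataset, the sketch below reads a toy line-based syntax (one whitespace-separated triple per line, ending in " ."; an assumption for this example, not real N-Triples parsing) into a `frozenset`. The function name `to_graph` is hypothetical.

```python
def to_graph(doc):
    """Parse a toy 'one triple per line' document into a set of triples."""
    return frozenset(
        tuple(line.rstrip(" .").split())
        for line in doc.strip().splitlines()
        if line.strip()
    )


doc_a = "ex:s ex:p ex:o .\n_:b0 ex:p ex:s ."
doc_b = "_:b0 ex:p ex:s .\nex:s ex:p ex:o ."

# Both documents yield the same set, so any function defined on the
# set has no way to observe the order in which the lines appeared.
graphs_equal = to_graph(doc_a) == to_graph(doc_b)
```

Note that the blank node label `_:b0` still survives in this toy model, which is exactly the serialization detail the thread argues a canonicalization function should not depend on.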

pfps commented 2 years ago

If there is an explicit exclusion for blank node labels in input documents then I think there needs to be exclusions for other aspects of input documents. The, probably better, alternative would be to require that the only information the algorithm has access to is an RDF dataset. (But that should have been the case all along - bringing in the possibility of utilizing other information is at best misleading.)

pchampin commented 2 years ago

the algorithm has access to is an RDF dataset. (But that should have been the case all along

That was the intention.

bringing in the possibility of utilizing other information is at best misleading.

agreed; therefore I propose to either

iherman commented 2 years ago

agreed; therefore I propose to either

  • yank the end of the sentence ("without depending on..."), or
  • make it more generic, something like "without depending on any feature of the input serialization (blank node labels, order of the triples, etc...)."

I am (mildly) in favor of the second alternative

pchampin commented 2 years ago

@pfps, with PR #103 merged, can we close this issue?