w3c / rch-wg-charter

Charter proposal for an “RDF Dataset Canonicalization and Hash Working Group”
https://w3c.github.io/rch-wg-charter/

Is canonicalization single? #45

Closed iherman closed 3 years ago

iherman commented 3 years ago

(Originally raised by @samuelweiler in https://github.com/w3c/strategy/issues/262#issuecomment-822696701; moved here with permission.)

When I think of canonicalization for signing, a key property of the canonicalized form is that there is a single such form - the function always gives the same result. When I look at the charter explainer, that property isn't clear. Am I just not understanding those words? Should that language be a little clearer?

iherman commented 3 years ago

You are of course right: canonicalization is a single form.

But §1 of the explainer defines:

RDF Dataset Canonicalization is a function C that maps an RDF Dataset to an RDF Dataset in a manner that…

i.e., it is defined as a function.

Of course, there may be several such functions, and the goal of the standard is to define either a single function or, most probably, a family of functions parametrized by the choice of the hashing function used internally.

Can you propose a change to the explainer text that makes this clearer?

samuelweiler commented 3 years ago

Perhaps the best word here is "deterministic" - the output from a given canonicalization function is deterministic. What might be clearer is an explanation of what sorts of RDF differences you want to have the same canonical form. That might be encoded in the definitions (as you start to quote above), but to me, as a newcomer to RDF, I'm not sure what is meant. A plain example might help? e.g. "here are two RDF datasets that look different that we want to consider as equivalent".

e.g. some canonicalization systems do case folding, so that "World Wide Web Consortium" and "world wide web consortium" canonicalize to the same thing. What differences are we trying to canonicalize away here?

[ https://github.com/w3c/lds-wg-charter/issues/52 is digging into this topic...]

iherman commented 3 years ago

@samuelweiler I can give you an overly simple example here:

_:A <http://example.com/prop1> "Some literal".

and

_:xyz <http://example.com/prop1> "Some literal".

and

_:ABCEFG <http://example.com/prop1> "Some literal".

are all isomorphic RDF graphs, and there is an infinite number of them: the subject, in all three, is a blank node, and the label can be just about any string. Canonicalization deterministically calculates a blank node label, starting from any of the forms above, yielding, say, '_:c1234', i.e., generating the graph

_:c1234 <http://example.com/prop1> "Some literal".

which would be considered the canonical form of those isomorphic graphs.
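
To make the idea concrete, here is a toy sketch in Python. It is not the algorithm the Working Group would standardize. It assumes that every blank node can be told apart just by the statements that mention it, and that no literal contains text that looks like a blank node label. The function name `canonicalize`, the `hash_name` parameter, and the `_:c14n` label prefix are all illustrative choices, not taken from the explainer. The `hash_name` parameter is only there to show how a family of such functions could be parametrized by a hashing function.

```python
import hashlib
import re

# Hypothetical, simplified pattern for blank node labels such as _:A or _:xyz.
BLANK = re.compile(r"_:[A-Za-z0-9]+")


def canonicalize(statements: list[str], hash_name: str = "sha256") -> list[str]:
    """Toy canonicalizer: replace blank node labels with deterministic ones."""
    labels = sorted({m for line in statements for m in BLANK.findall(line)})

    def node_hash(label: str) -> str:
        # Hash the statements that mention this node, masking its own
        # label as _:a and every other blank node label as _:z, so the
        # result does not depend on the labels used in the input.
        related = sorted(
            BLANK.sub(lambda m: "_:a" if m.group(0) == label else "_:z", line)
            for line in statements
            if label in BLANK.findall(line)
        )
        data = "\n".join(related).encode("utf-8")
        return hashlib.new(hash_name, data).hexdigest()

    hashes = {label: node_hash(label) for label in labels}
    if len(set(hashes.values())) != len(hashes):
        # A real algorithm needs further steps here; this sketch gives up.
        raise ValueError("toy sketch cannot distinguish these blank nodes")

    # Number the blank nodes in hash order and rewrite every statement.
    ordered = sorted(labels, key=lambda label: hashes[label])
    mapping = {old: f"_:c14n{i}" for i, old in enumerate(ordered)}
    return sorted(
        BLANK.sub(lambda m: mapping[m.group(0)], line) for line in statements
    )


if __name__ == "__main__":
    for subject in ("_:A", "_:xyz", "_:ABCEFG"):
        print(canonicalize([f'{subject} <http://example.com/prop1> "Some literal".']))
```

All three inputs produce the same single statement, with the subject relabelled to `_:c14n0`. Graphs in which several blank nodes look alike from their immediate statements are the hard part of the problem, and this sketch deliberately refuses to handle them.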

However, I am a bit uneasy going into the explanation in the charter explainer. I do not think this is the place. There is a reference to Aidan's paper in the explainer document that gives a good introductory explanation of the problem...

samuelweiler commented 3 years ago

However, I am a bit uneasy going into the explanation in the charter explainer. I do not think this is the place.

Fine. I defer to your judgement.