w3c / rch-wg-charter

Charter proposal for an “RDF Dataset Canonicalization and Hash Working Group”
https://w3c.github.io/rch-wg-charter/

Anticipate a future RDF specification #25

Closed gkellogg closed 3 years ago

gkellogg commented 3 years ago

It seems likely there may be a new RDF and/or SPARQL working group chartered near the time that canonicalization is chartered. It would be useful for the specification to be defined in such a way that it anticipates a range of future updates without requiring the canonicalization specification itself to be revisited.

For example, there is a fair amount of work behind RDF-star right now, and it would likely be considered by future RDF and/or SPARQL working groups. The language for canonicalization could be written to allow a triple to have more than two blank nodes by not over-specifying that blank nodes are constrained to be just the subject or object of a triple. In my experience, this approach has worked with the RDF 1.1 description of dataset isomorphism, allowing the results of RDF-star dataset evaluation tests to be compared.

The principle could be to use language that specifically allows for extension by other specifications, rather than being overly prescriptive. Describing the boundaries of such extension without invalidating the mathematics of the canonicalization algorithm could be challenging.

gkellogg commented 3 years ago

I do see this reflected in the proposed charter as a liaison relationship with RDF-DEV, but I wanted to raise a more specific use case to serve as a concrete goal of the work.

gkellogg commented 3 years ago

Also, note that the case of RDF-star, where a triple becomes a first-class resource usable within a triple, is not unique. Notation-3 allows Formulae/Quoted Graphs as resources, has a first-class Collection resource not based on a blank-node ladder, and adds Universal Variables. It would be desirable if the canonicalization algorithm could be extensible to suit such other cases.

The general trend has been to allow more resource types in more positions. RDF 1.1 defines a Generalized Dataset as allowing an IRI, Blank Node or Literal in any position.

iherman commented 3 years ago

I am personally fine with considering generalized RDF graphs/datasets as a possible target for canonicalization; I do not expect they would create an algorithmic problem (@dlongley, @aidhog, am I right on this?). But we have to be careful with this: we cannot commit to covering all future changes.

RDF-star is relatively easy after all: based on the current semantics in the paper, it is possible to map an RDF-star instance to a good-old RDF graph via the "unstar" approach. By defining that "unstar" deterministically we can say that the hash of RDF-star is the hash of this deterministic unstar version, and we are done. If things cannot be mapped onto a (generalized) RDF then all bets are off I believe...

However, at this point the question is whether something must be changed in the charter text itself. The only place I can see is the liaison with RDF-DEV; I indeed do not believe it is possible to add such a requirement to the normative requirements of the charter. I could, for example, add a reference to generalized RDF in that line as a further example. Would that be enough at this point?

iherman commented 3 years ago

@gkellogg I have prepared #26 to address this issue in the charter.

dlongley commented 3 years ago

@iherman,

I do not expect they would create an algorithmic problem...

No, I don't think it would create a problem, but supporting "generalized RDF" would require a few changes to allow for blank nodes in the predicate position.

gkellogg commented 3 years ago

RDF-star is relatively easy after all: based on the current semantics in the paper, it is possible to map an RDF-star instance to a good-old RDF graph via the "unstar" approach. By defining that "unstar" deterministically we can say that the hash of RDF-star is the hash of this deterministic unstar version, and we are done. If things cannot be mapped onto a (generalized) RDF then all bets are off I believe...

Not necessarily. Although there is not yet a normative "unstar" operation defined, you might imagine that one way would be to transform each embedded triple into a triple within a graph named by a blank node, and use that graph name in place of the embedded triple (repeating recursively):

Thus << :a :b :c >> :d :e . becomes _:g :d :e . :a :b :c _:g . in N-Quads. But this means that two distinct datasets would map to the same dataset, which doesn't seem right.

(Of course, another approach might be to simply use the RDF reification vocabulary, but it has similar potential for confusion.)
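Purely for illustration, a minimal sketch of that named-graph mapping might look like the following (plain Python, no RDF library; the tuple representation, the unstar function, and the _:gN labels are all assumptions of this sketch, not anything defined by a specification):

```python
import itertools

# Triples are plain tuples (s, p, o); an embedded (quoted) triple is itself a
# tuple appearing in the subject or object position. Blank nodes and the
# minted graph labels are strings starting with "_:".
_labels = itertools.count(1)

def unstar(triple, quads, graph=""):
    """Rewrite embedded triples as triples inside a blank-node-named graph,
    using the minted graph name in place of the embedded triple, and append
    the resulting (s, p, o, graph) quad to `quads`."""
    s, p, o = triple
    if isinstance(s, tuple):
        g = f"_:g{next(_labels)}"
        unstar(s, quads, g)   # the embedded triple goes into graph g
        s = g                 # ...and g stands in for it here
    if isinstance(o, tuple):
        g = f"_:g{next(_labels)}"
        unstar(o, quads, g)
        o = g
    quads.append((s, p, o, graph))

quads = []
unstar(((":a", ":b", ":c"), ":d", ":e"), quads)   # << :a :b :c >> :d :e .
for q in quads:
    print(" ".join(t for t in q if t), ".")
# :a :b :c _:g1 .
# _:g1 :d :e .
```

The collision noted above follows directly: a dataset that already contains that blank-node-named graph natively would serialize to the same quads as the unstarred RDF-star dataset.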

A property of an RDF-star triple such as << _:a :b _:c >> :d _:c . is that you can extract its blank nodes via a recursive operation, which is a more general approach than just taking them from the subject or object positions. If the wording of the spec read more like "extract the blank nodes from the statement", with a definition of what that means for base RDF and Generalized RDF, then a spec such as RDF-star could define what it means in its own context, so that the underlying canonicalization spec becomes extensible. Of course, this may not cover all potential future cases, but it could provide a form of future proofing that would allow future changes to continue to operate.
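As a rough illustration of what "extract the blank nodes from the statement" could mean when statements may nest (again a sketch in plain Python with tuple-encoded triples; the function names are purely illustrative):

```python
def blank_nodes(term):
    """Recursively collect the blank nodes of a term; an embedded (quoted)
    triple is itself represented as a tuple of terms."""
    if isinstance(term, tuple):          # an embedded triple: recurse into it
        return {b for part in term for b in blank_nodes(part)}
    if isinstance(term, str) and term.startswith("_:"):
        return {term}
    return set()

def statement_blank_nodes(statement):
    """Blank nodes of a statement: the union over all of its positions.
    For plain RDF this reduces to checking the subject and object;
    nested tuples cover the RDF-star case without further changes."""
    return blank_nodes(statement)

# << _:a :b _:c >> :d _:c .
print(statement_blank_nodes((("_:a", ":b", "_:c"), ":d", "_:c")))
# -> {'_:a', '_:c'}
```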

Just a thought for consideration.

@gkellogg I have prepared #26 to address this issue in the charter.

I think this wording certainly gives space to respond to these kinds of concerns, and I can't think of anything else that would need to be in the charter to address it. But consider it when the work gets properly under way.

aidhog commented 3 years ago

Agreed that it would not be a major issue algorithmically, but might result in something less concise or less performant (the unavoidable cost of being more general).

Option one is to just represent whatever structure you have as an RDF graph or dataset in a canonical way (which might require some reserved vocabulary to distinguish the input structure from the output RDF graph/dataset as two different things).

Option two is to generalise the algorithm to work with arbitrary n-ary tuples that may contain blank nodes in arbitrary positions.
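To give a feel for option two, here is a very rough sketch loosely modelled on the first-degree hashing step of the existing canonicalization algorithms; the function name, label substitution, and serialization are illustrative assumptions only, and the point is just that tuple arity and blank-node position need not matter to such a step:

```python
import hashlib

def first_degree_hash(bnode, tuples):
    """Hash the tuples mentioning a given blank node, after replacing that
    node with "_:a" and every other blank node with "_:z", then sorting the
    serialized lines (a simplification of the usual first-degree hash step)."""
    lines = []
    for t in tuples:
        if bnode not in t:
            continue
        subst = tuple(
            "_:a" if term == bnode
            else "_:z" if isinstance(term, str) and term.startswith("_:")
            else term
            for term in t
        )
        lines.append(" ".join(subst) + " .")
    return hashlib.sha256("\n".join(sorted(lines)).encode()).hexdigest()

# Works the same for triples, quads, or any other arity,
# with blank nodes in any position (including the predicate).
data = [
    ("_:b0", ":p", "_:b1"),
    ("_:b1", "_:p2", ":o", ":g"),
]
print(first_degree_hash("_:b1", data)[:16])
```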

iherman commented 3 years ago

A property of an RDF-star triple such as << _:a :b _:c >> :d _:c . is that you can extract its blank nodes via a recursive operation, which is a more general approach than just taking them from the subject or object positions. If the wording of the spec read more like "extract the blank nodes from the statement", with a definition of what that means for base RDF and Generalized RDF, then a spec such as RDF-star could define what it means in its own context, so that the underlying canonicalization spec becomes extensible. Of course, this may not cover all potential future cases, but it could provide a form of future proofing that would allow future changes to continue to operate.

I would be a bit worried: this would require a fairly significant change to the way the current algorithms work (which very much rely on the 'flat' structure of the graph). What you propose may require some sort of recursive approach, and I am not sure how well that would work (the algorithm is already fairly complex as is; adding this recursion, even if possible, may quickly become very difficult to manage). Using the 'unstar' approach (whatever the details of 'unstar' turn out to be) makes it really straightforward for signatures.

Extending the algorithm to generalized graphs is a different matter. That might be manageable.

But we are running way ahead of ourselves!