w3c / EasierRDF

Making RDF easy enough for most developers

Lack of standard RDF canonicalization #26

Open dbooth-boston opened 5 years ago

dbooth-boston commented 5 years ago

Canonicalization is the ability to represent RDF in a consistent, predictable serialization. It is essential for diff and digital signatures. Developers expect to be able to diff two files, and source control systems rely on being able to do so. It is easy with most other data representations. Why not RDF? Answer: Blank nodes. Unrestricted blank nodes cause RDF canonicalization to be a "hard problem", equivalent in complexity to the graph isomorphism problem.[6]
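To make the blank node problem concrete, here is a minimal sketch (standard library only) that compares two N-Triples serializations of the same graph line by line. The `http://example.org/` IRIs, the blank node labels, and the helper name `sorted_lines` are made up for illustration.

```python
def sorted_lines(ntriples: str) -> list[str]:
    """Return the non-empty lines of an N-Triples document, sorted."""
    return sorted(line.strip() for line in ntriples.splitlines() if line.strip())


# Two serializations of the same graph; only the blank node labels differ.
graph_a = """
_:b0 <http://example.org/name> "Alice" .
_:b1 <http://example.org/name> "Bob" .
_:b0 <http://example.org/knows> _:b1 .
"""

graph_b = """
_:x <http://example.org/name> "Bob" .
_:y <http://example.org/name> "Alice" .
_:y <http://example.org/knows> _:x .
"""

# A purely textual comparison treats the two documents as different,
# even though the graphs are isomorphic (map _:b0 to _:y and _:b1 to _:x).
same_text = sorted_lines(graph_a) == sorted_lines(graph_b)
print(same_text)
```

Sorting fixes the ordering of the statements but not the labels, so a diff or a hash over these lines reports a difference where the graphs have none.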

IDEA: JSON-LD canonicalization

There has been some good recent progress on canonicalization in JSON-LD: https://json-ld.github.io/normalization/spec/ . However, the current JSON-LD canonicalization draft (called "normalization") focuses only on the digital signatures use case. It needs improvement to better address the diff use case, in which small, localized graph changes should result in small, localized differences in the canonicalized graph.

More discussion and analysis of canonicalization:
https://github.com/w3c/strategy/issues/116
https://github.com/w3c/strategy/issues/116#issuecomment-383875628
https://github.com/w3c/strategy/issues/116#issuecomment-384160630
https://github.com/w3c/strategy/issues/116#issuecomment-395791130
https://github.com/w3c/strategy/issues/116#issuecomment-435920927

IDEA: RDF canonicalization

http://aidanhogan.com/docs/skolems_blank_nodes_www.pdf
http://aidanhogan.com/docs/rdf-canonicalisation.pdf
https://github.com/iherman/canonical_rdf
https://lists.w3.org/Archives/Public/www-archive/2018Oct/0011.html

chiarcos commented 3 years ago

This involves two problems: blank nodes and order. If the first can be solved, what works for diff in practice is to transform your data to N-Triples, sort it, serialize it to Turtle (just for readability), and then diff the Turtle versions. This is not necessarily convenient and maybe not intuitive, and there may be better tooling, but it does not require any of the languages involved to change. There may be cleverer ways, in particular ways that also support streams and not just data dumps, but with this in mind, this issue can IMHO be closed and further discussed under issue #19. The signature aspect is addressed by JSON-LD canonicalization.
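The sort-and-diff part of that workflow can be sketched with the Python standard library alone. This is only a sketch under stated assumptions: the input is already N-Triples with one statement per line, the serializer writes whitespace and literals consistently, and there are no blank nodes (or their labels are stable). It skips the optional Turtle step, and the helper names `canonical_lines` and `diff_ntriples` as well as the `http://example.org/` IRIs are made up for illustration.

```python
import difflib


def canonical_lines(ntriples: str) -> list[str]:
    """Sort and de-duplicate N-Triples lines, skipping blank lines and comments.

    Assumption: no blank nodes, or blank node labels that are already stable.
    """
    lines = {line.strip() for line in ntriples.splitlines()}
    return sorted(line for line in lines if line and not line.startswith("#"))


def diff_ntriples(old: str, new: str) -> list[str]:
    """Return a unified diff (no context lines) between two N-Triples documents."""
    return list(
        difflib.unified_diff(
            canonical_lines(old),
            canonical_lines(new),
            fromfile="old.nt",
            tofile="new.nt",
            lineterm="",
            n=0,
        )
    )


old_nt = """
<http://example.org/bob> <http://example.org/name> "Bob" .
<http://example.org/alice> <http://example.org/name> "Alice" .
<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> .
"""

# Same statements in a different order, with one literal changed.
new_nt = """
<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> .
<http://example.org/bob> <http://example.org/name> "Robert" .
<http://example.org/alice> <http://example.org/name> "Alice" .
"""

for line in diff_ntriples(old_nt, new_nt):
    print(line)
```

Because both inputs are sorted first, the reordering disappears and only the changed statement shows up in the diff. This needs the whole file in memory, so it does not cover the streaming case.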

dbooth-boston commented 3 years ago

@chiarcos, I agree that if the blank node problem is solved then RDF canonicalization will be easy. But I think it is worth keeping this issue open for two reasons: 1. someone noting the lack of RDF canonicalization will not necessarily know to look at the blank node issue; and 2. even when the blank node issue is solved, a canonicalization standard still needs to be defined.

Also, note that although JSON-LD canonicalization is an excellent step in the right direction, the original algorithm did not address the diff use case, in which a small change to the source graph should yield only a small change in the resulting canonicalization. Discussion about canonical RDF suggests that changes to the algorithm were being considered, but as of this writing I do not know whether the proposed algorithm has been upgraded to address the diff use case.
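To illustrate why a signature-oriented scheme can be unfriendly to diff, here is a deliberately naive sketch. It is not the algorithm from the JSON-LD draft: it labels each blank node with a hash of the statements that mention it, in a single pass. The names `naive_label` and `naive_canonicalize`, the `_:self` / `_:other` placeholders, and the example data are all made up for illustration.

```python
import hashlib

Triple = tuple[str, str, str]


def is_bnode(term: str) -> bool:
    return term.startswith("_:")


def naive_label(node: str, graph: list[Triple]) -> str:
    """Derive a label for a blank node from a hash of the triples that mention it."""
    described = sorted(
        " ".join(
            "_:self" if term == node else "_:other" if is_bnode(term) else term
            for term in triple
        )
        for triple in graph
        if node in triple
    )
    digest = hashlib.sha256("\n".join(described).encode("utf-8")).hexdigest()
    return "_:h" + digest[:8]


def naive_canonicalize(graph: list[Triple]) -> list[str]:
    """Relabel blank nodes by content hash, then emit sorted N-Triples-like lines.

    Limitation: blank nodes with identical descriptions get the same label.
    """
    bnodes = {term for triple in graph for term in triple if is_bnode(term)}
    labels = {bnode: naive_label(bnode, graph) for bnode in bnodes}
    return sorted(
        " ".join(labels.get(term, term) for term in triple) + " ." for triple in graph
    )


before = [
    ("_:b0", "<http://example.org/name>", '"Alice"'),
    ("_:b0", "<http://example.org/age>", '"30"'),
    ("_:b0", "<http://example.org/knows>", "<http://example.org/bob>"),
]
# Exactly one statement changes: the age literal.
after = [
    ("_:b0", "<http://example.org/name>", '"Alice"'),
    ("_:b0", "<http://example.org/age>", '"31"'),
    ("_:b0", "<http://example.org/knows>", "<http://example.org/bob>"),
]

before_lines = naive_canonicalize(before)
after_lines = naive_canonicalize(after)
unchanged = set(before_lines) & set(after_lines)
print(len(before_lines), len(unchanged))
```

The output is independent of the original blank node labels, which is what a signature needs. But because the label is derived from everything said about the node, editing one statement changes the label, and every line that mentions the node is expected to differ in the canonical form. A diff-friendly canonicalization would have to keep labels stable under such small edits.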