w3c / rch-wg-charter

Charter proposal for an “RDF Dataset Canonicalization and Hash Working Group”
https://w3c.github.io/rch-wg-charter/

Explain why signing *any* serialization is not sufficient? #2

Closed pchampin closed 3 years ago

pchampin commented 3 years ago

Disclaimer: I am not a security or cryptography expert, so this may be a stupid question.

The first paragraph of §1 Scope gives examples of why signing RDF datasets is useful. However, one could imagine a poor man's solution where an arbitrary serialization of the dataset is signed, and that serialization is then used whenever the dataset needs to be exchanged.

I am sure that there are cases where this wouldn't work, and I can vaguely imagine them. But I believe that they should be briefly spelled out in the charter's intro, to justify that there is an actual need beyond the poor man's solution above (if only for the sake of philistines like me ;).

iherman commented 3 years ago

(I am actually working on a branch where the background information is spun off into a separate document. This allows us to give more info, if necessary, but keep the charter succinct.)

The problem is, as usual, with blank nodes. Your approach would work if that serialized version of the dataset were cast in concrete, but that is not necessarily the case. At the minimum, if I transfer a JSON-LD version from me to you, by the time you display it in your editor some extra spaces may sneak in, or may disappear, and then the signature goes wrong. Even on that level, some type of JSON canonicalization becomes necessary. But if I decide to flatten the same JSON-LD serialization of the graph, or turn it into, say, Turtle, then all hell breaks loose, because blank nodes may get arbitrary identifiers, and things definitely go wrong.
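
(A minimal sketch of that fragility, assuming nothing beyond Python's standard hashlib; the triple and the blank node labels are made up for illustration:)

```python
# Two Turtle-style serializations of the *same* one-triple graph, differing
# only in whitespace and in the arbitrary blank node label. They hash to
# different values, so a signature over one byte stream cannot verify the
# other, even though the graphs are isomorphic.
import hashlib

serialization_1 = '_:b0 <http://xmlns.com/foaf/0.1/name> "Alice" .'
serialization_2 = '_:x  <http://xmlns.com/foaf/0.1/name> "Alice" .'

h1 = hashlib.sha256(serialization_1.encode()).hexdigest()
h2 = hashlib.sha256(serialization_2.encode()).hexdigest()
assert h1 != h2  # isomorphic graphs, different bytes, different hashes
```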

msporny commented 3 years ago

then all hell breaks loose, because blank nodes may get arbitrary identifiers, and things definitely go wrong.

Yes, what Ivan said. In addition to that...

I am sure that there are cases where this wouldn't work, and I can vaguely imagine them. But I believe that they should be briefly spelled out in the charter's intro, to justify that their is an actual need beyond the poor's man solution above (if only for the sake of philistines like me ;).

The issue is that there are many ways to express the same information using a graph. Your entry point into the graph determines how you serialize, and even if you can serialize based on some simple metric (like sorting all URLs/Triples/Quads in alphabetical order), that metric won't work once you introduce blank nodes (because how you name things also depends on your entry point into the graph).
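
(A sketch of that point, with made-up example.org triples and no particular library: sorting looks like a canonicalization until blank nodes appear, because the arbitrary labels dominate the sort order.)

```python
# Two N-Triples documents describing isomorphic graphs (swap _:a and _:b),
# yet lexicographic sorting yields different lists, so "sort and sign"
# breaks as soon as blank nodes are involved.
graph_a = sorted([
    '_:a <http://example.org/p> "x" .',
    '_:b <http://example.org/p> "y" .',
])
graph_b = sorted([
    '_:b <http://example.org/p> "x" .',  # same graph, labels swapped
    '_:a <http://example.org/p> "y" .',
])
assert graph_a != graph_b  # sorted forms differ although the graphs match
```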

... and the problem really goes beyond RDF blank nodes. JSON objects have the same problem: even if you were to express a set of information using JSON, you would still need a canonicalization algorithm (because every JSON object is effectively an unlabeled node in a graph). All that to say, this is a fairly well-studied area of computer science, and the problem isn't as simple as it may seem on the surface. History is riddled with half-baked solutions. What we're talking about standardizing is the first general-purpose solution to the problem, one whose proofs have been vetted by mathematicians, so we can be fairly certain that if we standardize this thing, we'll be solving the problem once and for all. :)
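
(A quick standard-library sketch of the JSON side of this; JCS, RFC 8785, is a real-world instance of the key-order canonicalization idea:)

```python
# Same JSON information, different key order: the raw bytes differ, so a
# naive signature over the text breaks. Key-sorted re-serialization (the
# idea behind JSON canonicalization schemes such as JCS, RFC 8785) repairs
# the key-order problem, but says nothing about graph-level equivalence.
import hashlib
import json

doc1 = '{"name": "Alice", "age": 30}'
doc2 = '{"age": 30, "name": "Alice"}'
assert hashlib.sha256(doc1.encode()).digest() != hashlib.sha256(doc2.encode()).digest()

canon1 = json.dumps(json.loads(doc1), sort_keys=True, separators=(',', ':'))
canon2 = json.dumps(json.loads(doc2), sort_keys=True, separators=(',', ':'))
assert canon1 == canon2  # identical after key-order canonicalization
```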

So, to answer the original issue question: "Why is signing any serialization not sufficient?" -- it's because you need to be able to tell whether the information in YOUR serialization is the same as in some OTHER serialization, and you can't do that without a canonicalization algorithm. That is the purpose of a canonicalization algorithm: to determine whether the information expressed by YOUR serialization is the same as that expressed by some OTHER serialization.

pchampin commented 3 years ago

Thanks for your responses. Let me clarify my question.

The poor man's solution I sketched above is clearly not very practical, for all the reasons pointed out by Ivan (and more). But in an "ideal world", it could work (if tools didn't tamper with formatting, if everyone used the same syntax for RDF, if...). I understand that, as we don't live in an ideal world, there is reason enough to not apply that solution.

But I was wondering if there were more fundamental arguments against it, i.e. use-cases where it would be theoretically impossible to always use the SAME serialization? Are there for example realistic scenarios where two different signers may independently construct and sign the same dataset with the same private key? Or cases where someone only receives the signature, without any serialization, and has to reconstruct the dataset from other information, before checking the signature? That's where my knowledge of cryptography principles and use-cases falls short...

you need to be able to tell if YOUR serialization is the same as some OTHER serialization, and you can't do that without a canonicalization algorithm.

I don't believe that the last part of this sentence is accurate. Checking that two graphs (or datasets) are isomorphic (i.e. "the same") can be done, I think, without building a canonicalized version of them.

But of course, having such a canonicalization algorithm provides you with an easy way to do that, and much more. :)

Again, I am not for a second questioning the value of this work. Just trying to better understand what's at stake, and to make the arguments in the intro more compelling.

iherman commented 3 years ago

I don't believe that the last part of this sentence is accurate. Checking that two graphs (or datasets) are isomorphic (i.e. "the same") can be done, I think, without building a canonicalized version of them.

It may be my turn to be ignorant, but: checking that two graphs A and B are isomorphic means that there is a deterministic algorithm to re-label all the bnodes of graph A to the bnodes of graph B without destroying the topology. I am afraid that, mathematically, this is the same problem as defining a canonical version of a graph.

pchampin commented 3 years ago

In order to check isomorphism, you only need to find a "working" mapping between the bnodes of A and the bnodes of B. You do not need to mint deterministic identifiers for the bnodes.

Here is a naive algo for checking isomorphism btw A and B:

  • for all possible mapping M from bnodes(A) to bnodes(B)

    • if M(A) = B then return true
  • return false

None of the M(A) is a canonical representation of A.
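
(For concreteness, a runnable version of this naive algorithm -- a sketch under simplifying assumptions: a graph is a set of N-Triples-style strings, and `_:`-prefixed tokens occur only as blank node labels.)

```python
# Brute-force isomorphism check: try every bijection M from bnodes(A) to
# bnodes(B) and test whether M(A) = B as plain set equality.
import re
from itertools import permutations

BNODE = re.compile(r'_:\w+')

def bnodes(graph):
    """All blank node labels occurring in a graph (a set of triple strings)."""
    return sorted({label for triple in graph for label in BNODE.findall(triple)})

def naive_isomorphic(a, b):
    nodes_a, nodes_b = bnodes(a), bnodes(b)
    if len(nodes_a) != len(nodes_b) or len(a) != len(b):
        return False
    for candidate in permutations(nodes_b):        # all possible mappings M
        m = dict(zip(nodes_a, candidate))
        m_of_a = {BNODE.sub(lambda x: m[x.group()], t) for t in a}
        if m_of_a == set(b):                       # M(A) = B, set equality
            return True
    return False

# Isomorphic despite different labels:
assert naive_isomorphic({'_:a <http://example.org/knows> _:b .'},
                        {'_:y <http://example.org/knows> _:x .'})
```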

msporny commented 3 years ago
  • if M(A) = B then return true

What equivalence function are you using to do the comparison above? :)

"Naming a bnode" isn't merely about picking a name. You have to be able to uniquely identify that node from all other nodes, no matter where you start your traversal in the graph. That is at the heart of any graph canonicalization algorithm -- being able to tell one node from another node, using not only the nodes attributes, but their connectivity as well.

iherman commented 3 years ago
  • for all possible mapping M from bnodes(A) to bnodes(B)

    • if M(A) = B then return true
  • return false

Yeah, that may work (where '=' is simply a set equivalence), but for more complex graphs the number of possible mappings becomes pretty high, so I do not expect that to be a feasible algorithm. Much as the existing canonicalization algorithms are mathematically complex, I expect that in most cases (where the algorithm finishes quickly) it is way more effective to use a canonicalization function to establish the equivalence.
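
(The blowup in numbers: n blank nodes admit n! candidate mappings.)

```python
# Why the naive check is hopeless beyond toy sizes: factorial growth in
# the number of blank nodes.
import math
for n in (5, 10, 20):
    print(n, math.factorial(n))  # 120; 3628800; 2432902008176640000
```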

But we diverge. The issue is not whether canonicalization is necessary for graph equivalence; that is only a side effect.

Are there for example realistic scenarios where two different signers may independently construct and sign the same dataset with the same private key?

All crypto is based on the assumption that different signers have different private keys. But I must admit I do not understand the question...

Or cases where someone only receives the signature, without any serialization, and has to reconstruct the dataset from other information, before checking the signature?

No, that is not a realistic use case either. Actually, it is impossible: signatures usually depend on some hashing done somewhere along the line, which makes this type of reconstruction (intentionally) impossible.

But what is a realistic scenario is that the sender stores the dataset in some graph database and the receiver retrieves the graph from that database. Or a tool is used to change the serialization: e.g., Manu always thinks in terms of JSON-LD, I prefer Turtle. I.e., there is an intermediate which transforms the graph by adding different BNode labels (which is of course perfectly ok!). In which case only a signature based on the canonical version of that graph is usable.

It is correct that, if I take a Verifiable Credential and regard it as pure JSON text that has some funny-looking key in it that starts with a @ and ends with ext, then we could live with all this and rely on some JSON canonicalization (or even work without it and rely on the JSON text being immutable). But that would be brittle. And (this is an essential point of this WG) the canonicalization+signature stuff is not only for Verifiable Credentials.

pchampin commented 3 years ago

But we diverge.

We do :) What I take home from this discussion is that the main reason for avoiding the poor man's solution is that it is too brittle in real life. That's good enough for me, so I'll close this issue on that. Thanks to both of you for taking the time to answer my question.


Now, for the pleasure of the argument ;)

@msporny

What equivalence function are you using to do the comparison M(A) = B

Since M(A) and B are using the same (local) identifiers for blank nodes (those from B), we can simply sort their respective quads in lexicographic order, and compare the two lists. Of course, if M(A) ≠ B, it does not mean that A and B are not isomorphic -- it may just be that we picked the wrong mapping. Hence the loop on all possible mappings.
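
(As a sketch, with quads represented as strings that already share B's blank node labels, the comparison is just:)

```python
def m_of_a_equals_b(m_of_a, b):
    # Both sides use B's blank node labels, so lexicographically sorting
    # the quad strings and comparing the lists is ordinary list equality.
    return sorted(m_of_a) == sorted(b)
```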

@iherman

I do not expect that to be a feasible algorithm

No, of course not. That's why I called this algorithm "naive" ;-) I was just pointing out that the two problems are not equivalent...