w3c / rdf-canon

RDF Dataset Canonicalization (deliverable of the RCH working group)
https://w3c.github.io/rdf-canon/spec/

Compare the two algorithms, and decide on basis for our work #6

peacekeeper closed this issue 1 year ago

peacekeeper commented 2 years ago

At TPAC 2022, the RCH WG had a joint meeting with the VC WG, see minutes.

During that meeting, we had two presentations about canonicalization algorithms: one on URDNA2015 and one on the algorithm presented by Aidan Hogan.

The first algorithm has some track record of incubation in the W3C CCG, and it has an existing specification.

At today's RCH WG meeting, there was a proposal to use the W3C CCG specification as a basis, but on the other hand there was also a desire to compare the two algorithms to better understand how they differ, and why one might be used over the other.

So this issue is an invitation to anyone who can tell us more about the differences between the two algorithms and their potential implications.

dlongley commented 2 years ago

My understanding of the two algorithms is that both of them are quite similar in that they use a "two step" process (similar to what Ivan described on the last call):

  1. Examine all of the immediately adjacent data to each blank node in order to label the "easy to label" blank nodes (see the sketch after this list).
  2. For the remaining "potentially hard to label" blank nodes, perform a more complex process of examining more data until all of the blank nodes are eventually labeled.
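
To make that shared intuition concrete, here is a minimal Python sketch of the first step, not the normative text of either algorithm: hash each blank node's immediately adjacent quads, with the focus blank node and any other blank node replaced by fixed markers, and treat a blank node as "easy" when its hash is unique in the dataset. The quad representation (4-tuples of strings with blank nodes written "_:x") and the helper names are assumptions made for illustration; the "_:a"/"_:z" markers loosely follow URDNA2015's first-degree convention.

import hashlib

def first_degree_hash(bnode, quads):
    def term(t):
        if t == bnode:
            return "_:a"      # marker for the blank node being hashed
        if t.startswith("_:"):
            return "_:z"      # marker for any other blank node
        return t
    lines = sorted(" ".join(term(t) for t in q) for q in quads if bnode in q)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

def label_easy_bnodes(quads):
    bnodes = {t for q in quads for t in q if t.startswith("_:")}
    hashes = {b: first_degree_hash(b, quads) for b in bnodes}
    # a blank node is "easy" if its first-degree hash is unique in the dataset
    counts = {}
    for h in hashes.values():
        counts[h] = counts.get(h, 0) + 1
    return {b: h for b, h in hashes.items() if counts[h] == 1}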

I expect that both algorithms differ in mostly trivial ways for the first step, but produce different output based on different choices. Some of these choices may, for example, include the hash function. The URDNA2015 choice of SHA-256 as a hash function was based on availability in hardware and Web browser APIs, and on normative referenceability (i.e., it was chosen with developers and W3C spec authors in mind). My understanding is that the algorithm Aidan presented tested a number of different swappable hash algorithms, mostly looking at performance outcomes, irrespective of developer / W3C spec-authoring considerations.
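
Purely as an illustration of that "swappable hash" point (the function and names below are not from either specification), the digest used in a sketch like the one above is just a parameter:

import hashlib

def digest(data, hash_name="sha256"):
    # hash_name can be any algorithm supported by hashlib, e.g. "sha256", "sha384"
    return hashlib.new(hash_name, data).hexdigest()

# digest(b"same input", "sha256") and digest(b"same input", "sha384") are both
# deterministic, but yield incompatible canonical hashes for the same dataset.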

I actually also expect that both algorithms are not that different in the second step.

Based on the explanation of the algorithm in Aidan's presentation, it seems that each of the "potentially hard to label" blank nodes is passed through a process that involves recursively visiting every neighboring blank node, assigning a provisional label (or "color") to each as they are visited. These labels (and the other data such as predicates, etc.) are used to produce a hash and ordering, where the ordering of different components is externally defined (chosen) ahead of time. The "shortest" ordering is chosen and canonical labels are assigned using that ordering.

If I have understood this correctly, then the second step of URDNA2015 is, again, nearly identical.

In URDNA2015, each "potentially hard to label" blank node is passed through a process that involves recursively visiting every neighboring blank node, assigning a temporary label to each, and keeping track of these mappings for every permutation. As the graph is walked, "paths" are created out of the "mentions" of any other blank nodes, where a "path" is just a serialization of the temporary label (or final canonical label, if a given blank node has already been assigned one) and the predicates and directions used in the walk.

These paths are lexicographically compared to find the "shortest" path. Note that this should be roughly equivalent to what happens in the other algorithm, i.e., an order is defined; here it is just lexicographical ordering of the serialized path. During processing, any candidate that would produce a lexicographically longer path is dropped / can be short-circuited, because it will not be chosen.

Once the lexicographically shortest path is found, it is hashed to produce a new temporary label for the blank node that is being processed. The temporary label mapping used to produce the chosen path is also saved, because it can be used to assign canonical labels in case the process produces the same hash for multiple blank nodes. The hashes are lexicographically sorted and the blank nodes are then labeled in that order.
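
If it helps to see the shape of this in code, the following is a highly simplified Python sketch of the "lexicographically shortest path" idea, not the URDNA2015 N-degree step itself; hash_related, issued_labels, and the path serialization are stand-ins made up for illustration:

import hashlib
from itertools import permutations

def shortest_path_sketch(focus, related_bnodes, hash_related, issued_labels):
    best_path, best_mapping = None, None
    for perm in permutations(related_bnodes):
        path, mapping = "", {}
        for b in perm:
            # use the canonical label if one has already been issued,
            # otherwise issue a temporary label for this permutation
            label = issued_labels.get(b) or mapping.setdefault(b, "_:b%d" % len(mapping))
            path += label + hash_related(focus, b)
            # short-circuit: once a partial path is lexicographically greater
            # than the best complete path, no extension of it can be chosen
            if best_path is not None and path > best_path:
                break
        else:
            if best_path is None or path < best_path:
                best_path, best_mapping = path, mapping
    # the chosen path is hashed to give the focus node a new temporary label,
    # and the label mapping that produced it is kept for later canonical naming
    return hashlib.sha256(best_path.encode("utf-8")).hexdigest(), best_mapping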

Again, if I have not misunderstood the two algorithms, they seem to be very similar in both steps, but slightly differ in ways that produce different output. It also seems to be the case that if there are more significant differences, it would be in the more complex second step.

I think it's important to note that the more complex second step is unlikely to be run for the vast majority of usable real world data -- or perhaps run with a single recursion for some small set of real world data. I would expect it to only run longer for graphs that are probably "poison graphs", i.e., they are not actually usable but instead are designed to make the algorithms run in worst-case time. My suggestion for our spec, regardless of the algorithm used, is to make it, by default, throw an error when the second step needs to do anything significant (to be defined by number of recursions or similar, etc.). Of course, a flag could be passed to allow it to run to completion to handle the special cases, but real world applications will be more secure against DoS attacks if implementations throw by default.
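
A sketch of what that default could look like at the API level, with entirely made-up helper and parameter names (this illustrates the suggestion above, not anything from an existing spec or library):

class ComplexityLimitExceeded(Exception):
    pass

def canonicalize(dataset, max_deep_iterations=0, allow_unbounded=False):
    labels, hard = label_first_degree(dataset)          # step 1 (assumed helper)
    iterations = 0
    while hard:
        if not allow_unbounded and iterations >= max_deep_iterations:
            # throw by default: usable real-world data should rarely, if ever, get here
            raise ComplexityLimitExceeded(
                "second step exceeded the configured work limit; "
                "pass allow_unbounded=True to run to completion")
        labels, hard = label_n_degree(dataset, labels, hard)   # step 2 (assumed helper)
        iterations += 1
    return relabel(dataset, labels)                     # assumed helper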

We should consider the above recommendation (if others agree it is a good one) in deciding how important it is to spend time on the various different choices to be made in the second step. The graph isomorphism problem is presently thought to be in the computational complexity class NP-intermediate, and any algorithm we use to fully canonicalize an arbitrary dataset must solve this problem. I (and, I think, others working in this space) expect that real world, usable data doesn't hit the worst case. My understanding of Aidan's presentation is that even when huge sets of such data (from a variety of data published on the Web) have been processed, the worst case hasn't been hit. I feel confident suggesting it is extremely unlikely to occur in data such as Verifiable Credentials.

iherman commented 2 years ago

> I think it's important to note that the more complex second step is unlikely to be run for the vast majority of usable real world data -- or perhaps run with a single recursion for some small set of real world data. I would expect it to only run longer for graphs that are probably "poison graphs", i.e., they are not actually usable but instead are designed to make the algorithms run in worst-case time. My suggestion for our spec, regardless of the algorithm used, is to make it, by default, throw an error when the second step needs to do anything significant (to be defined by number of recursions or similar, etc.). Of course, a flag could be passed to allow it to run to completion to handle the special cases, but real world applications will be more secure against DoS attacks if implementations throw by default.

Huge 👍 from my side.

> We should consider the above recommendation (if others agree it is a good one) in deciding how important it is to spend time on the various different choices to be made in the second step. The graph isomorphism problem is presently thought to be in the computational complexity class NP-intermediate, and any algorithm we use to fully canonicalize an arbitrary dataset must solve this problem. I (and, I think, others working in this space) expect that real world, usable data doesn't hit the worst case. My understanding of Aidan's presentation is that even when huge sets of such data (from a variety of data published on the Web) have been processed, the worst case hasn't been hit. I feel confident suggesting it is extremely unlikely to occur in data such as Verifiable Credentials.

Actually, it may be interesting to have some (certainly not exhaustive) characterization of c14n-friendly graphs, so that users could create graphs that avoid the pitfalls. For example, I have the feeling (though that must be proven) that "well-behaved" graphs [1] are probably c14n friendly and, as argued in [1], this may cover a large percentage of RDF graphs out there (and I have again the feeling that all VCs are well behaved).

[1] http://dbooth.org/2013/well-behaved-rdf/Booth-well-behaved-rdf.pdf

In short, it says that a "well-behaved" graph is one that can be serialized in Turtle without any explicit blank node identifier. Is that the same as saying that the graph is a forest, i.e., a collection of tree structures?
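
For what it's worth, here is a rough Python sketch of one sufficient check along those lines; it illustrates the "forest" reading and is not a definition taken from Booth's paper. It tests that every blank node is the object of at most one triple and that the blank-node-to-blank-node links contain no cycle, so each blank node could be written with Turtle's [...] syntax. Triples are assumed to be (s, p, o) string tuples with blank nodes written "_:x".

def looks_forest_like(triples):
    is_bnode = lambda t: t.startswith("_:")
    object_count = {}
    edges = {}  # blank-node subject -> set of blank-node objects
    for s, p, o in triples:
        if is_bnode(o):
            object_count[o] = object_count.get(o, 0) + 1
            if object_count[o] > 1:
                return False          # shared blank node: would need a label
        if is_bnode(s) and is_bnode(o):
            edges.setdefault(s, set()).add(o)
    # cycle check over blank-node-to-blank-node edges (plain DFS)
    visiting, done = set(), set()
    def has_cycle(n):
        if n in done:
            return False
        if n in visiting:
            return True
        visiting.add(n)
        cyclic = any(has_cycle(m) for m in edges.get(n, ()))
        visiting.discard(n)
        done.add(n)
        return cyclic
    return not any(has_cycle(n) for n in edges)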

dlongley commented 2 years ago

@iherman,

> For example, I have the feeling (though that must be proven) that "well-behaved" graphs [1] are probably c14n friendly and, as argued in [1], this may cover a large percentage of RDF graphs out there (and I have again the feeling that all VCs are well behaved).

Yeah, "well behaving" graphs as described in that link are "c14n friendly". I think that there are also other "c14n friendly" graphs, though, so it's a little bit too narrow to limit that definition to "graphs that can be serialized in Turtle without any explicit B-node identifier". One way we could define "c14n friendly" graphs is to just have people use the c14n algorithm and see if it passes using default parameters :). Internally, we would throw as mentioned above. If there's a better way to provide such a definition that could be easily understood by people, I'm +1.

pchampin commented 2 years ago

> If there's a better way to provide such a definition that could be easily understood by people, I'm +1.

Like @iherman, I'm not confident that an exhaustive definition would be easy to understand (or check, apart from testing the first step of the c14n algorithm).

But there is value in characterizing several subsets of "c14n friendly" graphs. In addition to "graphs that can be represented without any bnode label in Turtle (or JSON-LD)", what comes to mind is "graphs where each blank node is involved in a 'unique' triple" (e.g. they all have an outgoing arc (b, p, o) where no other blank node has the same p, o).
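
Under one reading of that second characterization (restricting o to non-blank terms; the names below are illustrative only, and triples are (s, p, o) string tuples with blank nodes written "_:x"), a check could look like this:

def each_bnode_has_unique_arc(triples):
    is_bnode = lambda t: t.startswith("_:")
    bnodes = {t for triple in triples for t in triple if is_bnode(t)}
    arcs = {}  # (p, o) -> set of blank-node subjects with that outgoing arc
    for s, p, o in triples:
        if is_bnode(s) and not is_bnode(o):
            arcs.setdefault((p, o), set()).add(s)
    uniquely_arced = {next(iter(subs)) for subs in arcs.values() if len(subs) == 1}
    return bnodes <= uniquely_arced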

pchampin commented 2 years ago

Actually, I'm not even sure that "graphs that can be represented without any bnode label in Turtle (or JSON-LD)" are always c14n friendly 🤔... If they are not lean, wouldn't they need the 2nd step? E.g.:

<#pa> :knows
  [ a foaf:Person; foaf:name "dave" ],
  [ a foaf:Person; foaf:name "dave" ].

afs commented 2 years ago

probably "poison graphs"

We have to write a security section that highlights the risks.

One defence is time-bounding the algorithm execution.

That does not produce a predictable outcome as to how far the algorithm got.

> (to be defined by number of recursions or similar, etc.)

So I don't think we should define this in any formal sense, but rather describe techniques that may be applicable.

If there were formal points in the algorithm text, it would suggest, for any specific implementation, a deterministic outcome as to how far the implementation got. Leaving them out keeps the algorithm text (a tiny bit) simpler.

aidhog commented 2 years ago

Hi all, regarding the algorithms, as Dave mentions, they both have a lot of similarities, particularly in the first step, which involves hashing local information for each blank node (the triples/quads in which they directly appear). One practical difference is that the blabel algorithm (the one I worked on) only works for triples/graphs, not quads/datasets (but the extension would be straightforward I think).

The second step of both algorithms, which involves distinguishing blank nodes with seemingly similar local information, has a similar basis, but proceeds in quite a different way. The second step in blabel is just to run the first step multiple times (see the sketch below). If there are still blank nodes that cannot be distinguished, a third step is invoked that tries to artificially distinguish blank nodes and looks at the graphs that are created. In URDNA2015, the second step is to create paths from blank nodes to other (thus far non-canonically labelled) blank nodes, and to induce an order on them.
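
To illustrate the "run the first step multiple times" flavour, here is a rough colour-refinement-style sketch, not blabel's actual definition; the quad/term handling follows the same assumptions as the earlier sketch (quads as string tuples, blank nodes written "_:x"):

import hashlib

def refine_hashes(quads, initial_hashes):
    hashes = dict(initial_hashes)    # blank node -> hash from the first step
    def groups(h):
        inv = {}
        for b, v in h.items():
            inv.setdefault(v, set()).add(b)
        return set(map(frozenset, inv.values()))
    while True:
        new_hashes = {}
        for b in hashes:
            lines = sorted(
                " ".join("_:a" if t == b
                         else hashes.get(t, t)   # neighbour's previous-round hash
                         for t in q)
                for q in quads if b in q)
            # include the node's own previous hash so the partition only ever refines
            data = "\n".join([hashes[b]] + lines).encode("utf-8")
            new_hashes[b] = hashlib.sha256(data).hexdigest()
        # stop once another round no longer distinguishes any more blank nodes
        if groups(new_hashes) == groups(hashes):
            return new_hashes
        hashes = new_hashes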

The effects of those differences on practical issues like efficiency are not super clear to me. It's possible one or the other algorithm might be more efficient.

Regarding the need for the second step, indeed as @pchampin mentions, this case is not all that uncommon, and it is not related to well-behaved RDF, or similar. It may be necessary to enter the second step in the non-lean case that Pierre mentions, but there are also lean cases where the second step would be necessary. Specifically, the second step is necessary if there are blank nodes whose local information is identical when we ignore blank node labels. A "well-behaved" lean case would be:

<#pa> :knows
  [ a foaf:Person; ex:address [ ex:city ex:NY ; ex:postcode "5678" ] ],
  [ a foaf:Person; ex:address [ ex:city ex:LA ; ex:postcode "1234" ] ].

The two "people" blank nodes that <#pa> knows are identical in terms of local information, whose "signature" looks something like:

<#pa> :knows * . * a foaf:Person . * ex:address # .

where * indicates the current blank node, and # a different blank node. This would be addressed by the N-Degrees part of URDNA2015, or by recursively applying the first step of blabel.
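
For concreteness, a tiny sketch of computing such a "signature" for one blank node (triples as (s, p, o) string tuples with blank nodes written "_:x"; this is illustrative, not the encoding used by either algorithm):

def local_signature(bnode, triples):
    def mark(t):
        if t == bnode:
            return "*"
        return "#" if t.startswith("_:") else t
    return sorted(" ".join(mark(t) for t in triple) + " ."
                  for triple in triples if bnode in triple)

# Either "person" blank node above yields the same lines, which is why the
# first step alone cannot tell them apart:
#   ['* a foaf:Person .', '* ex:address # .', '<#pa> :knows * .']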

But I think there's different "levels" of hard here. In experiments over BTC 2014 (a crawl of RDF documents) with blabel, I don't have a count of how many graphs were distinguished straight away, but there were about one thousand graphs (a small fraction of all graphs, but a not insignificant absolute number) that entered into the "hardest parts" of the algorithm, i.e., that timed out after 10 minutes without the more advanced automorphism detection optimisation in blabel (deep in the guts of the "third part" of the algorithm).

Overall, to compare the algorithms or understand how they could be combined, it might help to have a clearer idea of what the requirements are, in terms of what sorts of cases we want to cover, what expectations there are for efficiency, etc. Maybe from there we could create test cases, experiments, etc.

ahmad88me commented 1 year ago

Do we have a comparison table somewhere? If not, shall we create one?

philarcher commented 1 year ago

The group has focused on the URDNA2015 work. We have spent several calls and email threads considering different algorithms and have arrived at the current state. There has been no push to change the basic algorithm since this issue was last commented on. Therefore, we propose that this issue be closed.