Run tests against RDF normalization test suite

dlongley commented 6 years ago

Hi @iherman! Thanks for doing this PoC!

Do you think you could run this implementation against some of the tests from the RDF normalization test suite? The expected output will be different of course from the existing algorithms, but a number of the tests relabel blank nodes in isomorphic datasets such that the same output should be generated.

It would be especially interesting to see how it does with the "evil" tests:

https://github.com/json-ld/normalization/blob/gh-pages/tests/test044-in.nq https://github.com/json-ld/normalization/blob/gh-pages/tests/test045-in.nq https://github.com/json-ld/normalization/blob/gh-pages/tests/test046-in.nq

Each of these should produce the same output.

iherman commented 6 years ago

@dlongley this is mostly an ack to your message; I am currently on a trip, ie, I cannot do anything more seriously. I had actually copied your full test suite on my machine already, but I did not have yet the time to systematically run my version on all of them.

That being said: these mean and evil examples seem to break my implementation, that much is sure. I do not know whether it is a problem with the algorithm or my implementation; most probably the latter...

Thanks.

iherman commented 6 years ago

Hey @dlongley: it needed some time after my return. Actually, your evil tests have revealed two very nasty bugs, one related to a fix I did while I was on my trip (never a good idea!!). But I found them and updated the repository.

The result is yes, those three pass and produce the same canonical graph. It obviously goes through a longer search so it takes 4-5 seconds on my machine. The paper of Aidan has an optimization steps that does not change the mathematics, but speeds up things significantly, but I did not implement it (so far?).

B.t.w., Aidan also ran these tests on his (Java) implementations, with proper results; his implementation includes the optimization, of course, and they take (per his email) about 100ms.

dlongley commented 6 years ago

@iherman,

That's good. We should, at some point, benchmark the various implementations (of this algorithm and the others) against the test suite to see what the landscape looks like. I think the current JavaScript implementation (of URDNA2015) handles the evil tests in ~40ms or less -- but we don't know what the hardware characteristics are here on the various machines so we can't yet figure out a real comparison or determine what's going on. I'm also aware that some significant optimizations could be made to the JavaScript implementation.

I think it's more important that the implementations do a good job with the common cases, of course. It sounds like we could start producing some useful data here though!

aidhog commented 6 years ago

Hey all, just to mention that if needed, I could contribute some more difficult synthetic test cases (e.g., where the blabel algorithm I implemented fails). I'm not sure how important such cases are in practice or for the definition of a standard, but let me know if such tests would be of interest. They are essentially some of the difficult cases from the graph isomorphism community. (And yes, they are very weird.)

dlongley commented 6 years ago

@aidhog,

Yes, those would be of interest. If you could, please contribute them to the test suite so we can at least evaluate them and how the implementations react when they are encountered.

aidhog commented 6 years ago

@dlongley, okay! So I have 216 synthetic test cases for varying sizes of the following classes of graph:

20 x 2D-Grids
18 x 3D-Grids
99 x Cliques
27 x Rook graphs
25 x Miyazaki graphs
27 x Triangle graphs

How would we proceed? I would be reluctant to just dump them into the current test suite since I guess the current examples are selected with use-cases in mind? Perhaps they should be uploaded separately elsewhere? (Also they are currently .nt rather than .nq.)

aidhog commented 6 years ago

They are not so large, so I can attach them here I guess.

eval-rdf.zip

These were selected and "converted" to RDF from benchmarks for Bliss. Credit to Petteri Kaski for creating the original undirected graphs.

iherman commented 6 years ago

@dlongley

That's good. We should, at some point, benchmark the various implementations (of this algorithm and the others) against the test suite to see what the landscape looks like. I think the current JavaScript implementation (of URDNA2015) handles the evil tests in ~40ms or less -- but we don't know what the hardware characteristics are here on the various machines so we can't yet figure out a real comparison or determine what's going on. I'm also aware that some significant optimizations could be made to the JavaScript implementation.

I agree. Just to make it clear: as I said in my readme, I do not consider my implementation to be "production quality": I am sure it would need some profiling (which I did not do) and find the points where it could be speeded up. Let alone the fact that I do not consider myself an expert Javascript programmer, i.e., I am sure I did fell into some pitfalls in this respect...

iherman commented 6 years ago

@aidhog @dlongley I am not sure where to put these tests; it would be good to gather them at one place. I think that the test suite of @dlongley is, obviously, much more extensive and complete (I used some of those, not all reflected in the test suite of this repository). I could see merging the tests of @aidhog into those of @dlongley or, alternatively, I am happy setting up a separate repository where we can collect those. I am neutral...

davidlehn commented 6 years ago

Comments on the "bliss" files:

As stated on their website, these are "benchmarks". They look to be useful for performance testing, but how do we use them to determine algorithm correctness?
I'm guessing we wouldn't need to run all of those tests just to check correctness. A much smaller subset would likely do just fine for that purpose.
Even for benchmarking, probably don't need to run all of the tests of any particular type.
We'd need to provided attribution for these tests and ensure it's ok to use them in our test suites. Unclear from that site what the license on the data is (is that even handled the same as code?).
We'd want to convert to n-quads. Considering the size of the larger tests, it probably makes sense to re-convert from the source format and use simpler strings. (ie, "http://ex.org/p" is longer than needed for this use case)
As long as we're talking about converting to n-quads and perhaps changing the contents anyway, maybe we could figure out how to deal with the entire bliss suite vs just a selected part.

aidhog commented 6 years ago

@davidlehn, I will respond in order with some thoughts.

Regarding benchmarks versus correctness: Indeed these are all intended as benchmarks, but one can define also a correctness framework that (1) shuffles the triples, and (2) renames the blank nodes one-to-one with pseudo-random labels. This works for any input graph and covers all the possible variants of isomorphic inputs (all the possible positive cases for that graph); hence it is trivial to compute any number of positive isomorphic cases for any input graph. The graphs offer a good correctness test since they will reach parts of code that a typical "real-world" RDF graph would usually not.
Regarding need to run all for testing correctness: Indeed probably not all graphs are necessary for correctness, but there is a tricky issue of hash collisions, which for me only arose in particular cases of cliques and Miyazaki graphs. Deciding which graphs are useful for correctness would be difficult beforehand (but maybe a couple per class could be chosen).
Regarding need to run all for benchmarking: In the benchmarks I performed, I started with the smallest and either (1) ran them all if they all finished in under 10 minutes or (2) stopped at the first (smallest) that exceeded 10 minutes. I agree not all are interesting, but again it may be difficult to know before running them which are interesting or not.
Regarding attribution: I agree that attribution should be provided and the situation clarified, but I don't think you can copyright a clique or a 2d-grid. :)
Regarding n-quads: To keep the cases difficult, I would propose to use one graph in each (maybe a blank node or IRI). If the triples are split into many graphs, cases become much more trivial. Regarding the length of the predicate, I don't think an absolute IRI can be much shorter? But it would not be a problem for me to change this (or to add a graph element).
Regarding entire bliss suite: I have Java code that can convert any instance.

Just to emphasise, I do not necessarily think these graphs should necessarily be merged with the existing test-cases; I think they could also be kept separate. Like @iherman , I am neutral and happy to help with whatever the group decides.

kortschak commented 3 years ago

I have implemented this in Go with both the simple approach and with the pre-splitting approach using algorithm 2 in the paper.

I have adopted the suggestion in the readme here for allowing hashing of quad labels. Thanks to @iherman for the excellent notes in that document, and to @aidhog for help dealing with some issues I had initially.

I've implemented a test harness that does a number of things using the test cases in https://github.com/json-ld/normalization/blob/gh-pages/tests (these can be licensed BSD 3-clause). It runs the hash on the original and then compares that with the original with:

reversed statement order
randomly permuted statement order
randomly purmuted statement order
blank labels replaced with a hash of the label text
1+4
2+4

Then test to ensure non-collision of different graphs are made, checking that the original does not match a derivative that:

has the last statement removed
has the first non-blank term hound replaced with a literal of the hash of the text throughout the graph
has the first to lexically ordered blank nodes fused into a single blank node

The algorithm as it is written work for all with no pre-splitting but failed for 44, 45 and 46 (the graphs in the OP) when pre-splitting is done. This turns out to be because the two splits that arise have the same hash, but lead to different final hashes. Adding a loop over co-equal hashes in the body of the distinguish function fixes this. 44-46 are good graphs to have in the test suite since they have this characteristic.

If there are other graphs that should be included please suggest.

I have also added benchmarking for a couple of cases (one plain and one evil). ~No~ Some non-obvious optimisation for reduction of allocation/GC pressure has been done ~yet~, which is where the greatest performance improvements are likely to be made.

name                                             time/op
IsoCanonicalHashes/test019-in.nq/decomp=false-8  26.4µs ± 5%
IsoCanonicalHashes/test019-in.nq/decomp=true-8   29.8µs ± 4%
IsoCanonicalHashes/test044-in.nq/decomp=false-8   429ms ± 2%
IsoCanonicalHashes/test044-in.nq/decomp=true-8    441ms ± 2%

name                                             alloc/op
IsoCanonicalHashes/test019-in.nq/decomp=false-8  16.5kB ± 0%
IsoCanonicalHashes/test019-in.nq/decomp=true-8   19.5kB ± 0%
IsoCanonicalHashes/test044-in.nq/decomp=false-8  82.1MB ± 0%
IsoCanonicalHashes/test044-in.nq/decomp=true-8   81.3MB ± 0%

name                                             allocs/op
IsoCanonicalHashes/test019-in.nq/decomp=false-8     215 ± 0%
IsoCanonicalHashes/test019-in.nq/decomp=true-8      249 ± 0%
IsoCanonicalHashes/test044-in.nq/decomp=false-8   1.51M ± 0%
IsoCanonicalHashes/test044-in.nq/decomp=true-8    1.51M ± 0%

For comparison, here is the same benchmark on a freshly minted URDNA2015/URGNA2012 written with the same level of care.

name                                             time/op
URNA/test019-in.nq/URDNA2015-8                   8.52µs ± 6%
URNA/test019-in.nq/URGNA2012-8                   8.06µs ± 8%
URNA/test044-in.nq/URDNA2015-8                   8.13ms ± 5%
URNA/test044-in.nq/URGNA2012-8                   6.46ms ± 2%

name                                             alloc/op
URNA/test019-in.nq/URDNA2015-8                   4.43kB ± 0%
URNA/test019-in.nq/URGNA2012-8                   4.29kB ± 0%
URNA/test044-in.nq/URDNA2015-8                   3.43MB ± 0%
URNA/test044-in.nq/URGNA2012-8                   3.17MB ± 0%

name                                             allocs/op
URNA/test019-in.nq/URDNA2015-8                     80.0 ± 0%
URNA/test019-in.nq/URGNA2012-8                     80.0 ± 0%
URNA/test044-in.nq/URDNA2015-8                    55.3k ± 0%
URNA/test044-in.nq/URGNA2012-8                    55.2k ± 0%

I suspect the difference in performance between the Java and Go implementation is down to the nature of the GC; AFAIU, Java has generational GC which makes recursive algorithm like this work better there. Go does not.

This work is a PR in the Gonum project, gonum/gonum#1524 (currently failing due to bad linting of generated code, but otherwise good).

iherman / canonical_rdf

Run tests against RDF normalization test suite #2