Experiment: diff by ngram

2-grams are useful for speeding up diffing a collection against its reverse, but for very large, highly structured collections with relatively small alphabets (like binary data) it should be possible to perform something more like:

na = [a[0..<sz], a[1..<(sz+1)], …, a[a.count-(sz+1)..<a.count]]
nb = [b[0..<sz], b[1..<(sz+1)], …, b[b.count-(sz+1)..<b.count]]
d = difference(from:na, to: nb)
dEnd = difference(from:na.last, to: nb.last)

The offsets in the diffs d and dEnd should contain all the offset information needed to produce a diff that is valid from a to b.

The downside to this approach is that it is highly dependent on order to work well. Diffs between randomized binary collections that could have a 50% match rate (by editing all 1s in order to match all the 0s) would produce diffs that are nearly n removals and m insertions.

But that might not be a bad thing. One performance goal this could bring within reach is the ability to diff btree.f55ea8f456.c and btree.79ce96ab39.c by character (or even to diff one of them by character against its reverse)

I'm not actually very confident in this idea though; ngrams are a comparison multiplier and na is still going to be size n - sz.

But it's worth writing the idea down, at least.

numist / Diffing-Explorations

Experiment: diff by ngram #10