numist / Diffing-Explorations

A playground for diffing experiments along with the algorithms and data structures needed to support them
9 stars 0 forks source link

Experiment: diff by ngram #10

Open numist opened 4 years ago

numist commented 4 years ago

2-grams are useful for speeding up diffing a collection against its reverse, but for very large, highly structured collections with relatively small alphabets (like binary data) it should be possible to perform something more like:

na = [a[0..<sz], a[1..<(sz+1)], …, a[a.count-(sz+1)..<a.count]]
nb = [b[0..<sz], b[1..<(sz+1)], …, b[b.count-(sz+1)..<b.count]]
d = difference(from:na, to: nb)
dEnd = difference(from:na.last, to: nb.last)

The offsets in the diffs d and dEnd should contain all the offset information needed to produce a diff that is valid from a to b.

The downside to this approach is that it is highly dependent on order to work well. Diffs between randomized binary collections that could have a 50% match rate (by editing all 1s in order to match all the 0s) would produce diffs that are nearly n removals and m insertions.

But that might not be a bad thing. One performance goal this could bring within reach is the ability to diff btree.f55ea8f456.c and btree.79ce96ab39.c by character (or even to diff one of them by character against its reverse)

I'm not actually very confident in this idea though; ngrams are a comparison multiplier and na is still going to be size n - sz.

But it's worth writing the idea down, at least.

numist commented 4 years ago

I suspect this might help a lot with testByPercentageChangedBitBuffers and testRandomBitBuffers. The resulting diff would be very large, but maybe that's still ok given that from a human perspective the collections do not share a lot in the way of high-level structure.

Really, this experiment is just hoisting the idea of n-gram diffing up a layer, using it as the element instead of an internal heuristic, so the n-gram computation from the README would probably still be appropriate:

len(10000110 10111111) = 16
log₂(16) = 4