numist opened this issue 4 years ago
I suspect this might help a lot with `testByPercentageChangedBitBuffers` and `testRandomBitBuffers`. The resulting diff would be very large, but maybe that's still ok given that, from a human perspective, the collections do not share a lot in the way of high-level structure.
Really, this experiment is just hoisting the idea of n-gram diffing up a layer, using the n-gram as the element instead of as an internal heuristic, so the n-gram computation from the README would probably still be appropriate:
```
len(10000110 10111111) = 16
log₂(16) = 4
```
2-grams are useful for speeding up diffing a collection against its reverse, but for very large, highly structured collections with relatively small alphabets (like binary data) it should be possible to perform something more like the following:
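Purely as a sketch of the shape this could take (Swift, using the standard library's `difference(from:)`; the `ngrams` helper and every other name here is hypothetical), where `d` and `dEnd` are the diffs referenced in the next paragraph:

```swift
import Foundation

// Hypothetical helper: the overlapping n-grams of `c`, of size `sz`, in order.
func ngrams<C: RandomAccessCollection>(_ c: C, size sz: Int) -> [ArraySlice<C.Element>] {
    let elements = Array(c)
    guard elements.count >= sz else { return [elements[...]] }
    return (0...(elements.count - sz)).map { elements[$0 ..< $0 + sz] }
}

let a: [UInt8] = [1, 0, 0, 0, 0, 1, 1, 0]
let b: [UInt8] = [1, 0, 1, 1, 1, 1, 1, 1]

// n-gram size from the README computation above: log₂(len(a) + len(b)) = log₂(16) = 4
let sz = Int(log2(Double(a.count + b.count)))

// Diff the n-gram sequences rather than the raw elements. Offsets in `d` are
// n-gram offsets, which double as element offsets for the *start* of each
// changed region...
let d = ngrams(b, size: sz).difference(from: ngrams(a, size: sz))

// ...and diffing the reversed collections pins down the *end* offsets.
let dEnd = ngrams(b.reversed(), size: sz).difference(from: ngrams(a.reversed(), size: sz))
```

(An n-gram at offset `i` in `d` starts at element offset `i`; offsets in `dEnd` count from the ends of the collections.)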
The offsets in the diffs `d` and `dEnd` should contain all the offset information needed to produce a diff that is valid from `a` to `b`.

The downside to this approach is that it is highly dependent on order to work well. Diffs between randomized binary collections that could have a 50% match rate under element-level diffing (by editing all the `1`s in order to match all the `0`s) would produce diffs that are nearly `n` removals and `m` insertions.
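To make that degenerate case concrete, a hypothetical illustration reusing the `ngrams` sketch above:

```swift
// Two independent random bit buffers.
let x = (0..<512).map { _ in UInt8.random(in: 0...1) }
let y = (0..<512).map { _ in UInt8.random(in: 0...1) }

// An element-level diff can match at least roughly half the bits: every `0`
// in `x` can pair, in order, with a `0` in `y` by editing the `1`s around it.
let elementDiff = y.difference(from: x)

// A 4-gram diff only matches where whole 4-bit windows coincide, so it
// typically reports far more removals and insertions.
let gramDiff = ngrams(y, size: 4).difference(from: ngrams(x, size: 4))
print(elementDiff.removals.count, gramDiff.removals.count)
```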
But that might not be a bad thing. One performance goal this could bring within reach is the ability to diff `btree.f55ea8f456.c` and `btree.79ce96ab39.c` by character (or even to diff one of them by character against its reverse).

I'm not actually very confident in this idea though; n-grams are a comparison multiplier, and `na` is still going to be size `n - sz`. But it's worth writing the idea down, at least.
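For a rough sense of scale (hypothetical numbers, applying the README computation): with `n = m = 2²⁰`, the computation gives `sz = log₂(2²¹) = 21`; `na` then holds `2²⁰ - 21 + 1` n-grams, barely fewer than the raw elements, and every n-gram comparison can touch up to 21 elements.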