Open mitsuhiko opened 3 years ago
I found a way to easily implement the distinct-line heuristic. It doesn't modify the Myers algorithm at all, and rather builds on top of it. The algorithm works in three steps:
1. Encode the input lists of type `Vec<T>` into an "optimized" representation:

       enum Elems<T> {
           UniqueRun(Vec<T>),
           NormalElem(T),
       }

   In this representation, consecutive unique lines are concatenated into `UniqueRun`, and non-unique lines are just inside `NormalElem`. The whole sequence is then `Vec<Elems<T>>`.
2. Perform Myers diff on `Vec<Elems<T>>` instead of on `Vec<T>` directly.
3. Decode `Vec<Elems<T>>` back into `Vec<T>` to get the result.
This solves the super common pathological case of nearly-distinct files. However, if unique and non-unique lines are mixed together, it still fails.
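For illustration, the encode and decode steps above could be sketched roughly as follows. This is not the Pijul implementation. It assumes a line counts as "unique" when it does not occur anywhere in the other input, which is one plausible reading of the heuristic, and the function names `encode` and `decode` are hypothetical. Step 2 (running Myers on the encoded sequence) is left out.

```rust
use std::collections::HashSet;
use std::hash::Hash;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Elems<T> {
    UniqueRun(Vec<T>),
    NormalElem(T),
}

/// Step 1 (hypothetical helper): collapse consecutive lines that never occur
/// in `other` into a single `UniqueRun`; keep every other line as `NormalElem`.
/// ASSUMPTION: "unique" means "absent from the other input".
fn encode<T: Clone + Eq + Hash>(lines: &[T], other: &[T]) -> Vec<Elems<T>> {
    let other_set: HashSet<&T> = other.iter().collect();
    let mut out = Vec::new();
    let mut run: Vec<T> = Vec::new();
    for line in lines {
        if other_set.contains(line) {
            // A shared line ends the current run of unique lines, if any.
            if !run.is_empty() {
                out.push(Elems::UniqueRun(std::mem::take(&mut run)));
            }
            out.push(Elems::NormalElem(line.clone()));
        } else {
            run.push(line.clone());
        }
    }
    if !run.is_empty() {
        out.push(Elems::UniqueRun(run));
    }
    out
}

/// Step 3 (hypothetical helper): flatten the encoded sequence back into lines.
fn decode<T>(elems: Vec<Elems<T>>) -> Vec<T> {
    let mut out = Vec::new();
    for elem in elems {
        match elem {
            Elems::UniqueRun(run) => out.extend(run),
            Elems::NormalElem(line) => out.push(line),
        }
    }
    out
}
```

Under this assumption, two completely distinct files each collapse into a single `UniqueRun`, so the diff in step 2 only has to compare one element against one element. When unique and shared lines alternate, every run has length one and the encoded sequence is as long as the original, which matches the limitation described above.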
I implemented this algorithm for the Pijul crate. I don't know how easy/hard it would be to adapt for this crate. Just leaving this here in case it's useful.
Thank you for that @potocpav. I will have a look and evaluate this. The underlying design is still somewhat similar to pijul so it should be easy enough to adapt.
More specifically, this project started as a fork of "diffs", right?
Yep. See also https://github.com/mitsuhiko/similar/issues/1
GNU diff and others have some internal heuristics to bail if there are too many changes. There are basically two optimizations:
https://github.com/reviewboard/reviewboard/blob/master/reviewboard/diffviewer/myersdiff.py
To be more aligned with git it might make sense to implement the heuristics in the current `Algorithm::Myers` variant and have a secondary `Algorithm::MyersMinimal` which has these heuristics disabled (git calls the variants `myers` and `minimal`).
These heuristics are likely needed as currently `lcs` outperforms `myers` greatly if used on completely distinct files.