mitsuhiko / similar

A high level diffing library for rust based on diffs
https://insta.rs/similar
Apache License 2.0
923 stars 31 forks source link

Implement Myers Heuristics #15

Open mitsuhiko opened 3 years ago

mitsuhiko commented 3 years ago

GNU diff and others have some internal heuristics to bail if there are too many changes. There are basically two optimizations:

https://github.com/reviewboard/reviewboard/blob/master/reviewboard/diffviewer/myersdiff.py

To be more aligned with git it might make sense to implement the heuristics in the current Algorithm::Myers variant and have a secondary Algorithm::MyersMinimal which has these heuristics disabled (git calls the variants myers and minimal).

These heuristics are likely needed as currently lcs outperform myers greatly if used on completely distinct files.

potocpav commented 2 years ago

I found a way to easily implement the distinct-line heuristic. It doesn't modify the Myers algorithm at all, and rather builds on top of it. The algorithm works in three steps:

1. Encode the input lists of type Vec<T> into an "optimized" representation:

enum Elems<T> {
    UniqueRun(Vec<T>),
    NormalElem(T),
}

In this representation, consecutive unique lines are concatenated into UniqueRun, and non-unique lines are just inside NormalElem. The whole sequence is then Vec<Elems<T>>.

2. Perform Meyers diff on Vec<Elems<T>>, instead of Vec<T> directly.

3. Decode Vec<Elems<T>> back into Vec<T> to get the result.

This solves the super common pathological case of nearly-distinct files. However, if unique and non-unique lines are mixed together, it still fails.

I implemented this algorithm for the Pijul crate. I don't know how easy/hard it would be to adapt for this crate. Just leaving this here in case it's useful.

mitsuhiko commented 2 years ago

Thank you for that @potocpav. I will have a look and evaluate this. The underlying design is still somewhat similar to pijul so it should be easy enough to adapt.

P-E-Meunier commented 2 years ago

More specifically, this project started as a fork of "diffs", right?

mitsuhiko commented 2 years ago

Yep. See also https://github.com/mitsuhiko/similar/issues/1