johannhof / difference.rs

Rust text diffing and assertion library
https://docs.rs/difference
MIT License
242 stars, 33 forks

split words via regex #16

Open colin-kiegel opened 7 years ago

colin-kiegel commented 7 years ago

difference can currently split via a single character, where " " is suggested to achieve word-level splits.

However, other whitespace, such as tabs, does not lead to word splits. Punctuation does not lead to a word split either.

Example:

Therefore difference will treat these strings as completely different, since no word is identical.
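To make the problem concrete, here is a minimal sketch using only the standard library. The two strings are made up for illustration (they are not the ones from the original example), and `split_tokens` and `shared_tokens` are hypothetical helpers, not part of the crate.

```rust
/// Splits on one literal separator, similar to passing " " as the split.
fn split_tokens<'a>(text: &'a str, sep: &str) -> Vec<&'a str> {
    text.split(sep).collect()
}

/// Counts the tokens of `a` that also occur somewhere in `b`.
fn shared_tokens<'a>(a: &[&'a str], b: &[&'a str]) -> usize {
    a.iter().filter(|token| b.contains(*token)).count()
}
```

With `"hello\tworld"` and `"hello, world ho!"`, splitting on `" "` gives `["hello\tworld"]` and `["hello,", "world", "ho!"]`. The tab keeps the first string in one piece and the comma sticks to `hello`, so the two token lists share nothing.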

Suggestion:

So, difference would be able to detect some overlap in the given example and only treat ho! as an insertion! :-)

The difference crate could also export a reasonable default regexp to split words. However, what you want will most likely depend very much on the context.
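A rough sketch of what pattern-based splitting could look like, again with made-up inputs. To stay dependency-free it uses a character predicate as a stand-in for a word regex such as `\w+`. A real implementation would more likely take a compiled regex and collect its matches. `split_words` and `default_word_char` are hypothetical names, not an existing API.

```rust
/// Stand-in for a word regex: every maximal run of characters accepted by
/// `is_word_char` becomes one token. Everything between runs is treated as
/// a separator and dropped.
fn split_words(text: &str, is_word_char: fn(char) -> bool) -> Vec<&str> {
    text.split(|c: char| !is_word_char(c))
        .filter(|token| !token.is_empty())
        .collect()
}

/// One possible "reasonable default": letters, digits and underscore.
fn default_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}
```

With this default, `"hello\tworld"` becomes `["hello", "world"]` and `"hello, world ho!"` becomes `["hello", "world", "ho"]`, so a diff over these tokens only has to report one trailing insertion. Note that this particular default drops the punctuation entirely, which may or may not be what a caller wants.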


This suggestion is based on git diff --word-diff-regex=<regex>

Use <regex> to decide what a word is, instead of considering runs of non-whitespace to be a word.

Every non-overlapping match of the <regex> is considered a word. Anything between these matches is considered whitespace and ignored(!) for the purposes of finding differences. You may want to append |[^[:space:]] to your regular expression to make sure that it matches all non-whitespace characters. A match that contains a newline is silently truncated(!) at the newline.
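The effect of appending `|[^[:space:]]` can be sketched without a regex engine. The function below hard-codes a pattern roughly like `[[:alnum:]_]+|[^[:space:]]`. It is Unicode-aware, which is my assumption and not necessarily what git does. `git_style_tokens` is a hypothetical name.

```rust
/// Emulates "every match is a word, everything between matches is ignored"
/// for a pattern of the form `<word chars>+|<any non-whitespace char>`.
fn git_style_tokens(text: &str) -> Vec<&str> {
    let mut tokens = Vec::new();
    let mut word_start: Option<usize> = None;

    for (idx, ch) in text.char_indices() {
        if ch.is_alphanumeric() || ch == '_' {
            // Start or continue a run of word characters.
            if word_start.is_none() {
                word_start = Some(idx);
            }
            continue;
        }
        // Any other character ends the current run, if there is one.
        if let Some(start) = word_start.take() {
            tokens.push(&text[start..idx]);
        }
        // The `|[^[:space:]]` alternative: punctuation becomes its own
        // token, while whitespace falls between matches and is dropped.
        if !ch.is_whitespace() {
            tokens.push(&text[idx..idx + ch.len_utf8()]);
        }
    }
    // Flush a word run that reaches the end of the input.
    if let Some(start) = word_start {
        tokens.push(&text[start..]);
    }
    tokens
}
```

For the made-up input `"hello, world ho!"` this yields `["hello", ",", "world", "ho", "!"]`. Without the extra alternative, the `,` and `!` would fall between matches and be ignored when diffing.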

johannhof commented 7 years ago

Sounds reasonable, but I won't have time to work on this in the near future. Happy to take + review PRs. :)

jayzhan211 commented 4 years ago

Is this the reason why I get edit distance 2 on a="9", b="99" when split is " " or "\n"?

version 2.0.0

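    use difference::assert_diff; // macro exported by the difference crate

    #[test]
    fn test_diff() {
        init(); // setup helper from my own test module, not shown here
        let a = "9";
        let b = "99";
        assert_diff!(a, b, " ", 0); // fails: the reported distance is 2
    }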

If the edit distance is based on LCS, this should be 1, the same as with split="".

'assertion failed: edit distance between "9" and "99" is 2 and not 0, see diffset above'
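A small sketch of why the separator matters here, assuming the distance counts one step per removed or inserted token. That is an assumption about the crate, not something confirmed in this thread. The helpers are hypothetical and nothing below calls the crate itself.

```rust
/// Splits `text` on `sep`. An empty separator means one token per character.
fn tokens(text: &str, sep: &str) -> Vec<String> {
    if sep.is_empty() {
        text.chars().map(|c| c.to_string()).collect()
    } else {
        text.split(sep).map(str::to_string).collect()
    }
}

/// Length of the longest common subsequence of two token lists
/// (classic dynamic-programming table).
fn lcs_len(a: &[String], b: &[String]) -> usize {
    let mut table = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..a.len() {
        for j in 0..b.len() {
            table[i + 1][j + 1] = if a[i] == b[j] {
                table[i][j] + 1
            } else {
                table[i][j + 1].max(table[i + 1][j])
            };
        }
    }
    table[a.len()][b.len()]
}

/// Removed tokens plus inserted tokens: every token outside the LCS
/// costs one step.
fn token_distance(a: &str, b: &str, sep: &str) -> usize {
    let (a, b) = (tokens(a, sep), tokens(b, sep));
    a.len() + b.len() - 2 * lcs_len(&a, &b)
}
```

Under that assumption, `token_distance("9", "99", " ")` is 2, because `"9"` and `"99"` are each a single token and they differ, giving one removal plus one insertion. `token_distance("9", "99", "")` is 1, because the character lists `["9"]` and `["9", "9"]` share one token. So a distance of 2 with `" "` or `"\n"` is consistent with an LCS over whole tokens rather than over characters.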