Chars diff: DiffOp's length is different from str::len

danieljl commented 2 years ago

Hi,

First, thank you for this great crate.

I expected the length field of DiffOp to be the same as str::len, i.e. the length of the resulting bytes if the text is encoded in UTF-8. They turned out to be different. The former is instead the same as the number of Unicode scalar values (~ code points). Is this a bug or expected? If it's expected, is there a way to get the "bytes-length" from a DiffOp?

Minimal working example:

use similar::{DiffOp, TextDiff};

fn main() {
    let new = "á";
    let diff = TextDiff::from_chars("", new);

    let op = diff.ops()[0];
    // Only `new_len` is needed here; `..` skips the unused index fields.
    if let DiffOp::Insert { new_len, .. } = op {
        let real_new_len = new.len();
        let char_count = new.chars().count();
        println!("new_len = {new_len}, real_new_len = {real_new_len}, char_count = {char_count}");
    } else {
        unreachable!();
    }
}

The code above will output:

new_len = 1, real_new_len = 2, char_count = 1

Tested on v2.1.0 and main branch (236a299ff01b8d4bdfc95c6439c1302c8422ae13).

mitsuhiko commented 2 years ago

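That's intentional, or rather that's just how the system works: a chars diff compares sequences of chars, so the indices and lengths in a DiffOp count chars, not bytes. It can be annoying. The solution is to use the TextDiffRemapper, which maps the ops back onto slices of the original strings: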

use similar::TextDiff;
use similar::utils::TextDiffRemapper;

fn main() {
    let old = "";
    let new = "á";
    let diff = TextDiff::from_chars(old, new);
    let remapper = TextDiffRemapper::from_text_diff(&diff, old, new);
    let changes: Vec<_> = diff.ops()
        .iter()
        .flat_map(move |x| remapper.iter_slices(x))
        .collect();

    dbg!(changes);
}
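
If only the byte length is needed, a char-based index and length can also be converted with the standard library alone. The following is a minimal sketch, not part of similar: `char_range_to_byte_range` is a hypothetical helper name, and it assumes the index and length count chars of the given string, as they do for a chars diff.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of `similar`): converts a range expressed
/// in chars (start index + length, as reported by a chars diff) into the
/// corresponding byte range of `text`.
fn char_range_to_byte_range(text: &str, char_index: usize, char_len: usize) -> Range<usize> {
    // Byte offset of the n-th char, or `text.len()` when n is at or past the end.
    let byte_offset = |n: usize| {
        text.char_indices()
            .nth(n)
            .map_or(text.len(), |(offset, _)| offset)
    };
    byte_offset(char_index)..byte_offset(char_index + char_len)
}
```

For the example above (`new_index = 0`, `new_len = 1`, `new = "á"`), `char_range_to_byte_range(new, 0, 1)` yields `0..2`, so the byte length is 2, matching `new.len()`. Each call walks the string from the start, so for many ops the remapper is likely the better fit.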

danieljl commented 2 years ago

Thanks for the pointer!

mitsuhiko / similar

Chars diff: DiffOp's length is different from str::len — Bug or expected? #35