mitsuhiko / similar

A high level diffing library for rust based on diffs
https://insta.rs/similar
Apache License 2.0
949 stars 31 forks source link

`Index` trait bound makes it impossible to return an owned value #33

Open rocurley opened 2 years ago

rocurley commented 2 years ago

I'm working on json diffing. I have some representation of two jsons in memory, like a serde_json::Value. Obviously these cannot be directly diffed, since they're not array-like. So to diff them I need to flatten them into an array of json-lines. I'd rather not actually construct the array, however: I don't want the memory usage. Instead, I'd like to generate the lines on the fly.

I can almost do this by implementing the Index trait. It doesn't quite work, however, since I need to return a reference. Ideally, I'd like to be able to pass in an FnMut(usize) -> T or something along those lines.

I'm not sure if anybody but me will find this useful, TBH, but I figured there's no harm in asking.

mitsuhiko commented 2 years ago

@rocurley Why can you not produce a &str (or similar). What the Index trait expands to does need to be a reference, but I did not find it massively to be a problem so far. Do you have an example where it does not work for you currently?

rocurley commented 2 years ago

So my actual use case is diffing jsons. But you don't need the full complexity of json here to have this issue. Say you want to diff Vec<Vec<u8>>s. Now in principle you could just pass that in directly and be fine. But the diff will then consider the inner Vec<u8>s to be the unit of diffing, so the diffs will be too coarse. [[1,2]] -> [[1,2,3]] shouldn't say the entire input has changed, for example.

So one way to do this is to flatten the input into a vector representation that allows you to reconstruct the original input:

enum DiffElem {
    VecStart,
    Elem(u8),
}

fn to_diffable(xss : &Vec<Vec<u8>>) -> Vec<DiffElem> {
    let mut out : Vec<DiffElem> = Vec::new();
    for xs in xss {
        out.push(DiffElem::VecStart);
        out.extend(xs.iter().map(|x| DiffElem::Elem(*x)));
    }
    out
}

Now this will work, but requires making a copy of the entire input. But it doesn't seem strictly necessary. You could "index" into the diffable representation without ever constructing it:

// This implementation is ineffecient. To get a more effecient one, we'd
// annotate each subvector with the index of its VecStart, and use
// binary search to find the subvector we care about.
fn index_diffable(mut i : usize, xss : &Vec<Vec<u8>>) -> Option<DiffElem> {
    for xs in xss {
        if i == 0 {
            return Some(DiffElem::VecStart);
        }
        if i < xs.len() {
            return Some(DiffElem::Elem(xs[i]));
        }
        i -= xs.len();
    }
    None
}

But you can't do this with the Index trait.

mitsuhiko commented 2 years ago

I agree that this case is not great. Generally I'm not a huge fan of the Index trait as base. I have already considered alternatives before but never with too much success.

kirawi commented 2 years ago

I've been trying to implement a wrapper type to get ropey::RopeSlice to work with similar using the iterators it provides, but the problem is that they only return owned values with the exception of the Chunks iterator (which isn't what I need).

I have already considered alternatives before but never with too much success.

If you don't mind me asking, what did you try and why weren't they successful?

mitsuhiko commented 2 years ago

The original issue I had with Index wasn't even that it returns references but that it can be hard to get the bounds checks reliably eliminated (#19). So originally I was trying to do that.

To address this particular thing with borrows I tried to have a custom trait but making both borrows and owned values work is surprisingly complex and I did not go very far with it because I did not at all like what it did to the API. I'm not sure if there are clever ideas I haven't been thinking of yet that would make both cases work.

cessen commented 2 years ago

We're running into this in Helix (using Ropey) again.

Might I suggest adding a variant of the capture_diff() and capture_diff_deadline() functions that take a |usize| -> Item closure? That would give a lot of flexibility to client code to work with whatever weird containers/situations they might have.

Edit: or rather a |usize| -> Option<Item> closure, since the collections are presumably not infinitely long, ha ha. :-)

mitsuhiko commented 2 years ago

@cessen the issue is that this Index based API is all over the place. This goes really deep in the library.

I will play with it. Maybe I just break the internal algorithms API.

cessen commented 2 years ago

Ah, got it. If I get some time and the mood strikes, I might take a crack at that as well. If we're lucky, even if the API goes deep, it still might not be too hard to separate from the algorithm code itself. If we're lucky... :-)

mitsuhiko commented 2 years ago

The other issue with the callback API is that it can't borrow. So it's not even that the callback based API could be the base of the API.

kirawi commented 2 years ago

Perhaps a macro that could create two versions of the internal functions; one that borrows and one that returns an owned value?

mitsuhiko commented 2 years ago

So one could move somewhat to improving this by going all in on the IdentifyDistinct type. So something like this could be done:

let old = Rope::from_str("...");
let new = Rope::from_str("...");
let h = IdentifyDistinct::<u32>::new_from_iter(old.lines(), new.lines());
let diff = capture_diff(
    Algorithm::Myers,
    h.old_lookup(),
    h.old_range(),
    h.new_lookup(),
    h.new_range(),
);

The new_from_iter is easy to add, but the returned ops are still annoying to work with. For instance iter_changes again requires an Index bound.

mitsuhiko commented 1 year ago

So now that there are GATs that issue might be worth toying around with again.