rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0
2.62k stars 298 forks source link

Add comm iterator (inspired by Unix comm command)? #963

Open VorpalBlade opened 5 days ago

VorpalBlade commented 5 days ago

I recently had the need for comparing sorted iterators to figure out which elements were only in the left, right or both iterators. This is basically the same operation as comm(1) but generalised to iterators.

So I wrote an implementation that works for my purposes. However, this is a side thing in the crate I'm working on, it doesn't fit as an exposed API from it. So I wonder if there is any interest in cleaning up this and going through the effort of making a PR for itertools (where it seems like a good fit).

Below is the basic implementation (excluding tests, Clone, Debug, FusedIterator and other things that aren't core to the API). I left the doctest in to show how it is used. There are probably ways to improve all of this, but lets see if there is any interest in this at all before I spend time on that.

/// Describes which iterator(s) the item is present in
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy, Hash)]
pub enum Comm<T> {
    /// The item is only present in the left iterator
    LeftOnly(T),
    /// The item is present in both iterator
    Both(T),
    /// The item is only present in the right iterator
    RightOnly(T),
}

/// An iterator that compares two sorted iterators of items
///
/// See [`comm`] for more information.
pub struct CommIterator<L: Iterator, R: Iterator> {
    left_iter: std::iter::Peekable<L>,
    right_iter: std::iter::Peekable<R>,
}

impl<L, R> Iterator for CommIterator<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
    L::Item: Ord,
    L::Item: PartialEq,
    L: FusedIterator,
    R: FusedIterator,
{
    type Item = Comm<L::Item>;

    fn next(&mut self) -> Option<Self::Item> {
        // We need to run next after the peek to placate the borrow checker.
        match (self.left_iter.peek(), self.right_iter.peek()) {
            (None, None) => None,
            (None, Some(_)) => {
                let val = self.right_iter.next().expect("We just peeked");
                Some(Comm::RightOnly(val))
            }
            (Some(_), None) => {
                let val = self.left_iter.next().expect("We just peeked");
                Some(Comm::LeftOnly(val))
            }
            (Some(l), Some(r)) => match l.cmp(r) {
                Ordering::Equal => {
                    let val = self.left_iter.next().expect("We just peeked");
                    self.right_iter.next().expect("We just peeked");
                    Some(Comm::Both(val))
                }
                Ordering::Less => {
                    let val = self.left_iter.next().expect("We just peeked");
                    Some(Comm::LeftOnly(val))
                }
                Ordering::Greater => {
                    let val = self.right_iter.next().expect("We just peeked");
                    Some(Comm::RightOnly(val))
                }
            },
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let left_hint = self.left_iter.size_hint();
        let right_hint = self.right_iter.size_hint();

        // At least the longer of the two iterators
        let min = left_hint.0.max(right_hint.0);
        // At most the sum of the lengths of the two iterators
        let max = left_hint.1.and_then(|l| right_hint.1.map(|r| l + r));

        (min, max)
    }
}

/// Compare two sorted slices of items
///
/// Example use case
/// ```
/// use itertools::{comm, Comm};
/// let results: Vec<Comm<&i32>> = comm(
///     [1, 2, 3, 4, 5, 8].iter(),
///     [3, 4, 5, 6, 7].iter()).collect();
/// assert_eq!(results, vec![
///    Comm::LeftOnly(&1),
///    Comm::LeftOnly(&2),
///    Comm::Both(&3),
///    Comm::Both(&4),
///    Comm::Both(&5),
///    Comm::RightOnly(&6),
///    Comm::RightOnly(&7),
///    Comm::LeftOnly(&8),
/// ]);
/// ```
pub fn comm<L, R>(left: L, right: R) -> CommIterator<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
    L::Item: Ord,
    L::Item: PartialEq,
    L: FusedIterator,
    R: FusedIterator,
{
    let left_iter = left.peekable();
    let right_iter = right.peekable();

    CommIterator {
        left_iter,
        right_iter,
    }
}
Philippe-Cholet commented 5 days ago

Similarly to what we did with WithPosition, I would suggest (Comm, T) over Comm<T>. We have PutBack that can be used instead of Peekable to avoid "expect"s. Then it makes me think of MergeJoinBy.

Have you tried to use .merge_join_by(Ord::cmp)? Maybe it would be too convoluted to use?

VorpalBlade commented 5 days ago

Similarly to what we did with WithPosition, I would suggest (Comm, T) over Comm<T>.

Is this to reduce code duplication due to monomorphization? That is the only reason to prefer one over the other that I can think of.

We have PutBack that can be used instead of Peekable to avoid "expect"s.

Oh cool, we live and learn.

Then it makes me think of MergeJoinBy. Have you tried to use .merge_join_by(Ord::ord)? Maybe it would be too convoluted to use?

It was too convoluted for me to find that. I kept looking for names like "diff", "common", "comm", etc. But looking at the example for merge_join_by it seems to fit. That Both returns, well, both values instead of de-duplicating is not something I need, but it should be fine (I'll just throw one away)

Philippe-Cholet commented 5 days ago

To get the T value out of Comm<T>, we need a match so an additional method, vs .1. Maybe there was other arguments. Don't remember right now.

If it fits but is a bit convulated, then maybe we can have create a method wrapping this .merge_join_by(Ord::cmp).

VorpalBlade commented 5 days ago

To get the T value out of Comm<T>, we need a match so an additional method, vs .1. Maybe there was other arguments. Don't remember right now.

Thank you, it is always good to learn about idiomatic API design.

If it fits but is a bit convulated, then maybe we can have create a method wrapping this .merge_join_by(Ord::cmp).

Using it is not convoluted, it works very well. What was convoluted was finding it. Perhaps adding some rustdoc aliases might help people locate it. I would suggest comm and diff:

Philippe-Cholet commented 4 days ago

I did not encounter any real use of doc aliases before, that's interesting.

I see the interest of having #[doc(alias = "comm", alias = "diff")] fn merge_join_by.... But I'm not sure it's enough, we could have an additional example in the doc of merge_join_by, what do you think? Do you want to make one?

@jswrenn @phimuemue Any objection to use doc aliases?

PS: I note that doc aliases requires Rust 1.48 and we just increased the MSRV from 1.43.1 to 1.63 so it can be used!

VorpalBlade commented 4 days ago

But I'm not sure it's enough, we could have an additional example in the doc of merge_join_by, what do you think? Do you want to make one?

It seems the current first example matches my situation (though it is using an explicit closure rather than just using Ord::cmp directly (which also works), and since I have clippy::redundant_closure_for_method_calls enabled it would suggest I switch to that instead).

Maybe it could be made more obvious with a comment (since it uses a range for one of the inputs, so it is not as obvious what it is doing). Something like:

itertools::assert_equal(
    // This performs a diff in the style of the Unix command comm(1), generalised
    // to arbitrary types rather than text.
    a.merge_join_by(b, |i, j| i.cmp(j)),
    vec![Both(0, 0), Left(2), Right(3), Left(4), Both(6, 6), Left(1), Right(9)]
);

The advantage of this is that if you just do Ctrl+F and type comm or diff on the page (as opposed to using the rustdoc search feature) you will find it.

(Rust crates often have excellent documentation, except for discoverability when you don't know the term you are looking for. That is not an easy problem, and the fact that rustdoc search doesn't do full text search doesn't help. At least itertools is based on a central trait, so almost everything can be found on a single HTML page. As such Ctrl-F works okay for it. Good luck finding things in crates like regex or clap if you don't know the correct name!)

Philippe-Cholet commented 4 days ago

Ord::cmp would indeed be a bit better. I just fixed some warnings in the doctests (thanks to a link you posted above) but there is no proper way to use clippy on them yet (https://github.com/rust-lang/rust/issues/56232).

I agree that rustdoc-search is sometimes not as helpful as it could be (especially for a complex crate like clap) but we are quite centralized with the Itertools trait.

So doc aliases and a comment, fine by me.

Philippe-Cholet commented 4 days ago

@VorpalBlade Done in #966, you don't need to do it yourself.