rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0
2.72k stars 309 forks source link

Merge `MergeBy` and `MergeJoinBy` #701

Closed phimuemue closed 1 year ago

phimuemue commented 1 year ago

It seems that merge_join_by/MergeJoinBy is actually a more generic variant of merge/MergeBy.

However, currently, the MergeJoinBy and MergeBy are implemented independently (PutBack vs Peekable, fused behavior, requiring Ordering vs bool). The past showed that uniform implementations may be better.

This change should not introduce breaking changes for users: merge, merge_by and merge_join_by should stay usable as they are.

However, I'd imagine that the general merge_join_by could also accept a comparator returning bool (instead of Ordering) - this would free the users from having to deal with the EitherOrBoth::Both case. I.e. I suggest that users should be able to write:

use itertools; // 0.10.5
use itertools::{Either, EitherOrBoth};

fn main() {
    let numbers = [1,2,3,4,5];
    let strings = ["one", "two", "three", "four"];

    for x in itertools::merge_join_by(
        numbers,
        strings,
        |number, string| number.cmp(&string.len()),
    ) {
        match x {
            EitherOrBoth::Left(number) => {dbg!(number);},
            EitherOrBoth::Right(string) => {dbg!(string);},
            EitherOrBoth::Both(number, string) => {dbg!(number, string);},
        };
    }

    for x in itertools::merge_join_by(
        numbers,
        strings,
        |number, string| number < &string.len(), // <-- Note that we return bool instead of Ordering
    ) {
        match x {
            Either::Left(number) => {dbg!(number);},
            Either::Right(string) => {dbg!(string);},
            // <-- Note that we do not need to deal with the "Both" case, as our comparator returns bool
        };
    }
}

When we tackle this, we should - as a first step - probably move all the merge-related functions into an own module.

Philippe-Cholet commented 1 year ago

I experimented a bit and what you want is for MergeJoinBy<I, J, F> to have:

It's conflicting. But then if we add a type T to "MergeJoinBy" (I think it's a small breaking change):

pub struct MergeJoinBy<I: Iterator, J: Iterator, F, T>
    where F: FnMut(&I::Item, &J::Item) -> T
{ ... }

then we can have

impl<I, J, F> Iterator for MergeJoinBy<I, J, F, Ordering> // EitherOrBoth
impl<I, J, F> Iterator for MergeJoinBy<I, J, F, bool> // Either
Does it seems good to you? (changes in "src/merge_join.rs" and very little in "src/lib.rs"). EDIT: I can avoid duplication with a macro. EDIT 2: A bit unsure about what variant it should return in our new case: cmp_fn returning output variant
Ordering::Less EitherOrBoth::Left
Ordering::Greater EitherOrBoth::Right
Ordering::Equal EitherOrBoth::Both
false Either::Right ?
true Either::Left ?
phimuemue commented 1 year ago

Hi there. I'm not sure if I follow 100% wrt "conflicting". I think MergeJoinBy should not require the function's return type explicitly, because it can be inferred from F.

As such, I'd imagine this:

impl<I, J, F, ResultOfF> Iterator for MergeJoinBy
    where
        F: FnMut(&I::Item, &J::Item) -> ResultOfF,
        ResultOfF: OrderingOrBool<I::Item, J::Item>
{
    type Item = ResultOfF::Item;
}

trait OrderingOrBool<IItem, JItem> {
    type Item
}
impl<IItem, JItem> OrderingOrBool<IItem, JItem> for std::cmp::Ordering {
    type Item = EitherOrBoth<IItem, JItem>;
}
// similar for bool with Item = Either

I am not sure if this design hits an unsovable obstacle somewhere, but for me it looks a bit clearer than basically copy-pasting/macro-generating two impls for bool and Ordering separately. Did you research this possibility and run into problems?

Philippe-Cholet commented 1 year ago

I agree about not conflicting for the reason you mention, I was just being careful. TBH, I did not think of this trait-way, I like it. After giving it a try, it works with a few additions.

pub trait OrderingOrBool<I, J> {
    type Item;
    // All with `#[inline(always)]` in both impl
    fn into_cmp(self) -> Ordering;
    fn left(left: I) -> Self::Item;
    fn right(right: J) -> Self::Item;
    fn both(left: I, right: J) -> Self::Item;
}

To match against the result of the user function, I needed into_cmp (for bool, if self { Ordering::Less } else { Ordering::Greater }). I added "left/right/both" methods to be able to create variants ("both" for bool only contains unreachable!("into_cmp never returns Ordering::Equal so this should never be called")).

phimuemue commented 1 year ago

Wow, that was fast. - Nice that it seems to work.

As for the additional methods: Of course, I didn't see the whole implementation, but it sounds reasonable.

Philippe-Cholet commented 1 year ago

Here is a commit for it. Seems reasonable to me too. I'm currently blocked about merging implementations of MergeJoinBy and MergeBy. Maybe you have an idea on what to do. Or I will try again.

Philippe-Cholet commented 1 year ago

First, I moved MergeBy into its own module "merge" as you suggested.

To merge implementations of "MergeBy" and "MergeJoinBy", I mostly struggled with the type "Merge" (how to discard "MergePredicate" and "MergeLte" for I::Item::le). Here, type Merge<I, J> = MergeBy<I, J, MergeLte>; becomes type Merge<I, J> = MergeBy<I, J, fn(&<I as Iterator>::Item, &<J as Iterator>::Item) -> bool>;. Then trait MergePredicate<T> and struct MergeLte can be removed. Finally, MergeBy<I, J, F> basically wraps MergeJoinBy<I, J, F, bool>. So in impl Iterator, I use Either::into_inner to convert Either<T, T> to T.

If this is satisfying, should I merge files "merge_join.rs" into the newly created "adaptors/merge.rs" ?

EDIT: In "OrderingOrBool" trait, I added fn size_hint(left: SizeHint, right: SizeHint) -> SizeHint; to fix the small size_hint issue mentionned above.

phimuemue commented 1 year ago

Thank you for staying in touch. I see there are some valid questions, so I suggest to first focus on supporting bool (since that's probably what's needed anyway if we want to merge MergeBy and MergeJoinBy, i.e. let's focus on your first commit (https://github.com/Philippe-Cholet/itertools/commit/5ad6f70e46c0e2404f86c5eaeed800adec27d9cc).

Please see my comments there for specifics, but probably most important: Is the fourth type parameter in MergeJoinBy really needed? (I would guess "no", because it could be inferred from F.)

In addition, this commit introduced several trait bounds that look unneeded to me. We should add only the trait bounds required to support bool and these that keep the code in line with what's already there.

The idea of your transformation (i.e. having left, right, both as customization points) looks correct, but the unreachable makes me think if there's a way to refactor this to get rid of it. Did you try something in this direction?

As said, let's keep it at supporting bool for now, but here's some thoughts regarding your questions anyway:

"MergeJoinBy" don't have impl FusedIterator while "MergeBy" did. Not sure what to do here.

I think we should fuse the input iterators (i.e. adapt MergeJoinBy's strategy and document this. (Sidenote: That's exactly the type of problem that would be avoided if the structs had been merged.)

(MergeJoinBy<I, J, F, bool> and) MergeBy<I, J, F> lost some size_hint information because "EitherOrBoth" sometimes use one element of each iterator in one item (Both variant) while "Either" does not. Unless I find a fix.

Good catch. Maybe that precludes merging the structs... We'll have to look into that.

Should I #[inline] methods in impl Iterator for MergeBy ?

I think most of the time we rely on the compiler, but afaik there are no definitive guidelines about [inline].

Philippe-Cholet commented 1 year ago

You might have missed (or not) an edit of my comment, a fix about size_hint, seems simple enough to me. Instead of creating a trait for Ordering/bool, maybe we can create a trait for FnMut(&I, &J) -> T, I'll come back to you.

Philippe-Cholet commented 1 year ago

I did not find a way to have only three types. I was going with a trait MergePredicate (idea from how MergeBy currently works)

pub trait MergePredicate<I, J> { // or <I, J, T>
    type Item;
    // fn merge_pred(&mut self, left: &I, right: &J) -> T; // return self(left, right)
    fn merge(&mut self, left: I, right: J) -> (Option<I>, Option<J>, Self::Item);
    // (item to put back left, same for right, next_item)
    // then I can match against Ordering/bool variants, and don't need one unreachable both method.
    fn left(left: I) -> Self::Item;
    fn right(right: J) -> Self::Item;
}

With MergePredicate<I, J>, I get an error about "conflicting implementations".

impl<I, J, F: FnMut(&I, &J) -> Ordering> MergePredicate<I, J> for F { ... }
impl<I, J, F: FnMut(&I, &J) -> bool> MergePredicate<I, J> for F { ... }

I suppose it worries that there is a type out there implementing both FnMut(&I, &J) -> Ordering and FnMut(&I, &J) -> bool. Seems impossible but it's an error. With MergePredicate<I, J, T> (T being Ordering or bool), this problem vanishes but I get another, a weird one IMO about:

impl<I, J, F, T> Iterator for MergeJoinBy<I, J, F>
    where I: Iterator,
          J: Iterator,
          F: MergePredicate<I::Item, J::Item, T>,
{ ... }

error[E0207]: the type parameter T is not constrained by the impl trait, self type, or predicates

The constraint on F does not constrain T at all apparently.

Hence I add T as a fourth type to MergeJoinBy. Only to get that it's not used (add a PhantomData<T> field leads to a compiler error in tests about "type annotation", even with constraint F: MergePredicate<I, J, T>) so with F: FnMut(&I::Item, &J::Item) -> T as a constraint on MergeJoinBy, then it works with MergePredicate<I, J, T> being a super trait of FnMut(&I, &J) -> T! Harsh right?!

So I think T should be inferred from F as you guessed but not by the current compiler or this trait way. It solves the issue you have with unreachable. I'll make a commit after dinner for you to see.

Philippe-Cholet commented 1 year ago

Here is a new commit (in a new branch) with trait MergePredicate<I, J, T>: FnMut(&I, &J) -> T (see previous comment). It better matches Ordering & bool variants. And merge_join_by constraints seems nicer ? If you don't want the previous OrderingOrBool or this MergePredicate, I'm open to suggestions. Not a priority at this moment but either way, there is some code duplication about putting back right/left elements.

Philippe-Cholet commented 1 year ago

Following the last commit, add a PhantomData field on MergeJoinBy would allow me to remove F: FnMut(...) -> T constraint on the struct definition, impl Clone and impl Debug of MergeJoinBy, all three that were added in the last commit, and MergePredicate does not need to be a supertrait. So the new PhantomData commit but you might prefer to see the diff if these last two were squashed: comparison here (the changes on constraints are way more natural/minimalist).

phimuemue commented 1 year ago

Thanks for the update. I like how you were able to avoid the unreachable. I would assume that the compiler is smart enough to inline fn merge and avoid the temporary Options.

I still think it should be possible without T in MergeJoinBy - see this prototype: https://github.com/phimuemue/rust-itertools/commit/072fe00e666f3d6fa5616af3176e2ae6e294516c. (I consider it important to not change MergeJoinBy's interface (if not strictly required), as it right now is already complex enough.)

Looking at this prototype, it seems that MergePredicate is not even needed anymore. If you want to continue working on this, you can use the prototype as a starting point, and get rid of the rough edges.

Philippe-Cholet commented 1 year ago

So I got rid of MergePredicate there. The diff between before the issue and now, it feels right.

I could avoid code duplication (code below is in four Iterator methods), or not (it's not that bad).

if let Some(left) = left { self.left.put_back(left); }
if let Some(right) = right { self.right.put_back(right); }

I could work on bringing struct MergeBy into src/merge_join.rs or another place ?

phimuemue commented 1 year ago

I'd say we should focus on supporting bool before moving stuff around, especially in light of https://github.com/rust-itertools/itertools/issues/702.

Philippe-Cholet commented 1 year ago

Use MergeJoinBy for the internal of MergeBy to merge their implementation in a single file is quite easy anyway.

All four points are done here (and small stupid duplicates removed). Although, I'm not sure the documentation sentence about "less" is clear enough. I changed the common example to show that the Both variant does not always split into consecutives Left, Right variants (when one of the two iterators is not ascending, it's not a precondition after all).

Philippe-Cholet commented 1 year ago

I should not have unwrapped Either in the first quickcheck (commit).

Philippe-Cholet commented 1 year ago

I tried with put_backs in merge, and while its definition certainly would be not pretty, the rest would be straightforward (just let elem = T::merge(&mut self, left, right);).

pub trait OrderingOrBool<L, R> {    // L R for types, I J for iterators below
    type MergeResult;
    fn left(left: L) -> Self::MergeResult;
    fn right(right: R) -> Self::MergeResult;
    fn merge<I, J, F>(it: &mut MergeJoinBy<I, J, F>, left: L, right: R) -> Self::MergeResult
        where I: Iterator<Item = L>,
              J: Iterator<Item = R>,
              F: FnMut(&L, &R) -> Self;    // match/if  (it.cmp_fn)(&left, &right)  {...}
    fn size_hint(left: SizeHint, right: SizeHint) -> SizeHint;
}
phimuemue commented 1 year ago

Hi, thanks so far. I realized we should keep the inners of OrderingOrBool private (so that we can change them later). We could employ a similar technique to HomogenousTuple which is a facade for TupleCollect.

Philippe-Cholet commented 1 year ago

OrderingOrBool is still private (we can't access it from outside itertools) because in "lib.rs", the module "merge_join" is not public, and OrderingOrBool is not pub used anywhere else. But HomogenousTuple is public (I can use itertools::traits::HomogeneousTuple; in a personal project).

phimuemue commented 1 year ago

Oh, cool. Maybe open a PR so we can have a last review?

Philippe-Cholet commented 1 year ago

Now that the PR is merged, we can merge implementations as you first wanted. The way I see it, trait MergePredicate<T> and struct MergeLte are useless (being respectively replaced by FnMut(&T, &T) -> bool and PartialOrd::le).

pub type Merge<I, J> = MergeBy<I, J, MergeLte>;
// becomes
pub type Merge<I, J> = MergeBy<I, J, fn(&<I as Iterator>::Item, &<J as Iterator>::Item) -> bool>;

Then

pub fn merge<I, J>(i: I, j: J) -> Merge<<I as IntoIterator>::IntoIter, <J as IntoIterator>::IntoIter>
    where I: IntoIterator,
          J: IntoIterator<Item = I::Item>,
          I::Item: PartialOrd
{
    // merge_by_new(i, j, MergeLte)
    // becomes
    merge_by_new(i, j, PartialOrd::le)
}

And MergeBy uses MergeJoinBy in its internal

pub struct MergeBy<I, J, F>
    where I: Iterator,
          J: Iterator<Item = I::Item>
{
    iter: MergeJoinBy<I, J, F>,
}

The rest is really straightforward. I know it will pass tests as I experimented before. If that seems reasonable to you then my question is where do we move all the code, in which file? Or we move it a last step? Or as a first?

phimuemue commented 1 year ago

Hi @Philippe-Cholet , I think this design has two drawbacks:

Without having looked at the code too closely, I think we should express the affected iterators all as instances of MergeJoinBy, and customize MergeJoinBy essentially by one parameter that

This could be realized by extending OrderingOrBool.

Philippe-Cholet commented 1 year ago

I kinda expected it about fn. About iter, I thought about it and I didn't think it was an issue because I don't think it would make sense to implement more than next/size_hint/count/last/nth. I was more annoyed by Either::into_inner decapsulation (see below, straightforward). I'm gonna try going your way.

impl<I, J, F> Iterator for MergeBy<I, J, F>
    where I: Iterator,
          J: Iterator<Item = I::Item>,
          F: FnMut(&I::Item, &I::Item) -> bool,
{
    type Item = I::Item;
    fn next(&mut self) -> Option<Self::Item> {
        self.iter.next().map(Either::into_inner)
    }
    fn size_hint(&self) -> SizeHint {
        self.iter.size_hint()
    }
    fn count(self) -> usize {
        self.iter.count()
    }
    fn last(self) -> Option<Self::Item> {
        self.iter.last().map(Either::into_inner)
    }
    fn nth(&mut self, n: usize) -> Option<Self::Item> {
        self.iter.nth(n).map(Either::into_inner)
    }
}
Philippe-Cholet commented 1 year ago

After moving Merge/MergeBy into merge_join module to manipulate them all easily, I'm just really stuck.

To be able to keep MergeLte and not use PartialOrd::le (and therefore not a function pointer in Merge type), we can't keep F: FnMut(&I::Item, &J::Item) -> T in impl Iterator for MergeJoinBy. But use something like F: MergePredicate<I::Item, J::Item, T> instead does not constrain T anymore leading to introduce T back in MergeJoinBy<I, J, F, T> with a phantom field. I don't see a way around this.

Both MergeBy and MergeJoinBy use functions returning bool so in order for MergeBy to be an instance of MergeJoinBy, the only way I saw was adding another parameter (const SAME: bool (meaning L == R ?), a const one hence breaking the MSRV) to distinguish them. You can see the changes here. I'll revert them if you can't salvage any of this.

Philippe-Cholet commented 1 year ago

About F: MergePredicate<I::Item, J::Item, T> not contraining T, the answer was F: MergePredicate<I::Item, J::Item, Out = T>. Then no phantom field. Commit here.

Philippe-Cholet commented 1 year ago

I think I'm finally on the right path!

To get rid of const SAME: bool, I wrap the function F in MergeFuncLR (result wrapped EitherOrBoth/Either<L,R>) and MergeFuncT (result unwrapped T). Those MergeFuncLR/T have a parameter T (phantom field) to not have conflicting implementations of MergePredicate<L, R>.

In merge_join_by definition, in order for the compiler to keep guessing left/right types, I kept F: FnMut(&Self::Item, &J::Item) -> T. But to not add F: MergePredicate<Self::Item, J::Item> that felt duplicate, I removed T: OrderingOrBool..., the user might lose OrderingOrBool information.

Commit there but you might prefer to see the current code.

phimuemue commented 1 year ago

Hi there, thanks for going through this. I do not have too much time right now, so just quick feedback:

I think this is indeed taking the right turns! I skimmed the code, and I saw that you have four merge predicates:

Looks simple and good. Maybe we can simplify some MergeFuncT/MergeFuncLR internals (does MergeFuncT need its second parameter?), but the scaffold so far seems right.

I think that MergeJoinBy it should impl FusedIterator, shouldn't it?

Philippe-Cholet commented 1 year ago

Thanks for the quick feedback.

I only commented out FusedIterator temporarily, I definitely think MergeJoinBy should implement it (commit) like many other iterators. After we solve the current issue, I'll probably make an issue (or use an old one if any) as I investigated it.

Good catch, MergeFuncT<F, bool> does not need the second (phantom) parameter since T always is bool so MergeFuncT<F> is enough. Therefore new commit in which I also simplify MergeFuncLR and MergeFuncT to "tuple structs".

Note 1: But MergeFuncLR needs it to avoid conflicting implementations of MergePredicate<L, R>.

impl<L, R, F: FnMut(&L, &R) -> Ordering> MergePredicate<L, R> for MergeFuncLR<F/*, Ordering*/> { ... }
impl<L, R, F: FnMut(&L, &R) -> bool> MergePredicate<L, R> for MergeFuncLR<F/*, bool*/> { ... }

Some impossible F could implement both FnMut(&L, &R) -> Ordering and FnMut(&L, &R) -> bool. Annoying but unavoidable IMO.

Note 2: impl<...> MergePredicate<T, T> for MergeFuncT<MergeLte> is mostly copy/paste of impl<...> MergePredicate<T, T> for MergeFuncT<F> with self.0(&left, &right) replaced by left <= right.

EDIT: I could write a new fn merge_join_by_new<I: IntoIterator, J: IntoIterator, F>(left: I, right: J, f: F) -> MergeJoinBy<I::IntoIter, J::IntoIter, F> to avoid MergeJoinBy { left: put_back(left.into_iter().fuse()), right: put_back(right.into_iter().fuse()), cmp_fn: ..., } three times when only cmp_fn changes.

Philippe-Cholet commented 1 year ago

From its trait documentation, we should implement FusedIterator for all iterators that always return None once they've done it once. But the real question IMO is on which conditions?!

Here, I think I made a mistake (being too fast). MergeJoinBy fuses both left and right iterators even if they were not before. So I think we should

impl<I, J, F, T> FusedIterator for MergeJoinBy<I, J, F>
    where I: Iterator,
          J: Iterator,
          F: MergePredicate<I::Item, J::Item, Out = T>,
{}

instead of

impl<I, J, F, T> FusedIterator for MergeJoinBy<I, J, F>
    where I: FusedIterator,
          // ~~~~~
          J: FusedIterator,
          // ~~~~~
          F: MergePredicate<I::Item, J::Item, Out = T>,
{}
phimuemue commented 1 year ago

Sounds right.

Philippe-Cholet commented 1 year ago

FusedIterator' small fix. Currently: lib.rs, merge_join.rs, big diff.


In merge_join_by definition, in order for the compiler to keep guessing left/right types, I kept F: FnMut(&Self::Item, &J::Item) -> T. But to not add F: MergePredicate<Self::Item, J::Item> that felt duplicate, I removed T: OrderingOrBool..., the user might lose "OrderingOrBool information".

I could

pub trait IsOrderingOrBool {}
impl IsOrderingOrBool for Ordering {}
impl IsOrderingOrBool for bool {}
// then add to the function definition
    T: merge_join::IsOrderingOrBool,

I think that would be clearer to people. What's your thought on that?

Then should I open a pull request for review, or do you see changes to do?

Philippe-Cholet commented 1 year ago

@phimuemue Little reminder.

phimuemue commented 1 year ago

Hi, thanks for keeping in touch.

What I see:

If all this passes our tests, can you open a PR?

Philippe-Cholet commented 1 year ago

Thank you too for this long discussion, I initially thought it would be quickier.

I don't like MergeFuncLR to be public either, so I thought of this InternalMergeJoinBy but what it means for MergeJoinBy is a fourth type (pub type MergeJoinBy<I, J, F, T> = InternalMergeJoinBy<I, J, MergeFuncLR<F, T>>;) which we previously tried to avoid. Because we can't write something like <F as FnMut(...)>::Output to infer T from F (from what I understand because Fn* traits aren't stabilized).

I don't see any way to avoid it, but maybe you do. Currently, Philippe-Cholet@c622606fa4c9d6ca85b485e6bd7cc8004a249e24 . I'll make a PR once you find a way or are ok with the 4th type.

phimuemue commented 1 year ago

I think I'm fine with InternalMergeJoinBy - I agree it's a bit ugly, but the ugliness is not leaked too much to the user. And I hope that - once we find something better - we can clean up the internals.