rust-itertools / itertools

Extra iterator adaptors, iterator methods, free functions, and macros.
https://docs.rs/itertools/
Apache License 2.0
2.71k stars 309 forks source link

Add `split_unfused()` to Itertools. #883

Closed shanecelis closed 6 months ago

shanecelis commented 7 months ago

Here is the API for the proposed addition.

API doc

Split an unfused iterator into a group of iterators.

Given an "unfused iterator"---that is an iterator contains elements after next() returns None---the split_unfused() function will provide each None-terminated sequence as its own iterator. Iterators can be dropped and consumed out of order just like .group_by(). If consumed in order, no memory is allocated.

When the "unfused iterator" returns None twice consecutively, it is considered exhausted.

use itertools::Itertools;

// This iterator returns None every third element until it reaches seven.
struct Frayed(u8);
impl Iterator for Frayed {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        self.0 += 1;
        (self.0 % 3 != 0 && self.0 <= 7).then_some(self.0)
    }
}
let split = Frayed(0).split_unfused();
let mut iters = split.into_iter();
let first = iters.next().unwrap();
let second = iters.next().unwrap();
let third = iters.next().unwrap();
assert!(iters.next().is_none());
// Iterators can be consumed out of order (but they will then allocate
// memory).
assert_eq!(third.collect::<Vec<_>>(), [7]);
assert_eq!(second.collect::<Vec<_>>(), [4, 5]);
assert_eq!(first.collect::<Vec<_>>(), [1, 2]);

Motivation

For some reason I can often write a unfused “None-terminated” iterator that represents multiple sequences. But they're unconventional and unwieldly. split_unfused() uses a lot of the machinery from chunk_by() to turn an unfused iterator into an iterator of iterators.

How's the name? The only other name I thought of was fray() but that seems not descriptive enough.

Tests

I added four tests as well.

scottmcm commented 7 months ago

Hmm, I'm not sure how I feel about having a bunch of fancy "store stuff" logic for this.

.by_ref().fuse() will give you an iterator that does the first part, then you could do that again on the original iterator after exhausting it to continue it.

How often do you need the "sometimes allocates, sometimes doesn't" part? Because if allocating is fine, might as well just collect the parts up-front, I'd think...


At a meta level, it's not clear to me that itertools should have support for unfused iterators, since they're rather acanonical, and I don't know that we'd want to implicitly encourage them.

shanecelis commented 7 months ago

Thanks for your consideration and the by_ref() tip.

I think the sometimes allocates, sometimes not is purely trying to respect the out of order iterator like chunk_by() does, which I was kind of floored by when I saw it implemented there. I probably would have just punted to panicking if someone wanted to go "backwards."

I'm happy to retract this pull request in light of the meta consideration. Maybe it can go in a deviant iterators crate unfusedtools or something.

Thanks for the great crate!

scottmcm commented 7 months ago

BTW, I'm just peanut gallery here, so wait for a real maintainer to respond before worrying about retracting :)

Philippe-Cholet commented 7 months ago

I'm no peanut gallery (he is a VIP for me really) but the new kid (latest maintainer).

I have on my TODO list (low priority) to make an issue about fusedness to clarify the situation for our iterators. One reference for me is #55.

I'm not particularly fond of using unfused iterators after it yielded None. For example, calling fold (or any other related method, see #755) consume the iterator entirely. Technically, with iter.by_ref().fold(..); iter... we can still use it, but are the internals still in a valid state? That does not seem guaranteed to me.

shanecelis commented 6 months ago

I started putting this functionality into a thing called frayed.