rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.98k stars 12.68k forks source link

Add `with_exact_size` and `with_size_hint` to `Iterator` #68995

Open LukasKalbertodt opened 4 years ago

LukasKalbertodt commented 4 years ago

Usually iterator adaptors do a fine job at keeping information about the length of an iterator. But once you throw something like filter in your iterator chain, the lower bound is 0, meaning that collecting into a collection can't properly pre-allocate. Also, some iterator adaptors (like chain) cannot implemented ExactSizeIterator although in many real world cases, the size is perfectly known.

Currently, working around those limitations is often tedious and requires quite a bit of additional code. I think it would be a lot nicer to just add a .with_exact_size(27) somewhere into your chain. So I'd like to propose adding the following to Iterator:

I would have created this as a PR, but I'm short on time and wanted to hear some opinions first.

hexane360 commented 4 years ago

Should ExactSize panic if the underlying iterator fails to yield exactly size values? Should SizeHint panic if the underlying iterator fails to yield at least lower values?

Centril commented 4 years ago

I recall suggesting this to @scottmcm, who was not a fan, in a private setting at one point. cc also @jswrenn re. https://github.com/rust-itertools/itertools/pull/341.

LukasKalbertodt commented 4 years ago

Should ExactSize panic if the underlying iterator fails to yield exactly size values? Should SizeHint panic if the underlying iterator fails to yield at least lower values?

I initially would have said: no, they should not panic. len() says it has the same safety guarantees as size_hint and size_hint stresses that it's basically just for optimization and nothing bad should happen if the returned size is incorrect.

But thinking about it again... maybe checking this might not actually incur any notable performance overhead and be useful. Lets consider the implementation for ExactSize:

struct ExactSize<I> {
    inner: I,
    size: usize,
}

impl<I: Iterator> Iterator for ExactSize<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        match self.inner.next() {
            None => None,
            Some(x) => {
                self.size -= 1;
                Some(x)
            }
        }
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.size, Some(self.size))
    }
}

This implementation does not contain those checks. Well, in debug mode, the integer underflow is checked. Which is good, as an underflow would usually result in "no you can't allocate a gazillion GB". This would be next() with those checks:

match self.inner.next() {
    None => {
        assert_eq!(self.size, 0);
        None
    }
    Some(x) => {
        self.size = self.size.checked_sub(1).unwrap();
        Some(x)
    }
}

The None case is a cold path, so the additional check there probably doesn't matter. The checked_sub in the Some branch might be fine, too.

Well, this is certainly something we should benchmark and test before stabilization.

the8472 commented 4 years ago

For maximal optimization you'd need some sort of unsafe fn set_trusted_len() that provides a TrustedLen wrapper.

scottmcm commented 4 years ago

size_hint is pretty situational, and it turns out that it's half-unused anyway. And while .with_size_hint(low, high) is of course safe, it's also really easy to get into a garbage-in-garbage-out situation with because an iterator doesn't need to work if the size_hint is wrong -- which is easy to happen if you do something like "the predicate in my filter is almost always true so I'll just hint that it's the same size as the input". Additionally, SizeHint<I> cannot propagate TrustedLen, so something like (0..100).with_exact_size(100) looks like you're being helpful but actually makes things worse.

So I've been more and more thinking that what's really needed is a different iterator API that's allowed to be wrong, but where being wrong just over- or under-allocates, rather than permitting incorrect behaviour. Then things like ResultShunt can say "you should allocate enough for everything", even though it might return fewer things if there's an Err. And if it's just one number, it could just be usize::saturating_added for Chain or whatever, rather than the current more-complicated dance. It has a useful default implementation of self.size_hint().0 as well, so we could move collect to using it immediately without regressing third-party iterators.

(I also still like .collect_into(Vec::with_capacity(100)), or similar, from https://github.com/rust-lang/rust/issues/45840.)