rust-lang / rust

In-place iteration results in too big allocations #120091

Closed TethysSvensson closed 5 months ago

TethysSvensson commented 9 months ago

(This bug report was inspired by this blog post https://blog.polybdenum.com/2024/01/17/identifying-the-collect-vec-memory-leak-footgun.html)

After #110353 landed, in-place iteration can reuse allocations in many more places. While this is a good thing, in some cases it can cause an overly large capacity for the destination vector.

Additionally, in some cases this will cause the destination vector to have a very large capacity even for non-shared allocations. In my opinion, this should never happen.

For an example see this code:

fn dbg_vec<T>(v: &Vec<T>) {
    println!(
        "vec data ptr={:?} len={} cap={}",
        v.as_ptr(),
        v.len(),
        v.capacity()
    );
}

fn main() {
    let v1 = (0u16..128).map(|i| [i; 1024]).collect::<Vec<_>>();
    dbg_vec(&v1);
    let v2 = v1.into_iter().map(|x| x[0] as u8).collect::<Vec<_>>();
    dbg_vec(&v2);
}

On stable this code works as expected, i.e., it produces two different vectors with reasonable lengths and capacities.

On beta, however, v2 will have a capacity of 262144, even though it does not share an allocation with v1. (That capacity matches the byte size of v1's buffer: 128 × 1024 elements × 2 bytes, reinterpreted as u8 slots.) If you remove the as u8 part, then the allocations will be shared, but the capacity is still overly large.

My suggested fix is:

Meta

I am running NixOS with rustup.

$ rustup --version --verbose
rustup 1.26.0 (1980-01-01)
info: This is the version for the rustup toolchain manager, not the rustc compiler.
info: The currently active `rustc` version is `rustc 1.77.0-nightly (6ae4cfbbb 2024-01-17)`
$ rustc +stable --version --verbose
rustc 1.75.0 (82e1608df 2023-12-21)
binary: rustc
commit-hash: 82e1608dfa6e0b5569232559e3d385fea5a93112
commit-date: 2023-12-21
host: x86_64-unknown-linux-gnu
release: 1.75.0
LLVM version: 17.0.6
$ rustc +beta --version --verbose
rustc 1.76.0-beta.5 (f732c37b4 2024-01-12)
binary: rustc
commit-hash: f732c37b4175158d3af9e2e156142ffb0bff8969
commit-date: 2024-01-12
host: x86_64-unknown-linux-gnu
release: 1.76.0-beta.5
LLVM version: 17.0.6
the8472 commented 9 months ago

On my machine, these reallocations do not appear to return the same allocation, so we might as well allocate up front to avoid a memcpy.

The pointer not being the same does not necessarily mean the allocation hasn't been reused. The allocator can call mremap to move it to a new virtual address which has a much lower cost than actually copying the bytes or faulting in new pages. This may also depend on the allocation size. For smaller pool-based allocations it might be an actual memmove and a mremap for larger ones.

The behavior regarding capacities in the remaining cases needs to either change or be documented somewhere.

Well, this is a non-guaranteed optimization so we can't document the exact behavior. But it probably makes sense to mention that something like this may happen.

Additionally, in some cases this will cause the destination vector to have a very large capacity even for non-shared allocations. In my opinion, this should never happen.

That's intended to allow vec-allocation recycling even when the type changes, which can mean the excess factor is infinite.

TethysSvensson commented 9 months ago

On my machine this does result in a memmove. This is the strace output of an execution:

29839 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f50437bd000
29839 write(1, "vec data ptr=0x7f50437bd010 len="..., 44) = 44
29839 mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f504377c000
29839 munmap(0x7f50437bd000, 266240)    = 0
29839 write(1, "vec data ptr=0x7f504377c010 len="..., 47) = 47
29839 munmap(0x7f504377c000, 266240)    = 0
TethysSvensson commented 9 months ago

This extra capacity also has negative performance impacts:

fn test_func() {
    let v1 = (0u16..0x4000).map(|i| [i; 32]).collect::<Vec<_>>();
    let v2 = v1.into_iter().map(|x| x[0] as u8).collect::<Vec<_>>();
    // Prevent the optimizer from removing the allocation
    assert_eq!(v2[0], 0);
}

fn main() {
    let now = std::time::Instant::now();
    for _ in 0..10_000 {
        test_func();
    }
    println!("{}", now.elapsed().as_millis());
}

This code takes more than 10 times as long on my machine if it was compiled with the beta compiler rather than the stable one.

kevincox commented 9 months ago

It seems that, much like capacity growth has a scaling factor, there should be some sort of limit on the unused capacity that optimizations like this can leave behind. Maybe there should be a general rule for Vec such as "The capacity will never automatically grow more than 2x (or some other number) the maximum elements that have been part of the Vec since the last explicit capacity operation."

So the manual capacity operations still work, but automatic capacity growth and allocation will not use more than some constant factor of the in-use memory.

It seems obvious to me that some rule is required here. If the next version of Rust made push allocate a thousand times extra capacity, it would "break" many programs. So there is an implicit contract here. It may make sense to make it explicit. IDK what the number we should promise is, maybe 2x or 4x. But it would be nice to write down some guarantee that can be relied on.

If we have that rule spelt out then we can decide if this operation should follow the rule, or have some sort of exception. But I think it should probably follow it. (Or maybe have a slightly relaxed threshold.)

the8472 commented 9 months ago

It seems obvious to me that some rule is required here.

Not necessarily. The capacity has already been allocated, so the cost has been paid. And vecs never shrink by themselves, even if you pop from them. So there's no guarantee that you always get minimal possible memory footprint.
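
A quick illustration of that existing behavior (the numbers here are illustrative, not from the issue):

fn main() {
    // A Vec keeps its capacity even after popping nearly all elements.
    let mut v: Vec<u32> = (0..1_000).collect();
    while v.len() > 1 {
        v.pop();
    }
    println!("len={} cap={}", v.len(), v.capacity()); // cap stays at 1000
}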

kevincox commented 9 months ago

I agree that this is a bit of a weird case, but I would argue that from the user's perspective this is a "new Vec". So it is somewhat surprising that it is holding on to arbitrary amounts of memory from another vector. It would be nice to provide some guarantees.

even if you pop from them. So there's no guarantee that you always get minimal possible memory footprint.

Please note that in my rule I said "maximum elements since ...". This would allow the current behaviour of pop not freeing memory. In other words, my rule only controls growth; it never implies that the Vec will shrink, other than through explicit capacity operations such as shrink_to_fit.

majaha commented 9 months ago

It seems obvious to me that some rule is required here.

Not necessarily. The capacity has already been allocated, so the cost has been paid. And vecs never shrink by themselves, even if you pop from them. So there's no guarantee that you always get minimal possible memory footprint.

The cost of an allocation is paid over its entire lifetime: its memory cannot be used by something else.

I pretty much agree with @kevincox and was about to post something similar: there should be a note in the docs for .collect::<Vec<_>>() that states that the resulting capacity can never be more than twice the length. (Or wherever in the docs would cover any other similar cases.)

the8472 commented 9 months ago

"surprising" is a matter of expectations. Different people have different expectations. E.g. some people have expressed surprise when collect doesn't reuse allocations.

And as I have mentioned in a previous comment using a factor-based heuristic doesn't help if you're intentionally doing something like vec_t.filter_map(|_| None).collect::<Vec<U>> to keep the allocation. Allocation recycling has been requested before.
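
For reference, the recycling idiom in question looks roughly like this (a sketch; the reuse is a non-guaranteed optimization, and whether it fires depends on the source and destination layouts):

// Drain a Vec<T> through an always-empty iterator and collect into a
// Vec<U>, letting in-place collection hand over the old allocation.
fn recycle<T, U>(v: Vec<T>) -> Vec<U> {
    v.into_iter().filter_map(|_| None).collect()
}

fn main() {
    let v: Vec<u32> = Vec::with_capacity(1024);
    let recycled: Vec<u32> = recycle(v);
    // With same-size, same-alignment types the capacity typically carries over.
    println!("cap={}", recycled.capacity());
}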

There should be a note in the docs for .collect<Vec<_>() that states that the resulting capacity can never be more than twice the length.

Vec does not guarantee that anywhere for any operation. Quite the opposite:

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. The current strategy is basic and it may prove desirable to use a non-constant growth factor. Whatever strategy is used will of course guarantee O(1) amortized push.

kevincox commented 9 months ago

Small note: I think that collect::<Vec<_>> should probably have some sort of exception for size_hint. For example if size_hint says a minimum size of 1k elements but then ends after 2 I think it is acceptable that the capacity can be 500x the actual used space. But I would document this as an exception to the general rule since this it only occurs if the implementation of size_hint is broken.
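
To make the size_hint caveat concrete, here is a hypothetical iterator (not from the issue) whose inflated hint makes collect over-allocate:

struct LyingIter {
    remaining: usize,
}

impl Iterator for LyingIter {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        if self.remaining == 0 {
            None
        } else {
            self.remaining -= 1;
            Some(0)
        }
    }

    // Claims 1000 elements even though only `remaining` will be yielded,
    // violating the size_hint contract.
    fn size_hint(&self) -> (usize, Option<usize>) {
        (1000, Some(1000))
    }
}

fn main() {
    let v: Vec<u8> = LyingIter { remaining: 2 }.collect();
    // collect trusts the lower bound when reserving, so the capacity
    // reflects the bogus hint rather than the 2 actual elements.
    println!("len={} cap={}", v.len(), v.capacity());
}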

the8472 commented 9 months ago

If your size hint is incorrect we could just panic. You're breaking an API contract. It's called "hint" because it's not required to be tight, not because you're allowed to return garbage.

kevincox commented 9 months ago

I don't think anyone is claiming that there is currently a guarantee. Personally I am claiming that there should be a guarantee because users are relying on some sort of reasonable memory usage, and it would be nice to actually promise that.

kevincox commented 9 months ago

if you're intentionally doing something like vec_t.filter_map(|_| None).collect::<Vec<U>> to keep the allocation.

Maybe this would best be provided by a dedicated API rather than this weird looking trick? This would make it explicit and obvious to readers.

the8472 commented 9 months ago

Well, there's comments here and there telling people to use just that if they want the recycling behavior... so you could say people also rely on that behavior and have different expectations.

jhorstmann commented 9 months ago

Looking at the assembly for the following two functions

pub fn test_array_u8(input: Vec<[u8; 8]>) -> Vec<u8> {
    input.into_iter().map(|x| x[0] as u8).collect()
}

pub fn test_array_u16(input: Vec<[u16; 4]>) -> Vec<u8> {
    input.into_iter().map(|x| x[0] as u8).collect()
}

It seems to me that on beta, the first one (where the element type matches the result type) works completely in place, without allocations. So here the resulting capacity being bigger would be expected.

The second function also first calculates the result in place, but then seems to do an allocation and copying of the data. So there might be a bug or missed optimization in the size calculation for this new allocation.

With rust 1.75, both look like they first allocate a new location for the result.

the8472 commented 9 months ago

That's unexpected. It should be calling into realloc, not memcpy

LegionMammal978 commented 9 months ago

The pointer not being the same does not necessarily mean the allocation hasn't been reused. The allocator can call mremap to move it to a new virtual address which has a much lower cost than actually copying the bytes or faulting in new pages. This may also depend on the allocation size. For smaller pool-based allocations it might be an actual memmove and a mremap for larger ones.

This is not really relevant, since it's impossible to reuse an allocation with a different alignment in stable Rust. <System as Allocator>::shrink() always falls back to allocate() + copy_nonoverlapping() + deallocate() if the alignments aren't exactly the same. Thus, a custom Allocator implementation would be necessary for shrink() to even plausibly change the alignment.

the8472 commented 9 months ago

Yeah I think I was testing with jemalloc.

Voultapher commented 9 months ago

I agree with others in this thread. The behavior is quite surprising to me, which is generally a bad thing for a language that advertises itself as reliable, and even more so if you plan on building system abstractions with it. I think at the very least there has to be an upper limit to the reuse; anything above 2x seems like a footgun to me, as the original post author already encountered.

The property that it will re-use allocations across different alignments also seems quite surprising to me.

jhorstmann commented 9 months ago

That's unexpected. It should be calling into realloc, not memcpy

Might be the default implementation of the global allocator for shrink. Since the alignment is different it ends in the default branch and has to allocate/copy/deallocate.

Istvan91 commented 9 months ago

Might be the default implementation of the global allocator for shrink. Since the alignment is different it ends in the default branch and has to allocate/copy/deallocate.

Which seems unnecessary if the target type is overaligned? I think this should only be necessary when the target type is under/mis-aligned. Which is never true if the target type is smaller than the source type.

LegionMammal978 commented 9 months ago

Which seems unnecessary if the target type is overaligned? I think this should only be necessary when the target type is under/mis-aligned. Which is never true if the target type is smaller than the source type.

libc allocators generally don't provide an aligned_realloc() to go alongside posix_memalign() or aligned_alloc(). Thus, any allocation with an alignment greater than the guaranteed malloc() alignment has to be manually reallocated via memcpy(). This is implemented as the realloc_fallback() function in library/std/src/sys/common/alloc.rs.
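
Sketched out, the fallback amounts to the following (a paraphrase assuming a malloc-style allocator, not the exact std source; the function name is made up):

use std::alloc::{alloc, dealloc, Layout};
use std::ptr;

// No libc call can change the alignment of an existing allocation, so an
// alignment-changing "realloc" must allocate, copy, and free instead.
unsafe fn realloc_with_new_align(
    old_ptr: *mut u8,
    old_layout: Layout,
    new_layout: Layout,
) -> *mut u8 {
    let new_ptr = alloc(new_layout);
    if !new_ptr.is_null() {
        let n = old_layout.size().min(new_layout.size());
        ptr::copy_nonoverlapping(old_ptr, new_ptr, n);
        dealloc(old_ptr, old_layout);
    }
    new_ptr
}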

matthieu-m commented 9 months ago

@the8472

First of all, I'd like to say that the core idea of this optimization is a good one. In general, mapping in place is much less costly than allocating, mapping across, then deallocating the old allocation. Avoiding a new allocation and avoiding the resulting cache traffic is good.

With that being said, just because the idea is good in general does not mean it is helpful in every case, and the case demonstrated in the blog post, where over 100x the necessary memory is retained, is definitely unhelpful.

The documentation could be improved, but:

  1. Users don't read the documentation.
  2. There's a lot of collect in the wild, written prior to this optimization, which may suddenly start exhibiting such an unhelpful behavior.

So, let's focus on expectations:

  • Is it reasonable for a user to complain if the memory wasn't reused? I'd argue YES. It seems reasonable, and I'm glad this optimization helps doing so.
  • Is it reasonable for a user to complain if the resulting Vec didn't preserve the 100x excess capacity? I'd argue NO. This is way beyond the usual Exponential Growth pattern, and usually only occurs when explicit steps are taken -- reserve, or popping elements out.

You mention that in case of alignment mismatch -- when the new alignment is less than the old -- the implementation calls mremap.

I'd suggest that in cases of extreme excess capacity -- when the new capacity is, say, 4x+ the new length -- the implementation should call shrink_to_fit. This has several benefits:

  1. In most cases of no excess allocation, the allocation is reused and untouched. Perfect.
  2. In cases of excess allocation, the allocation is still reused (with an in-place remap) and the likelihood is high that it'll be shrunk in place -- close to a no-op.
  3. In cases where an allocation has been shrunk, and needs to be grown again, chances are that realloc will be able to grow it in place: there was space before, so as long as no other allocation stole it, it's still there.
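
Applied by hand today, the heuristic would look something like this (the 4x threshold and the helper name are illustrative):

// Collect, then give back extreme excess capacity.
fn collect_bounded<T>(iter: impl Iterator<Item = T>) -> Vec<T> {
    let mut v: Vec<T> = iter.collect();
    if v.capacity() > 4 * v.len() {
        v.shrink_to_fit();
    }
    v
}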

From a number of users affected point of view, I'd expect that saving memory for the majority of users who'd get bitten by excess capacity is worth it. The tiny fraction of users for whom preserving the entire allocation is absolutely necessary can always ask for a dedicated explicit function, or create a crate in the meantime.

cuviper commented 9 months ago

I'd suggest that in case of too much extreme capacity -- when the new capacity is, say, 4x+ the new length -- the implementation should call shrink_to_fit.

I think this is reasonable. I might even go as far as "more than 2x", since that means we have more capacity than the unspecialized doubling-growth would have produced.

the8472 commented 9 months ago

Users don't read the documentation.

This is a very poor argument imo. It may be a fact, but it's not something we should bend over backwards to accommodate.

If everyone programs based on vibes then we can just delete the docs. API contracts exist for a reason because people's expectations are not universal and thus won't lead to consistent outcomes. If you code to assumptions you build on a foundation of sand. Sometimes people just assume very silly things. Sometimes they assume reasonable things that happen to be false because they conflict with other reasonable things. In many cases the docs are intentionally vague to preserve implementation flexibility.

There's a lot of collect in the wild, written prior to this optimization, which may suddenly start exhibiting such an unhelpful behavior.

The optimization has existed since 2020; it was only recently extended to a few more cases. I agree that there are some issues with the generalization, but the fundamental approach has already worked nicely for a while. Now that it's more broadly applicable someone has encountered an edge case, ok, that happens. But that doesn't mean that just because someone wrote a blog post it now must be changed.

That the system allocator doesn't support alignment-changing realloc is an issue worth considering though.

Is it reasonable for a user to complain if the resulting Vec didn't preserve the 100x excess capacity?

That depends on what counts as excess capacity. If you keep reusing a vec then there is no excess: the capacity will be filled up in the next round. So this is use-dependent.

I'd suggest that in case of too much extreme capacity -- when the new capacity is, say, 4x+ the new length -- the implementation should call shrink_to_fit.

As I have said several times, basing this on a factor (such as 4x) means that clear-and-recycle-allocation uses become impossible with this approach, even though we have told people to do it this way.

We can change that of course, but then we're just breaking a different set of users which currently aren't making noises. Just because one set is visible doesn't mean the other set doesn't exist.

oscardssmith commented 9 months ago

I think this is reasonable. I might even go as far as "more than 2x", since that means we have more capacity than the unspecialized doubling-growth would have produced.

I would recommend not tying your hands too tightly here. Julia for example uses a growth strategy that grows at ~4x for small vectors but 1.13x for big vectors. I would suggest wording it as "will not use more than a reasonable constant factor" rather than documenting a strict cap at 2x.

kevincox commented 9 months ago

I'd suggest that in case of too much extreme capacity -- when the new capacity is, say, 4x+ the new length -- the implementation should call shrink_to_fit.

As I have said several times, basing this on a factor (such as 4x) means that clear-and-recycle-allocation uses become impossible with this approach, even though we have told people to do it this way.

I think the original comment quoted is talking explicitly about .collect(). I think clear() is still expected to leave the capacity in place. I think this is most surprising on collect() because to the caller this is a "new" Vec. I wouldn't be surprised by some extra capacity (like push will over-allocate), but using hundreds of times the in-use space seems very surprising.

I think I would agree that after a collect there should be an upper bound on the capacity in relation to the length. I think that most users who care about reuse can use v.clear(); v.extend(i); or an explicit conversion API if type changes are needed.
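
That explicit reuse pattern, spelled out (a sketch; same-type reuse only):

fn main() {
    let mut v: Vec<u32> = (0..1_000).collect();
    let cap_before = v.capacity();
    v.clear();       // len = 0, capacity untouched
    v.extend(0..10); // refills the existing buffer, no reallocation
    assert_eq!(v.capacity(), cap_before);
}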

LegionMammal978 commented 9 months ago

You mention that in case of alignment mismatch -- when the new alignment is less than the old -- the implementation calls mremap.

@matthieu-m I was trying to note that this isn't really the case in practice, due to the semantics of Rust's allocator APIs. The only use of the allocator within the in_place_collect implementation itself is a call to Allocator::shrink(), which per its documentation allows decreasing the required alignment. However, in stable Rust, the only available Allocator is Global, which delegates to the registered GlobalAlloc. Since GlobalAlloc::realloc() cannot change the required alignment, the implementation of <Global as Allocator>::shrink() must fall back to creating a brand-new allocation, memcpying the data into it, and freeing the old allocation, whenever the alignment doesn't remain exactly the same.

Therefore, the underlying allocator, provided by libc or some other source, has no opportunity to internally mremap() the data when the alignment is changed, since it has no way of knowing that the allocation is the same.

This could be fixed by letting the global allocator implement the full Allocator API, instead of restricting it to the GlobalAlloc API. Then, at least for small alignments (within the guaranteed malloc() alignment), the platform-dependent system allocator would be able to utilize realloc() instead of falling back to a manual copy-based implementation. However, this would require a substantial rework of the relationship between the global allocator and the Allocator trait.
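
The asymmetry comes down to the two method signatures (excerpted as shown in std's documentation; Allocator is unstable):

// From GlobalAlloc (stable): the alignment is fixed by `layout`,
// only the size can change.
unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8;

// From Allocator (unstable): old and new layouts are both passed,
// so shrink() may in principle change the alignment.
unsafe fn shrink(&self, ptr: NonNull<u8>, old_layout: Layout, new_layout: Layout)
    -> Result<NonNull<[u8]>, AllocError>;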

kevincox commented 9 months ago

I would suggest wording it as "will not use more than a reasonable constant factor"

I would like to see an explicit factor, even if the standard library currently uses less than it. For example, even if we use 2x now I wouldn't mind documenting 10x. At least this provides an upper bound that people in situations where memory usage is very important can design against.

the8472 commented 9 months ago

@kevincox I'm specifically referring to this pattern which I have linked to in this comment

LegionMammal978 commented 9 months ago

We can change that of course, but then we're just breaking a different set of users which currently aren't making noises. Just because one set is visible doesn't mean the other set doesn't exist.

How would users be broken by not expanding this optimization as far as possible? Creating a new allocation in the different-layout case has been the behavior for every stable version of Rust! The "we have told people to use it this way" only applies to the same-layout case, unless you have another link to share.

Overall, there's an asymmetry here, between people who have lots of existing code who want collect() not to change its effective behavior by retaining massive unused capacity, and people who wish for a good way to recycle allocations for types with different layouts, regardless of how much spare capacity that creates. That way doesn't necessarily have to be the existing collect().

kevincox commented 9 months ago

I also don't think we can consider a GitHub comment as a binding promise. This is what we have the stabilization process for. I think we need an RFC process to either guarantee these reuse optimizations and when they apply or an RFC to stabilize an explicit API for this.

the8472 commented 9 months ago

I agree and I never said it's a binding promise. But it's imo still a quantum stronger than the no-promises-at-all that currently (don't) exist in the other direction, contrary to people's assumptions.

How would users be broken by not expanding this optimization as far as possible?

They would be broken if we promise a maximum-unused-capacity factor as some are asking for in this discussion.

TethysSvensson commented 9 months ago

I think it is worth pointing out that there are a few different problems we are trying to solve here. Ideally we can find a solution that works in all scenarios, but perhaps a hard tradeoff needs to be made between them.

Problem 1: Most users are happy to have the in-place optimization, because it results in faster code and lower memory usage most of the time. They would be happy if this optimization could be triggered in more cases. However, it breaks the principle of least surprise when it results in significantly higher memory usage in some scenarios.

Problem 1a: As a consequence of the change in the optimization, we might see a large regression in the memory usage of some applications in the next stable release.

Problem 2: The in-place optimization has been the recommended idiom for reusing allocations for a while, and some users are now depending on it.

Problem 2a: If we introduce a limit on the capacity, these users will see a regression.

Problem 3: When using some allocator+alignment combinations, the in-place+realloc strategy will result in significantly worse performance than the alternative.

the8472 commented 9 months ago

Problem 3: When using some allocator+alignment combinations, the in-place+realloc strategy will result in significantly worse performance than the alternative.

I've just rechecked that and @LegionMammal978 is right that it is impossible to benefit from this optimization on stable so it makes sense to revert this part. I'll post a PR.

TethysSvensson commented 9 months ago

Problem 3: When using some allocator+alignment combinations, the in-place+realloc strategy will result in significantly worse performance than the alternative.

I've just rechecked that and @LegionMammal978 is right that it is impossible to benefit from this optimization on stable so it makes sense to revert this part. I'll post a PR.

Wouldn't it be possible to work around problem 3 by replacing mem::align_of::<SRC>() < mem::align_of::<DEST>() with mem::align_of::<SRC>() != mem::align_of::<DEST>() inside the in_place_collectible function?

the8472 commented 9 months ago

That's more or less it, yes. Plus a few other lines that can be simplified.

LegionMammal978 commented 9 months ago

How would users be broken by not expanding this optimization as far as possible?

They would be broken if we promise a maximum-unused-capacity factor as some are asking for in this discussion.

This would be covered by @kevincox's actual proposal for the factor:

"The capacity will never automatically grow more than 2x (or some other number) the maximum elements that have been part of the Vec since the last explicit capacity operation."

Since .clear() doesn't change the high-water mark for the element count, it would still be perfectly safe to continue using same-layout .collect() with such a guarantee.

LegionMammal978 commented 9 months ago

Though perhaps a more operational and less problematic factor may be, "While collecting an iterator created from a Vec<T> into a Vec<U>, if the length (in elements) doesn't grow, then the resulting capacity (in elements) must be no more than N × the original capacity (in elements)," for some N. This allows same-layout .collect() to preserve the initial capacity exactly, while also avoiding the creation of massive unused capacity if the element size is decreased.

the8472 commented 9 months ago

I suppose that's a possible way to do it but it seems kind of weird to me to distinguish between "it's mostly empty because it discarded a lot of items" and "it's mostly empty because it converted to a smaller type". Because you can end up with anywhere between 0% and 100% occupancy with either approach.

let v: Vec<[u8; 128]> = vec![[0u8; 128]; 1024];
let v: Vec<u8> = v.into_iter().flatten().collect();

Changes to a smaller type but the len/capacity ratio stays the same.

LegionMammal978 commented 9 months ago

That's why I stipulated "if the length (in elements) doesn't grow"; if it does, then obviously it would be fine to give .collect() more leeway. A more explicit solution for that would be to use "N × the original capacity (in elements) + M × the number of additional elements" as the bound. It doesn't necessarily matter that the resulting occupancy might be low, only that it isn't massively lower than the original occupancy.

the8472 commented 9 months ago
LegionMammal978 commented 9 months ago

That's why my proposal considers the initial capacity? There could be a million different ways to write the recycle, and none of them would be broken by the proposal as long as they don't increase the element count.

the8472 commented 9 months ago

Oh you're trying to codify the old behavior. But then you basically can't extend allocation recycling to more (smaller) types. My goal was to say "write it this way and we'll make it more powerful over time".

If that's ruled out we might need a try_reuse_allocation method or something like that.

LegionMammal978 commented 9 months ago

But then you basically can't extend allocation recycling to more (smaller) types. My goal was to say "write it this way and we'll make it more powerful over time".

The whole problem is that "this way" is also the idiomatic way for collecting any iterator into a Vec, many existing users of which want to avoid an extremely low-occupancy result, especially when they didn't start with a low-occupancy source.

Another option would be to explicitly bound the output by occupancy, in saying (e.g.) that the output occupancy (as measured with the upper bound of the .size_hint()) must be at least half the input occupancy. In particular, to reuse the original allocation, the (upper-bound) total byte length of the resulting elements must be at least half the total byte length of the initial elements.

This allows for same-layout recycling of a cleared Vec (since both lengths are 0), same-layout recycling via filter_map(|_| None) (since FilterMap doesn't change the upper bound), flat_map recycling to preserve the total byte length (since it adjusts the upper bound accordingly, if it reports one at all), and general recycling that keeps the length but reduces the element size by up to half. But at the same time, if the input Vec has an occupancy of at least 50% (e.g., if it has been built up via .push()), then the output Vec won't have an (upper-bound) occupancy lower than 25%, which seems like a reasonable figure to me.
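
Written as a predicate, the proposed bound would be (all names hypothetical):

// Reuse is allowed only when the upper-bound output byte length is at
// least half the input byte length.
fn may_reuse(in_len: usize, in_elem_size: usize, out_upper: usize, out_elem_size: usize) -> bool {
    2 * out_upper * out_elem_size >= in_len * in_elem_size
}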

the8472 commented 9 months ago

many existing users of which want to avoid an extremely low-occupancy result

I want to contest the "many" here, since most of the time it doesn't matter, but... too many words spent on quibbling.

Another option would be to explicitly bound the output by occupancy, in saying (e.g.) that the output occupancy (as measured with the upper bound of the .size_hint()) must be at least half the input occupancy;

Interesting. Finally a use for the upper bound 😄

I think there are still some scenarios where you'd end up shrinking the vector and then afterwards unnecessarily triggering a regrowth by pushing into it... but if we're relying on realloc being cheap then the occasional regrowing would also be cheap. So as long as we can avoid unnecessary shrinkage most of the time this might be good enough.

Ok, one further complication: We also want to avoid having a call to realloc sitting in the compiled code at all (even behind a non-taken branch) where possible since this allows some iterators to be reduced to a nop. After all that's how it always was before #110353 and still is in most cases. That's why there's a whole bunch of const in the optimization, to eliminate this at compiletime. I think this can be mostly based on the EXPAND_BY/MERGE_BY and type size calculation.

LegionMammal978 commented 9 months ago

I think there are still some scenarios where you'd end up shrinking the vector and then afterwards unnecessarily triggering a regrowth by pushing into it... but if we're relying on realloc being cheap then the occasional regrowing would also be cheap. So as long as we can avoid unnecessary shrinkage most of the time this might be good enough.

Could you elaborate on these scenarios? I'd imagine that by far the most typical scenario is that "an element out is worth an element in": just because you decreased the byte size of an element doesn't mean that you'll need multiples of the previous spare capacity in elements. The alternative scenarios seem fairly niche, and an explicit capacity-preserving API for those scenarios would be far more clear in its intent than the extremely generic collect().

One exception, of course, is using flat_map() or flatten() to break each element into multiple subelements (e.g., RGB triples into components), which is why I suggested the length-ratio-based adjustment. But in that instance, keeping the full elements' worth of spare capacity still seems more like an optimization than something broadly useful; why split the vector into subelements, if you still have more data to push into it?

Ok, one further complication: We also want to avoid having a call to realloc sitting in the compiled code at all (even behind a non-taken branch) where possible since this allows some iterators to be reduced to a nop. After all that's how it always was before #110353 and still is in most cases. That's why there's a whole bunch of const in the optimization, to eliminate this at compiletime. I think this can be mostly based on the EXPAND_BY/MERGE_BY and type size calculation.

Yeah, my suggestion was meant to allow for recycling to be guaranteed at compile time in most cases where it's permitted. The EXPAND_BY (if I understand it correctly) looks like it's equivalent to the upper-bound length as a multiple of the original length, when that can be determined at compile time. So if we know that 2 * EXPAND_BY * size_of::<Out>() >= size_of::<In>(), then it's guaranteed that reusing the allocation won't violate the minimum occupancy. And then when shrinking the allocation, an output capacity of in_capacity * EXPAND_BY elements will preserve the original occupancy.
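
That compile-time guard, spelled out (a paraphrase of the discussion, not the library source; expand_by stands in for EXPAND_BY):

// If this holds, reusing the allocation can never violate the proposed
// minimum occupancy, so the decision can be made at compile time.
const fn reuse_is_always_ok<In, Out>(expand_by: usize) -> bool {
    2 * expand_by * std::mem::size_of::<Out>() >= std::mem::size_of::<In>()
}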

mplanchard commented 9 months ago

I just wanted to chime in as someone who would prefer the optimization not do any fancy heuristics to resize the allocation. Documenting that collect() will attempt to reuse the original allocation if possible seems sufficient, especially with a note about the unused capacity that can result when changing the type of the T in the Vec, recommending an explicit shrink in that case if desired.

It seems much more predictable to me, and easier both to grok and to explain, for the allocation to retain its original in-memory size, even if that increases its capacity. The behavior described in the blog post does not strike me as a bug once you understand that collect() can reuse the allocated memory.

Most of the time, we try to avoid allocations, so this is an optimization we are excited to see expand in stable.

hgomersall commented 9 months ago

I've read all the comments and wanted to add my perspective which I don't think is well covered.

When using collect(), it's not immediately apparent when an iterator and its respective collect target can use an existing allocation or not. There are obvious situations in which you think it might be the case, but as a user one always wonders whether some intermediate iterator is going to maintain the allocation. I guess it's possible to document every case: "map will always keep the allocation in such and such a situation", but it feels like the potential for confusion or misdocumentation is extensive (especially when it comes to iterators that aren't backed by any memory at all).

I basically assume that collect will not reallocate, but I've never found myself looking to optimise this particular aspect.

The point I'm getting to is that IMO collect should do something sensible in all cases, which probably means not maintaining a big allocation if the thing being collected is much smaller. If you get the previous allocation, then that's great, but it's an optimisation, not an expectation. It feels to me therefore that there is probably a really good argument to be made for an extension to the API.

skogseth commented 9 months ago

My expectation of iterators in Rust is that they are an abstraction over loops, so a function like this

// (assuming some is_prime(u32) -> bool helper)
fn primes(ints: Vec<u32>) -> Vec<u32> {
    let mut primes = Vec::new();
    for int in ints {
        if is_prime(int) {
            primes.push(int);
        }
    }
    primes
}

can be written more cleanly as this

fn primes(ints: Vec<u32>) -> Vec<u32> {
    ints.into_iter().filter(|&int| is_prime(int)).collect()
}

However, due to this optimization these two functions have different outputs for the same input; I can notice the optimization on the output of my function. In my view it then partially fails at being an optimization.

All that being said, I think there is significant leeway in the fact that users generally don't have an expectation for the exact capacity of a Vec, it's only when the difference is very big that it becomes problematic. I like the idea of shrinking the capacity based on some threshold.

the8472 commented 9 months ago

That entirely depends on how you write the loop! If you write it as a loop over &mut Vec<u32> then the capacity-preserving falls out of it naturally.
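
For instance, the in-place version can be written with retain, which keeps the original allocation by construction (is_prime as assumed in the previous comment):

fn primes(mut ints: Vec<u32>) -> Vec<u32> {
    // Filter in place over the owned Vec; no new allocation is made.
    ints.retain(|&int| is_prime(int));
    ints
}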

The "zero-cost abstraction" motto is often spelled out as "you couldn't write it better manually". And the hyper-optimized code-golf solutions of course don't contain any extra allocations when they're unnecessary and only manually shrink the buffer in the end if they know that they won't need the extra space anymore.

collect() does not know what you'll do with the resulting vector. You might extend() it further or push() into it, in which case the extra capacity will be useful.

And this is a one-way road. You can discard excess capacity with shrink_to_fit(), but you can't undiscard it if collect() were to do it automatically.