
In-place iteration results in too big allocations #120091

Closed: TethysSvensson closed this issue 5 months ago

TethysSvensson commented 9 months ago

(This bug report was inspired by this blog post https://blog.polybdenum.com/2024/01/17/identifying-the-collect-vec-memory-leak-footgun.html)

After #110353 landed, in-place iteration can reuse allocations in many more places. While this is a good thing, in some cases it can leave the destination vector with an overly large capacity.

Additionally, in some cases this will cause the destination vector to have a very large capacity even for non-shared allocations. In my opinion, this should never happen.

For an example see this code:

fn dbg_vec<T>(v: &Vec<T>) {
    println!(
        "vec data ptr={:?} len={} cap={}",
        v.as_ptr(),
        v.len(),
        v.capacity()
    );
}

fn main() {
    let v1 = (0u16..128).map(|i| [i; 1024]).collect::<Vec<_>>();
    dbg_vec(&v1);
    let v2 = v1.into_iter().map(|x| x[0] as u8).collect::<Vec<_>>();
    dbg_vec(&v2);
}

On stable this code works as expected, i.e. it produces two different vectors with reasonable lengths and capacities.

On beta however, v2 will have a capacity of 262144, even though it does not share an allocation with v1. If you remove the as u8 part, then the allocations will be shared, but the capacity is still overly large.

My suggested fix is:

Meta

I am running NixOS with rustup.

$ rustup --version --verbose
rustup 1.26.0 (1980-01-01)
info: This is the version for the rustup toolchain manager, not the rustc compiler.
info: The currently active `rustc` version is `rustc 1.77.0-nightly (6ae4cfbbb 2024-01-17)`
$ rustc +stable --version --verbose
rustc 1.75.0 (82e1608df 2023-12-21)
binary: rustc
commit-hash: 82e1608dfa6e0b5569232559e3d385fea5a93112
commit-date: 2023-12-21
host: x86_64-unknown-linux-gnu
release: 1.75.0
LLVM version: 17.0.6
$ rustc +beta --version --verbose
rustc 1.76.0-beta.5 (f732c37b4 2024-01-12)
binary: rustc
commit-hash: f732c37b4175158d3af9e2e156142ffb0bff8969
commit-date: 2024-01-12
host: x86_64-unknown-linux-gnu
release: 1.76.0-beta.5
LLVM version: 17.0.6
hgomersall commented 9 months ago

If you write it as a loop over &mut Vec<u32>, then the capacity-preserving behavior falls out of it naturally.

But the iterator chain consumes a Vec and returns a different Vec. If collect() was acting on a &mut Vec<> (or a derivative) then I think the argument would be different.

kevincox commented 9 months ago

On the other hand, if a new version of Rust extends this optimization to new code, your memory usage can balloon. This may not be technically breaking, but it is effectively breaking.

If one were being truly paranoid, they would need to call shrink_to_fit after any instance of .collect::<Vec<_>>() where the resulting Vec is expected to live for a significant period of time. Otherwise any Rust update could cause very surprising memory usage.

I don't want everyone running out to add shrink_to_fit after every collect call. So I think it makes sense to place an upper bound on the unused (yes, you may use it, but we don't know at the time) capacity that can result from a collect. This ensures that Rust has reasonable behaviour in all situations. Then those who really need high performance can explicitly reuse allocations, and those who really need low memory usage can be sure to call shrink_to_fit everywhere. But I think the defaults should behave well for the common case.
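
For illustration, a minimal sketch of such a bound (collect_bounded is a hypothetical helper and the factor of 2 is arbitrary, not anything std provides):

fn collect_bounded<T>(iter: impl Iterator<Item = T>) -> Vec<T> {
    let mut v: Vec<T> = iter.collect();
    // keep at most 2x len of capacity; shrink_to is a no-op
    // when the capacity is already within the bound
    if v.capacity() > 2 * v.len() {
        v.shrink_to(2 * v.len());
    }
    v
}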

scottmcm commented 9 months ago

Documenting that collect() will attempt to reuse the original allocation if possible seems sufficient, especially if adding a note about unused capacity when changing the type of the T in the Vec and recommending explicitly shrinking in that case if desired.

The thing I'm not a fan of about that is that it's common to collect from an unknown iterator. So just documenting this means that it's implicitly telling people that if they're collecting a passed-in iterator (perhaps after filtering it), they should probably do some conditional shrinking of the capacity afterwards, because otherwise they could accidentally be wasting an arbitrarily-high amount of memory.

It's not at all clear to me that doing that -- and thus encouraging a bunch of PRs to crates that use collect -- is a net better outcome.

The behavior described in the blog post does not strike me as a bug once you understand that collect() can reuse the allocated memory.

Regardless of whether it's a "bug" it's something that the experience with Java's substring shows is a footgun: https://stackoverflow.com/questions/14161050/java-string-substring-method-potential-memory-leak

When the goal is to use this for re-using buffers across different types, then the programmer controls the source and target Vecs, so I think giving those people something specific to make that desire clearer would work fine, and is a better trade-off than having to think about it for every collect.

"Never a complexity class more storage than writing a push loop" seems like an entirely reasonable expectation from collect, to me. (I do agree that we definitely shouldn't give any strict guarantee about it, just like we don't give strict guarantees about push capacities.)


EDIT: ...and @kevincox just said basically the same thing as we typed in parallel.

the8472 commented 9 months ago

But the iterator chain consumes a Vec and returns a different Vec.

You can see that as a method that consumes the Vec, still internally operates on &mut Vec<>, and then does into_raw_parts/from_raw_parts to do any type changes if any. In the meantime it operates on the raw storage. That's basically how the optimization works.

That's very low-level code, sure. But it's the best possible loop you could write for things that don't need extra storage.
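
For illustration, a minimal sketch of that shape (not the actual std implementation; it assumes T and U have identical size and alignment and the global allocator, and it ignores the panic-safety handling the real code needs):

use std::mem::ManuallyDrop;

unsafe fn map_in_place<T, U>(v: Vec<T>, f: impl Fn(T) -> U) -> Vec<U> {
    debug_assert_eq!(std::mem::size_of::<T>(), std::mem::size_of::<U>());
    debug_assert_eq!(std::mem::align_of::<T>(), std::mem::align_of::<U>());
    let mut v = ManuallyDrop::new(v); // we take over the allocation
    let (ptr, len, cap) = (v.as_mut_ptr(), v.len(), v.capacity());
    for i in 0..len {
        let slot = ptr.add(i);
        let mapped = f(slot.read());    // move the element out
        slot.cast::<U>().write(mapped); // write back into the same slot
    }
    // the capacity is carried over unchanged -- which is exactly the
    // behavior being discussed in this issue
    Vec::from_raw_parts(ptr.cast::<U>(), len, cap)
}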

they should probably do some conditional shrinking of the capacity afterwards, because otherwise they could accidentally be wasting an arbitrarily-high amount of memory.

That's a non-issue if you drop the vec soon, or run yet another transformation on it. It's not like the need to shrink is universal. For years nobody noticed with the existing optimization.

One person made a blog post, but didn't consider it enough of an issue to, well, open an issue. Someone else did, even though they weren't personally affected by this!

cuviper commented 9 months ago

"Never a complexity class more storage than writing a push loop" seems like an entirely reasonable expectation from collect, to me. (I do agree that we definitely shouldn't give any strict guarantee about it, just like we don't give strict guarantees about push capacities.)

+1

And if there's a concern that some allocators may not shrink in place -- a final realloc memcpy is still a reasonable tradeoff IMO.

Also, you can't really judge lack of complaints when the expanded scope of the reuse optimization hasn't reached stable yet. The people piling in here in alarm all count too! I do think it's worth taking a step back.

the8472 commented 9 months ago

The people piling in here in alarm all count too!

There are also people who like the optimization, e.g. https://github.com/rust-lang/rust/issues/120091#issuecomment-1899638455 or this comment on reddit

And I'm not sure a popularity contest is a good way to judge this since the reporting biases who is responding to the issue.

cuviper commented 9 months ago

I think there's enough doubt to justify a revert while we reconsider. The capacity shrink is a less drastic compromise, whether we set the threshold at 2x or even more generously at 4x as first suggested.

the8472 commented 9 months ago

"Never a complexity class more storage than writing a push loop" seems like an entirely reasonable expectation from collect, to me.

Note that this already is not the case on stable. And as mentioned several times in the thread, some people use that to avoid allocations even when the complexity-class difference is... unbounded (which trivially happens when len() == 0). Hence the more complex heuristics such as those suggested in https://github.com/rust-lang/rust/issues/120091#issuecomment-1899271788

matthieu-m commented 9 months ago

@cuviper I believe it would be fitting to mark this issue as regression-from-stable-to-nightly (possibly to-beta?) until it is clarified.

I don't think it's a behavior we'd want to sneak into stable without explicit acknowledgement that it is intended, and the label would help ensure it doesn't end up there by default.

cuviper commented 9 months ago

Done -- but do note that a regression label is not a promise that anything will change.

the8472 commented 9 months ago

#120116 addresses one aspect that's definitely an unintended regression. It will also make this optimization apply in fewer cases. It's unclear whether it would affect the thing that motivated the blog post, since that only contains simplified examples, not the original code that triggered this.

LegionMammal978 commented 9 months ago

"Never a complexity class more storage than writing a push loop" seems like an entirely reasonable expectation from collect, to me. (I do agree that we definitely shouldn't give any strict guarantee about it, just like we don't give strict guarantees about push capacities.)

One particular problem with that, which was hit by the post's author, is a filter() that removes most of the elements. The question is: is it fair for us to continue preserving the full capacity unconditionally (since that is the current same-layout behavior in stable Rust), under the assumption that a typical filter() will keep a significant fraction of the elements? Such an assumption seems reasonable to me, matching the behavior of Vec::retain(). And as @the8472 noted, we want to avoid conditional shrink_to_fit()s whenever possible and try to stick to compile-time information when deciding if the allocation can be recycled. That's why I suggested considering only the upper bound on the number of elements the iterator might return.

the8472 commented 9 months ago

I think the options (besides reverting) are:

  1. don't change the behavior, add documentation and release notes.
    This probably works fine in most cases (e.g. there were no maxrss blowups in rustc) but a few crates may have to add shrink_to_fit() to reduce memory footprint.
    Perhaps we could check if there's an uptick in OOMs in crater? Though tests probably aren't data-heavy enough to exercise this.
  2. always shrink if there's an excess above some fixed factor.
    As noted previously, this would break some allocation-recycling patterns that have been informally suggested to users and that currently work on stable.
    This also pessimizes users who keep extending/pushing into the vec after it has been collected.
    Additionally having a realloc always sitting in the in-place code may affect optimizations. On stable it doesn't exist at all and on nightly it's often removed via some const conditions.
  3. shrink based on heuristics that accommodate those existing allocation-recycling patterns while not leaving excess in the additional cases that were enabled by #110353
    This is a bit weird since it's essentially codifying the distinction between legacy behavior vs. the additional optimizations.
    The upside is that cases that have not caused issues in previous releases will keep working as they are.

And separately we could add a clear_and_try_recycle_allocation(self: Vec<T>) -> Vec<U> method, so that people have to rely less on complicated optimizations.

cuviper commented 9 months ago

I definitely think we should be exploring APIs to explicitly meet the needs of those informal suggestions.

cuviper commented 9 months ago

2. always shrink if there's an excess above some fixed factor. As noted previously, this would break some allocation-recycling patterns that have been informally suggested to users and that currently work on stable. This also pessimizes users who keep extending/pushing into the vec after it has been collected.

Note that it doesn't have to be shrink_to_fit entirely -- it could be an unconditional shrink_to(X * len), keeping some excess if it exists (and no-op if it's already smaller).

the8472 commented 9 months ago

I was not implying that it would be the shrink_to_fit variant. The pessimization may also happen if you do some refilling that happens to use all the capacity that would have otherwise been retained.

It's just not possible for collect() to know what uses will happen downstream of it. They might use all of the capacity or none of it. If we add an automatic shrink, the user can't opt out of it, because the cost will already have been paid at that point.

kevincox commented 9 months ago

Such an assumption seems reasonable to me, matching the behavior of Vec::retain().

Personally, my expectations of retain are very different from collect. retain takes a &mut self, so I see it as the same object and expect the capacity to be unchanged. With collect it seems like a new object is created. The docs don't explicitly say it creates a new object but imply it with phrases like "create instances" and "Using collect() to make a String".

For example, it is very surprising that calling this function with different types can use hugely different amounts of memory for the return value. Note that the function that creates the vector has no way of knowing whether it will return something with lots of extra capacity.

fn test(i: impl IntoIterator<Item=u32>) -> Vec<u32> {
    i.into_iter().filter(|_| false).collect()
}

fn main() {
    eprintln!("vec: {}", test(vec![1; 1000]).capacity()); // 1000
    eprintln!("vec: {}", test([1; 1000]).capacity()); // 0
}


Personally, my mental model of collect::<Vec<_>>() looks something like the code below. I expect that this is similar to many users' expectations (or maybe they don't even consider the upfront reserve).

fn my_collect<T>(i: impl Iterator<Item=T>) -> Vec<T> {
    let mut out = Vec::with_capacity(i.size_hint().0);
    for e in i {
        out.push(e);
    }
    out
}

And I appreciate that it is nice to add allocation reuse here, even if it is user-perceptible. But as others have agreed, holding on to "arbitrary" amounts of extra memory is very surprising.

My preferred outcome would be:

  1. Add a general rule for Vec that automatic allocations will use at most some constant factor of the in-use memory.
  2. Update collect to shrink_to_fit if that rule would be violated due to allocation reuse.
  3. Add an explicit API for attempting to reuse an allocation to a new Vec with a potentially different type.

I think all of these are quite important changes:

  1. This makes the implicit promises that users rely on well-defined, so that everyone can be in agreement.
  2. Avoids surprising memory usage by "new" Vec.
  3. Allows users who want to reuse allocations to do so in a reliable and explicit way.

I think one potentially missing thing would be how to handle reliable allocation reuse while changing the type. IIUC this can really only be done if the new type is <= the size of the original type. But this would need some more specific API if we wanted to enable it to be explicit.

cuviper commented 9 months ago

It's just not possible for collect() to know what uses will happen downstream of it. They might use all of the capacity or none of it. If we add an automatism the user can't opt out of it because the cost will already have been paid at that point.

Right, we can't know -- but that's why I think this optimization should tread closer to as-if behavior. Limiting to 2 * len is roughly that, but actually push's growth would have a capacity more like len.next_power_of_two(). (Unless we consider exact-size with_capacity results, then it would be shrink_to_fit.)
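
For reference, a quick check of the capacity a plain push loop currently produces (the growth strategy is an implementation detail of today's std, not a guarantee):

fn main() {
    let mut v = Vec::new();
    for i in 0..100u32 {
        v.push(i);
    }
    // amortized doubling lands on the next power of two
    assert_eq!(v.capacity(), 100usize.next_power_of_two()); // 128
}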

kevincox commented 9 months ago

I think the "as-if" comparison is a great one! I believe that we should keep the behaviour similar to what is "expected" by the simple model of these combinators. Managing to perform clever optimizations is great when they are all upsides. But when clever optimizations accidentally bite you, it really sucks. So I'd rather make sure that the optimizations are improvements in the vast majority of use cases. If we have optimizations that may be helpful or may be hurtful, it is probably better to allow the user to opt in explicitly, even if the benefit is generally positive. Otherwise seemingly benign changes to the code, like replacing a for x in i { vec.push(x) } with a collect, will have surprising consequences.

I get that this is definitely a grey area, because it is impossible to define what is expected by users and because Vec capacity is not exactly defined. However, there are some expectations here that users reasonably depend on (such as push not allocating 1000x more capacity than len), and I think the attention this ticket has received shows that even if this is a rare case, it is surprising to many users. So I would default to the unsurprising case, even if it isn't optimal performance in other cases.

matthieu-m commented 9 months ago

On the topic of explicit APIs, I was thinking that a collect_with_capacity may be missing from the APIs here.

Many collections today allow hinting at the capacity and reserving at least that much. That is, if I know I will be further using a Vec or HashMap I collect into, I'd like to be able to hint the capacity I guess I am going to need, and depending on the use case be conservative or aggressive in the estimate.

Today, this is possible in two steps to ensure the target capacity:

let mut v = Vec::with_capacity(x);
v.extend(some_iterator);

A minor issue is that it is two steps, which leads to the major issue that allocation reuse is not possible.

A collect_with_capacity API would, however, address both issues.

Further, because the target capacity is known in advance, inlining collect_with_capacity allows the compiler to optimize the selection of which collection algorithm to use, as it knows upfront whether it's going to need to extend, shrink, or do nothing.

It does still suffer from one disadvantage: there is no guarantee it will reuse the underlying allocation. This is annoying when reuse is necessary for performance reasons.
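
As a sketch of the intended semantics (collect_with_capacity is the proposed name, not an existing API), here is the two-step version folded into a single call; a real implementation would additionally attempt allocation reuse, which this naive version cannot do:

fn collect_with_capacity<T>(iter: impl Iterator<Item = T>, cap: usize) -> Vec<T> {
    // reserve the hinted capacity up front, then fill from the iterator;
    // the result has capacity of at least `cap`
    let mut v = Vec::with_capacity(cap);
    v.extend(iter);
    v
}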

hgomersall commented 9 months ago

On the topic of explicit APIs, I was thinking that a collect_with_capacity may be missing from the APIs here.

What would that do if the target type doesn't have a concept of capacity? Would this be based on a new trait like FromIteratorWithCapacity?

Edit: I guess I'm still trying to work out whether the problem is a Vec problem, or something more general. Can we allow downstream types to make use of the memory, or is that just too far out of scope?

Istvan91 commented 9 months ago

I am not sure if I'm following the whole collect_with_capacity concept:

To me there are two different variants:

  1. I know exactly what capacity I want in the target type (this is what I interpret as collect_with_capacity)
  2. I want to keep the existing allocation, because I will still be appending to the (potential) vec without knowing how many elements. (Which I would call something like collect_with_current_allocation.) Calling it capacity in this case could be really confusing if the element size is different between the original T and the target U.

=> To me it seems there are two APIs missing: one where I can define the (minimum) capacity of the target collection, and one where the original collection is kept without any shrinking attempt.

hgomersall commented 9 months ago

One where I can define the (minimum) capacity of the target collection and one where the original collection is kept without any shrinking attempt.

I think the point is that with_capacity would use the existing allocation up to capacity. It's still not explicit though. I'd prefer an API that guarantees reuse (or fails). collect_with_current_allocation would still expose the current issue.

Istvan91 commented 9 months ago

I think the point is that with_capacity would use the existing allocation up to capacity

This only works if the target collection is actually smaller than the capacity. With something like .iter().flatten().collect_with_capacity(1234) you could end up with a result that is much bigger than the specified capacity, or even bigger than the original allocation. (E.g. where you have Vec<Vec<_>>, the outer vec has just a few items, and the inner vecs a few thousand.)

collect_with_current_allocation would still expose the current issue.

except the current "issue" is not a universal issue: I was proposing collect_with_current_allocation because I have use cases where I want this behaviour: keep the current allocation wherever possible, or expand if necessary. (One of the main reasons why we are "here" today is that there was demand for this behaviour in the first place - the change leading to this issue was not done out of sheer boredom, but because people were asking for allocation recycling.)

I'd prefer an API that guarantees reuse (or fails).

So this would be yet another use case, potentially requiring yet another API ;)

EDIT: fix quoting

hgomersall commented 9 months ago

One of the main reasons why we are "here" today is that there was demand for this behaviour in the first place - the change leading to this issue was not done out of sheer boredom, but because people were asking for allocation recycling

Absolutely. It does feel to me that the problem is that there is no generic way for a type to expose its allocated buffer in such a way that a consuming type can take over that allocation. Instead, the collect case on Vec is a std special case in which the behaviour diverges from what is possible in normal user code (hence, probably, why expectations are not necessarily consistent with the behaviour).

the8472 commented 9 months ago

I'd prefer an API that guarantees reuse (or fails).

I doubt that we can expose this as an API anytime soon, or generically. In practice it only works on a few (although very important) types, what exactly works changes over time, and it relies on some unstable and very unsafe features and can't be implemented by user crates. So it's not something that really makes sense for FromIterator.

This will have to remain in the realm of non-guaranteed optimizations for a while.

A Vec::clear_and_try_recycle_allocation would be different because it's far less generic and, at least for A=Global, can already be implemented on stable with just a bit of unsafe.
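
A minimal sketch of what that could look like on stable (hypothetical code, restricted to the case where U matches T's size and alignment so the allocation's layout is unchanged; assumes the global allocator):

use std::mem;

fn clear_and_try_recycle_allocation<T, U>(mut v: Vec<T>) -> Vec<U> {
    // sketch only: a real method could be more permissive, but with the
    // global allocator the layout passed to dealloc must match the one
    // used to allocate, so we require identical size and alignment
    assert_eq!(mem::size_of::<T>(), mem::size_of::<U>());
    assert_eq!(mem::align_of::<T>(), mem::align_of::<U>());
    v.clear(); // drop the old elements, keep the buffer
    let (ptr, cap) = (v.as_mut_ptr(), v.capacity());
    mem::forget(v); // don't free the buffer we are about to hand back out
    // SAFETY: ptr and cap come from a Vec, len is 0, and T and U share
    // size and alignment, so the allocation layout is identical.
    unsafe { Vec::from_raw_parts(ptr.cast::<U>(), 0, cap) }
}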

Storyyeller commented 9 months ago

It's unclear whether it would affect the thing that motivated the blog post, since that only contains simplified examples, not the original code that triggered this.

It was (u64, u128) -> (u32, u32), so requiring matching alignments would have prevented the optimization in my case.

the8472 commented 8 months ago

This was discussed in a libs meeting, but no conclusion was reached on whether the range of possible behaviors should be narrowed. We did agree that it should at least be documented (#120355) and that the recent changes warrant a release note (#120004).

Additionally, #120116 removes the cases in which the attempted optimization would never have been useful on stable, which also caused the regression that led to the blog post. After the removal there are still some additional cases that were not optimized on previous stable releases. We're going to let those propagate to stable and wait for feedback. The assumption is that only very few crates will need to apply shrinking.

the8472 commented 5 months ago

Documentation has been added, the case where the optimization wasn't effective has been removed, and there have been no further reports, so I'll close this for now.