rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.35k stars 12.72k forks source link

Vec's reallocation strategy needs settling #29931

Open Gankra opened 8 years ago

Gankra commented 8 years ago

Background

Currently, Vec's documentation frequently contains the caveat that "more space may be reserved than requested". This is primarily in response to the fact that jemalloc (or any other allocator) can actually reserve more space than you requested because it relies on fixed size-classes to more effeciently dole out memory. (see the table here)

Jemalloc itself exposes a malloc_usable_size function which can be used to determine how much capacity was actually allocated for a pointer, as well as nallocx which can be used to determine how much capacity will be allocated for a given size and alignment. Vec can in principle query one of these methods and update its capacity field with the result.

The question at hand is: _is this ever worth it to check usable_capacity, and is it ever undesirable?_

This issue was kicked off by Firefox devs who have experience doing exactly this optimization, and claim it being profitable. Facebook has also claimed excellent dividends in making its allocation strategy more jemalloc friendly, but to my knowledge do not actually query usable_capacity.

Currently our alloction strategy is almost completely naive. One can see the source here, which boils down to:

let new_cap = if self.cap == 0 { 
    4
} else { 
    2 * self.cap 
}

To the best of my knowledge, this is all the capacity logic for FBVector, which boils down to:

let new cap = if self.cap == 0 {
    max(64 / size_of_elem, 1)   // empty
} else if self.cap < 4096 / size_of_elem) {
    self.cap * 2                // small
} else if self.cap > 4096 * 32 / size_of_elem) {
    self.cap * 2                // huge
} else {
    (self.cap * 3 + 1) / 2      // moderate
}

The only major deviation from Rust today being a 1.5 growth factor for "moderate" sized allocations. Note that there is a path in their small_vector type that queries malloc_usable_size.

Unfortunately I've been unable to find good information on what Firefox does here. The best I could find is this discussion which demonstrates big wins -- 6.5% less calls to malloc and 11% less memory usage. However the effort seems to peter out when it is asserted that this is obsoleted by Gecko just using power-of-two capacities. mxr and dxr seem to suggest it's only used for statistics.

Hopefully Firefox and Facebook devs can chime in on any experiences.

Going Forward

I have several outstanding concerns before we push further on this matter.

Rust is not C++, so I am partially suspicious of any benchmarks that demonstrate value in C++. In particular, the proliferation of size_hint and extend could completely blast away any need to be clever with capacities for most programs. I would also expect a large Rust application to frob the allocator less in general just because we default to move semantics, and don't have implicit assignment/copy constructors. Also, Rust programs are much more "reckless" with just passing around pointers into buffers because the borrow checker will always catch misuse. Slices in particular make this incredibly ergonomic. We need Rust-specific benchmarks. I'm sure the Hyper, Servo, Glium, Serde, and Rust devs all have some interesting workloads to look at.

Rust's Vec is basically the community's malloc/calloc, because actual malloc is unstable and unsafe. We explicitly support extracting the Vec's pointer and taking control of the allocation. In that regard I believe it is desirable for it to be maximally well-behaved and performant for "good" cases (knowing exactly the capacity you want, having no use for extra capacity). There's a certain value in requesting a certain capacity, and getting the capacity. Both nallocx and malloc_usable_size are virtual function calls with non-trivial logic, and may have unacceptable overhead for responsible users of Vec.

Note that anything we do must enable Vec and Box<[T]> to reliably roundtrip through each other without reallocating in certain cases. If I invoke shrink_to_fit or with_capacity, it ought to not reallocate when I try to convert to a Box<[T]>. As far as I can tell, this should be possible to uphold even when using malloc_usable_size because jemalloc is "fuzzy" and only requires that the given size is somewhere between the requested one and the usable one.

Anything we do must also work with allocators that aren't jemalloc. This may be as simple as setting usable_size = requested_size for every allocator that isn't jemalloc.

CC @pnkfelix @erickt @reem @seanmonstar @tomaka @SimonSapin @larsberg @pcwalton @Swatinem @nnethercote

CC #29848 #29847 #27627

CC @rust-lang/libs

EDIT(workingjubilee, 2023-05-06): Originally one of the links touched on the master branch of folly instead of a permalink to a specific commit. Linkrot claimed the one to jemalloc's docs. These have been updated with a permalink and a Wayback Machine link, respectively.

aturon commented 8 years ago

cc @briansmith

jdm commented 8 years ago

cc me

briansmith commented 8 years ago

The question at hand is: is this ever worth it to check usable_capacity, and is it ever undesirable?

I guess you mean s/usable_capacity/malloc_usable_size/.

If I invoke shrink_to_fit or with_capacity, it ought to not reallocate when I try to convert to a Box<[T]>

I think that's a "nice to have" but not very important. Is real-world code doing shrink_to_fit and then into_boxed_slice? I searched the uses of into_boxed_slice and into_boxed_str in the Rust and Servo repos and it looks like they are rarely used in those repos. I think using those functions is a code smell, actually.

Rust's Vec is basically the community's malloc/calloc,

I disagree with this. Vec is the data structure you use by default, and it should have good performance by default. For example, its the thing I collect iterators into by default. Some people might be using Vec to simulate malloc, but I think that's much more rare. I use with_capacity(N) because I know I'm going to be pushing at least N elements into the vector and I don't want to be wasteful; I'm noever using it because I want to create a boxed slice with N elements in it. It makes sense to have a separate interface for people that want a "safe" malloc; perhaps that separate interface would just be additional stuff added to Vec. But I think the behavior of the common (existing) interface should be geared towards more naive uses.

SimonSapin commented 8 years ago

Rust's Vec is basically the community's malloc/calloc

That’s a hack to work around the lack of proper stable API, not how it should be.

bluss commented 8 years ago

Regardless of what we want to do now, we need to remove the "guarantee" in the nightly docs about capacity, which locks us in. We have no reason to promise all kinds of things about Vec internals.

Ms2ger commented 8 years ago

CC @glandium

Swatinem commented 8 years ago

Maybe instead of actually using usable_size (which has a non-zero cost as you mentioned), the better thing to do would be to just round up to the nearest power-of-two (in bytes) for small allocations and to the nearest MiB for large ones, like mozillas nsTArray does.

I have also looked a bit at the code, and at least VecDeque rounds up to the nearest pot (in elements), probably because it does a + 1 which is like the most wasteful thing to do related to this issue. The docs for VecDeque btw say it reserves space for "at least" that many elements, not exactly.

Btw, just grepping through the rustc code for with_capacity reveals a few cases where it does a + 1 as well.

Gankra commented 8 years ago

VecDeque needs to maintain its power of two capacity in order to support efficient modular arithmetic (just a bit mask). HashMap does the same.

huonw commented 8 years ago

FWIW, one can use constant-division tricks to do (relatively) efficient modular arithmetic, if the bit pattern is fixed (e.g. if the capacities are guaranteed to be either 1 << n or 3 << n). Of course, branching and then doing the multiplications/shifts for the latter case definitely isn't free.

(I believe x % (3 << n) can be computed via something like:

x - ((x * C) >> (n + 1)) * (3 << n)

for some constant C.)

Gankra commented 8 years ago

Oh and they both keep some slack capacity for similar perf reasons. Hashmap will only use 91% of its cap because it keeps access times down. VecDeque keeps an empty slot so that the "two indices" representation can differentiate between the empty and full state without us maintaining some flag that constantly needs to be checked/maintained.

I would also like to slightly dissent on the point of Vec being a good malloc. It's an excellent memory-safe and exception-safe interface for acquiring, initializing, deinitializing, and releasing an allocation. It's a bad choice for lower-level uses (all the other collections), of course.

arthurprs commented 8 years ago

It's important to note that jemalloc isn't the only allocation that exhibits this behavior (having some slack), actually many do.

I assumed we had the trio: reserve, reserve_exact and shrink_to_fit; exactly in order to allow reserve to be highly optimized and do smart reallocation, like in this case.

larsberg commented 8 years ago

I think you've got the wrong larsberg. I believe you want @larsbergstrom

kajalsinha commented 8 years ago

hi,

I am learning rust and when I run the attached programs with parameter 100 million in Point size, Rust takes much more memory in vector. Rust compiler generated program (-O) takes 4.5 GB RAM whereas clang++ generated program (-O) takes 3GB.

I tried a swift program as well which tool 3 GB around (same as C++).

The time command output for rust and c++ respectively is:

[Rust 1.5] MacBook-Pro:Pointrust kajal$ time cargo run --release 100000000 Running target/release/pointrust 100000000 -13091672652, 50607436937

real 0m25.189s user 0m15.511s sys 0m9.144s

[C++ compiled using clang++ with -O flag] MacBook-Pro:somedir kajal$ time ./a.out 100000000 Average is 1073876718, 1073753699

real 0m21.816s user 0m18.231s sys 0m3.167s

Queries:

  1. Why sys time is more in rust and what it indicates?
  2. Can this 1.5 times more memory allocation be reduced to same as C++.

pointrust.zip PointTestCpp.cpp.zip

ticki commented 8 years ago

@kajalsinha You might want to specify the capacity when you create the vector (see with_capacity).

kajalsinha commented 8 years ago

The memory consumption remains 1.5 times even with_capacity

hexsel commented 8 years ago

@kajalsinha your C++ program is using long, Rust is using f64, right?

According to google, a long may be 32 bits, I expect f64 is 64 bits. If this is the case, you should be careful with any comparisons, integers are usually faster than floating point even for same size.

Caveat: I was never a C++ programmer, and google is occasionally wrong.

notriddle commented 8 years ago

The C/C++ standards define long as a number with no less capacity than an int and no more capacity than a long long. That's it. On Mac OS X (which @kajalsinha is using, based on the console paste), it's one word long: 32 bits on 32-bit OSX and 64 bits on 64-bit OSX. For a fair comparison with Rust, you should use an isize.

The fact that you're using integers in C and floating point numbers in Rust also accounts for the performance difference.

Storyyeller commented 8 years ago

What is the point of even having reserve_exact(), if it provides no additional guarentees over reserve()?

ticki commented 8 years ago

@Storyyeller reserve guarantees that cap is greater than or equal to the specified amount. reserve_exact guarantees that cap is equal to the specified amount, that however does not mean the allocator can't give more memory (usable_size) back.

ollie27 commented 8 years ago

that however does not mean the allocator can't give more memory (usable_size) back.

@ticki what does that mean if it doesn't mean that the capacity can be greater than requested?

I believe that reserve and reserve_exact make exactly the same guarantee, namely that you can push the specified number of elements onto the Vec without any reallocation. The only difference is the strategy for reallocation. reserve might try to reserve more space than requested to avoid reallocations if you push more elements than you requested. reserve_exact will only try to reserve enough space for the extra elements. This is useful to reduce memory usage if you know exactly how many more elements you need to push or at least have an upper bound.

ticki commented 8 years ago

what does that mean if it doesn't mean that the capacity can be greater than requested?

To make bookkeeping of the memory more smooth, most allocators has some fixed sizes you can allocate from. If the given value is not one of these, it will round up to the nearest fitting size.

ollie27 commented 8 years ago

Right, but that doesn't mean that "reserve_exact guarantees that cap is equal to the specified amount". In fact the docs say that that is not the case:

Note that the allocator may give the collection more space than it requests. Therefore capacity can not be relied upon to be precisely minimal.

ticki commented 8 years ago

@ollie27 No, the point is that the capacity is updated with the (uncanonicalised) memory returned by the allocator.

Mark-Simulacrum commented 7 years ago

Closing. If we want to pursue this, a discussion (potentially leading to PRs) on internals.rust-lang.org would be best.

Mark-Simulacrum commented 7 years ago

Actually, I'm going to reopen. This is something that has potential performance impacts and we should probably keep tracking.

gnzlbg commented 7 years ago

I don't see the growth factor being discussed much here.

@Gankro mentions FBVector and its growth factor of 1.5. A factor of 1.5 allows reusing previously freed memory after 4 reallocations, 1.45 allows memory reuse after 3 reallocations, and 1.3 allows reuse after 2 reallocations. Which one is better is application dependent, although some claim that 1.5 is close to optimal.

What seems clear is that a growth factor of 2, that is, doubling the size of the vector on reallocation, is the worst possible growth factor that can be chosen because it never allows the vector to reuse any previously freed memory. That is, for a vector that reallocates multiple times, a growth factor of 2 means that the memory freed from these reallocations will never be usable for a future allocation of the vector.

Since Rust currently uses a growth factor of 2, a small incremental step could be to try a smaller growth factor and see if that improves memory usage. Any one will do, although 1.3, 1.45 or 1.5 make sense to me.


From the fbvector docs:

Of course, the above makes a number of simplifying assumptions about how the memory allocator works, but definitely you don't want to choose the theoretically absolute worst growth factor (2). fbvector uses a growth factor of 1.5. That does not impede good performance at small sizes because of the way fbvector cooperates with jemalloc (below).

steveklabnik commented 7 years ago

Is this even something we can change at this point?

gnzlbg commented 7 years ago

Also, I don't know if the code of fbvector.h has changed much, but the part cited by @Gankro now looks like this:

  size_type computePushBackCapacity() const {
    if (capacity() == 0) {
      return std::max(64 / sizeof(T), size_type(1));
    }
    if (capacity() < folly::jemallocMinInPlaceExpandable / sizeof(T)) {
      return capacity() * 2;
    }
    if (capacity() > 4096 * 32 / sizeof(T)) {
      return capacity() * 2;
    }
    return (capacity() * 3 + 1) / 2;
  }

It hasn't changed much, but note that this function only computes the "increased capacity on push back". This is then used in:

  size_type computeInsertCapacity(size_type n) {
    size_type nc = std::max(computePushBackCapacity(), size() + n);
    size_type ac = folly::goodMallocSize(nc * sizeof(T)) / sizeof(T);
    return ac;
  }

and in emplace_back:

template <typename T, typename Allocator>
template <class... Args>
void fbvector<T, Allocator>::emplace_back_aux(Args&&... args) {
  size_type byte_sz = folly::goodMallocSize(
    computePushBackCapacity() * sizeof(T));
   // ...
}

to predict the future capacity, but goodMallocSize has the last say on this. ~I can't find the source of goodMallocSize, so I don't know if it calls nallocx, usable_capacity, or does something else.~ EDIT, found the source of goodMallocSize:

inline size_t goodMallocSize(size_t minSize) noexcept {
  if (minSize == 0) {
    return 0;
  }

  if (!usingJEMalloc()) {
    // Not using jemalloc - no smarts
    return minSize;
  }

  return nallocx(minSize, 0);
}

// We always request "good" sizes for allocation, so jemalloc can
// never grow in place small blocks; they're already occupied to the
// brim.  Blocks larger than or equal to 4096 bytes can in fact be
// expanded in place, and this constant reflects that.
static const size_t jemallocMinInPlaceExpandable = 4096;

So currently, fbvector always calls nallocx. @Gankro thoughts?


@steveklabnik neither RawVec nor RawVec::double are stabilized, and std::Vec documentation says:

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. The current strategy is basic and it may prove desirable to use a non-constant growth factor. Whatever strategy is used will of course guarantee O(1) amortized push.

So as long as we still guarantee O(1) amortized push we can continue to improve this.


EDIT: IMO it makes sense to rename double to grow to indicate that we grow the RawVec, and not necessarily double it, and then to switch the default capacity to 1.5, measure, use a better heuristic for small and large allocations while leaving the 1.5 heurisitc for medium sized allocations, measure, pass alignment and capacity information to the allocator to check how much memory we actually get and try to reuse most of that, measure.

Gankra commented 7 years ago

Yes this will always be freely changeable. Someone just needs to put in the legwork to prove it's actually profitable. Keeping in mind many platforms don't use jemalloc, and that number will probably only be going down with time.

nagisa commented 7 years ago

Would certainly be more plausible to do it on a per-platform basis. Figure out optimal, in some sense, growth factors for T1 platforms and just decide on a sane default for the rest.

On Sep 9, 2017 11:57 PM, "Alexis Beingessner" notifications@github.com wrote:

Yes this will always be freely changeable. Someone just needs to put in the legwork to prove it's actually profitable. Keeping in mind many platforms don't use jemalloc, and that number will probably only be going down with time.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/29931#issuecomment-328303020, or mute the thread https://github.com/notifications/unsubscribe-auth/AApc0qYjqjODqQy5dKzJQa6NpvGnTF5fks5sgvvEgaJpZM4GlqS9 .

gnzlbg commented 7 years ago

I am working on this. If anyone interested and with experience would like to mentor me please let me know.

mratsim commented 7 years ago

@gnzlbg The optimal growth factor is probably the golden ratio Φ (1.618...) as discussed here https://github.com/facebook/folly/issues/543 and in depth in this blog post.

gnzlbg commented 7 years ago

Optimal in what Sense?

Note that one property of a good growth factor is that you are able to reuse some previously freed memory.

With 1.45 this happens after 3 reallocations, with 1.5 after 4, and in this PR we have something slightly higher than 1.5. If you don't reallocate 4 times a growth factor of 1.5 doesn't allow you to reuse any memory.

The golden ratio is indeed optimal in a particular sense, but the most efficient growth ratio is always going to be application dependent. I (and many others) think 3 / 2 is a good trade off between easy to compute (no floating point arithmetic necessary), low number of reallocations necessary to be profitable, and close enough to the golden ratio.

Still, it can't be the best for all applications.

On Fri 10. Nov 2017 at 19:20, Mamy Ratsimbazafy notifications@github.com wrote:

@gnzlbg https://github.com/gnzlbg The optimal growth factor is probably the golden ratio Φ (1.618...) as discussed here facebook/folly#543 https://github.com/facebook/folly/issues/543 and in depth in this blog post https://crntaylor.wordpress.com/2011/07/15/optimal-memory-reallocation-and-the-golden-ratio/ .

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/29931#issuecomment-343547935, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3NprUhVqXf1bAc10r5W81vn3RBTGP2ks5s1JPdgaJpZM4GlqS9 .

nnethercote commented 6 years ago

I strongly advocate for a doubling strategy, as opposed to 1.5x, or the golden ratio, or any other number.

As I understand them, the arguments in favour of smaller values are based on a mistaken assumption. Let's assume we use doubling and start at 8 bytes, and then go 16, 32, 64, 128, 256, 512, 1024. For the 8..512 allocations we will have used 1016 bytes, and so we can't reuse those bytes for the 1024 bytes. With a 1.5x growth factor we would theoretically be able to reuse and combine multiple previous freed allocations to satisfy subsequent requests.

But in practice we can't, because the above argument assumes in the 1.5x case that these smaller allocations may be laid out in such a way that they can be combined into larger allocations, which is almost certainly not true. jemalloc rounds up requests to various size classes, and allocations of different size classes are put into different runs, where runs are measured in pages; so it is guaranteed that not a single one of those allocations in the 8..512 sequence will be adjacent, and there will be no chance of combining them for such reuse. Other allocators may behave differently, but it would be a very unusual allocator that would lay out such sequences contiguously.

So with that mistaken argument out of the way, we can see the arguments in favour of 2x: it's super simple and it's super fast to compute.

gnzlbg commented 6 years ago

But in practice we can't, because the above argument assumes in the 1.5x case that these smaller allocations may be laid out in such a way that they can be combined into larger allocations, which is almost certainly not true. jemalloc rounds up requests to various size classes, and allocations of different size classes are put into different runs, where runs are measured in pages; so it is guaranteed that not a single one of those allocations in the 8..512 sequence will be adjacent,

Which is why the proposed strategy doubles up the memory up to the page size [*], and then switches to 1.5x.

we can see the arguments in favour of 2x: it's super simple and it's super fast to compute.

A better growth-factor can 1) reduce the number of jemalloc calls, 2) reduce the time spent in these calls, and 3) reduce memory usage. These all can have a measurable performance impact.

But whether computing the growth factor takes 1 cycle or 20 is irrelevant compared to the cost of a single jemalloc call: nallocx alone is already over 50 cycles.

[*] Which is arguably too simple. We could probably do better than this by rounding to the next power of two, so that instead of copying the memory on growth within the same memory pool (such that the older memory can't be reused) jemalloc would instead copy the memory into the next pool freeing the old memory.

glandium commented 6 years ago

Above the page size, the assumption is still bogus, at least in the case of jemalloc: If you have a 4K allocation and realloc() it to 8K, and there are 8K available directly next to the original 4K, jemalloc is not going to allocate the 8K, copy the data and free the original 4K: it's just going to expand the original 4K to 8K, because there's 4K free right after it. But this assumes you're using realloc, which I guess Vec does. Now, the big difference with C++ vectors is that C++ vectors may semantically need an actual malloc/copy/free sequence, where copy uses the copy constructor for the type stored in the vector. And in that case, the assumption does apply. Now, the real argument is that past a certain size, doubling can be considered too much memory overhead.

gnzlbg commented 6 years ago

Above the page size, the assumption is still bogus, at least in the case of jemalloc:

Yes, this is why the proposed strategy only uses 1.5x for medium sized vectors and switches the strategy again for "large" vectors (>= 32 * 4096 bytes), currently back to doubling.

But this assumes you're using realloc, which I guess Vec does.

Yes, Vec currently uses realloc but with a growth factor of 2 which means that this never happens. There is a PR that uses 1.5x for medium sized vectors and switches from realloc to realloc_excess, which in practice means that medium sized vectors will always be using a multiple of the page size.

Now, the real argument is that past a certain size, doubling can be considered too much memory overhead.

I think that we might be able to do better than doubling for "large" vectors by tweaking the growth factor, but not much better. The PR switches back to doubling here because this is what we are currently doing. Is there any literature or recommendations from jemalloc / malloc about this?

Anyways my plan for large vectors is not to tweak the growth factor between choosing either 1.5 or 2. The main problem with "very large" vectors is the cost of allocating new memory, copying memory (they are very large), and freeing the old memory.

On any of the major 64 bit platforms (Linux, Windows, MacOs, ...) we can reduce the cost of this operations to ~0 for very large vectors by, e.g., once a vector becomes "very large" allocating 4 Gb of virtual memory (e.g. using VirtualAlloc on Windows, malloc or mmap /dev/zero on Linux, ...). That way the vector can keep growing in place, using only as much memory as it needs (as long as it doesn't touch a memory page it will not use it), without ever reallocating, copying memory, invalidating pointers, ...

I think that this will have a much larger impact on performance than tweaking the growth factor.

glandium commented 6 years ago

Actually, C++ vectors always need an actual malloc/copy/free sequence, because

Yes, Vec currently uses realloc but with a growth factor of 2 which means that this never happens.

There is no reason for this not to happen with jemalloc, even with larger factors, for sizes between one page and the maximum size for large allocations (chunk size minus a few pages). Except, obviously, if the memory following the current buffer is not free, in which case the original assumption doesn't apply anyways.

gnzlbg commented 6 years ago

Actually, C++ vectors always need an actual malloc/copy/free sequence, because

I don't follow the point you are trying to make. Yes, C++ vectors need to do this, but Rust's Vec does not, or what am I missing?

There is no reason for this not to happen with jemalloc, even with larger factors, for sizes between one page and the maximum size for large allocations (chunk size minus a few pages).

Maybe we are talking past each other and you meant something different, but:

Suppose you have a contiguous chunk of the address space that is free (before and after this chunk the memory is not free), and that fits the initial content of a vector plus 4x vector regrowths.

Now suppose that after the 4th regrowth, the first part of the buffer is still free.

The difference between a growth-factor of 1.5x and 2x is that the vector with a growth-factor of 1.5x can grow a fifth time inside that buffer while the one with a growth-factor of 2x never can. This is because, for a growth-factor of 1.5, the size of the previously freed allocations in the buffer is exactly 1.5x the size of the 4th vector, while for a growth-factor of 2x, the size of the previously-freed allocations is always smaller than 2x the size of the 4th vector.

Except, obviously, if the memory following the current buffer is not free, in which case the original assumption doesn't apply anyways.

So if the memory following the current buffer is not free, anything better than the worst theoretically-possible growth-factor allows you to grow the vector one extra time before having to go look for a larger buffer.

But even if the memory following the current buffer is free, reusing the previously-freed memory reduces fragmentation, and can potentially reduce the number of jemalloc calls as well (if the free size after the 4th vector is returned as an Excess).

Or where you talking about a completely different situation?

gnzlbg commented 6 years ago

Now that we are moving to the System allocator by default, I think it makes sense to at least take a look at what the C++ vector implementations do w.r.t the growth factor on the different platforms, since it could make sense that the platforms tune their growth factors for their allocators:

libc++

System implementation on MacOSX and some *BSDs, code:

//  Precondition:  __new_size > capacity()
template <class _Tp, class _Allocator>
inline _LIBCPP_INLINE_VISIBILITY
typename vector<_Tp, _Allocator>::size_type
vector<_Tp, _Allocator>::__recommend(size_type __new_size) const
{
    const size_type __ms = max_size();
    if (__new_size > __ms)
        this->__throw_length_error();
    const size_type __cap = capacity();
    if (__cap >= __ms / 2)
        return __ms;
    return _VSTD::max<size_type>(2*__cap, __new_size);
}

The growth factor is 2. Note that on *BSDs jemalloc is the default system allocator, although FBVector is tuned for jemalloc and does a couple of things differently as mentioned above (2 for tiny and large vectors, and 1.5 for the cases in which the allocation hits jemalloc pools).

libstdc++

System implementation on Linux. The code is here:

// Called by _M_fill_insert, _M_insert_aux etc.
size_type
_M_check_len(size_type __n, const char* __s) const
{
    if (max_size() - size() < __n)
      __throw_length_error(__N(__s));

    const size_type __len = size() + std::max(size(), __n);
    return (__len < size() || __len > max_size()) ? max_size() : __len;
}

The growth factor is 2 (size + size).

MSVC

Windows. I couldn't find the code only anywhere, but all internet sources point to a factor of 1.5. This factor is also used by FBVector, Boost, etc.


Ok so what is going on here?

Well, Windows is the outlier, and that makes sense! Windows does not have overcommit, so all the memory Vec reserves is physical memory. When STL, a std lib maintainer at Microsoft, tried to changed it people complained that they preferred the lower memory consumption.

On the other hand, for Linux and MacOSX, it might well be that 2 is more tuned for the pools that their allocators use internally. For large allocations (larger than a couple of page sizes), however, it doesn't really matter because the pages won't be committed to physical memory until they are touched. Using the larger 2x factor is a problem on 32-bit systems, but it is basically free on 64-bit systems and maybe that's what libstdc++ and libc++ are tuned for?

Hell, if one wouldn't be able to disable overcommit in these systems we could just have allocations of large enough vector on 64-bit systems just jump to reserving XGb of RAM and continue with a 2x factor after that. That basically allows realloc to growth the vector in place without any memcpys.

So IMO, if we are going to change this, I'll say change this to whatever each system does. That is, 1.5 on Windows, and 2x on Linux and MacOSX. For targets that have jemalloc as a system allocator (newer Androids, *BSDs, etc.) we could consider doing what FBVector does, but IMO we should discuss that on a different issue that is specific for those systems.

leventov commented 5 years ago

Windows does not have overcommit, so all the memory Vec reserves is physical memory.

I think these are two unrelated things. Windows might not allow to overcommit, but virtual memory is still virtual. According to this article there have been some substantial improvements to how does it manage memory in Windows 8.1 and Windows Server 2012 R2, both released in 2013. The comment by STL:

Yes - but the last time I mumbled about possibly changing this, many people said they preferred the more conservative space consumption (it's not a big difference, but it's there). I'm not terribly inclined to mess with it.

Was made on Aug 30 2014 and the "last time" mentioned could well be several years before that, so likely prior to the improvements.

leventov commented 5 years ago

On the other hand, for Linux and MacOSX, it might well be that 2 is more tuned for the pools that their allocators use internally. For large allocations (larger than a couple of page sizes), however, it doesn't really matter because the pages won't be committed to physical memory until they are touched. Using the larger 2x factor is a problem on 32-bit systems, but it is basically free on 64-bit systems and maybe that's what libstdc++ and libc++ are tuned for?

The factor itself (is it 2 or a different number) couldn't be tuned for allocator alone because it doesn't take the size of the element into consideration. The capacities could always be multiples of two, but if the size of the element is 12 bytes, it's always off. Compare with FBVector's initial capacity of max(CACHE_LINE_SIZE / size_of_elem, 1).

I think it's as important not to overestimate the amount of thought put into old systems code as not to underestimate it. I might be wrong but I have an impression that it's pretty much impossible to push any changes to those policies such as vector's growth factor, hash table load factor, etc. in C++ standard library implementations because some users may depend on the existing behaviour and the priority is avoiding regressions rather than improving the average. That's probably why companies like Facebook and Google, albeit having significant power in GCC / Clang development, come along with their independent implementations. Rust still has the luxury (but probably not for long) to reconsider the old defaults and I think it should take advantage of it.

That is, I think Rust should pick up the FBVector's strategy (or something similar) on all platforms.

gnzlbg commented 5 years ago

The capacities could always be multiples of two, but if the size of the element is 12 bytes, it's always off.

This doesn't really matter much if you always extend your vector to the usable_size.

The factor itself (is it 2 or a different number) couldn't be tuned for allocator alone because it doesn't take the size of the element into consideration.

If your allocator only has even-sized or power-of-two-sized bins, a growth factor of two for a vector could make sense, even if this growth factor doesn't take the element size into account.

leventov commented 5 years ago

If your allocator only has even-sized or power-of-two-sized bins, a growth factor of two for a vector could make sense, even if this growth factor doesn't take the element size into account.

If it starts with something like FBVector's 64 / sizeof(T), yes. If no (libc++ and libstdc++ implementations that you pointed to don't, as far as I can tell, their capacity chain is always 1 -> 2 -> 4 -> ...), I don't see how does it make sense, if sizeof(T) is not a power of two.

gnzlbg commented 5 years ago

If it starts with something like FBVector's 64 / sizeof(T), yes [...] I don't see how does it make sense, if sizeof(T) is not a power of two.

It's maybe a bit counter intuitive. Alloc::usable_size will always set the capacity of the vector at BIN_SIZE / sizeof(T), so that the vector always uses at least BIN_SIZE - sizeof(T) + 1 bytes of the allocation (if BIN_SIZE % sizeof(T) == 0 then it uses BIN_SIZE bytes). As the vector grows, BIN_SIZE grows as a function of the capacity, but sizeof(T) + 1 remains constantan

That is, a Vec that uses Alloc::usable_size() grows with factor * capacity ~= factor * BIN_SIZE, so sizeof(T) does not really play a role here.

Maybe it is a bit clearer with an example, let's suppose sizeof(T) == 13 (prime) and that our allocator only has power-of-two sizes. I'll directly jump to the cases that make this obvious:

let mut v = Vec::with_capacity(8); 
// requests 8 * 13 = 104 bytes, allocator bin:  128 bytes
// 128 / 13 = 9
assert_eq!(v.capacity(), 9);

v.extend(0..10); // push 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 9 = 18 => 234 bytes
// allocator bin: 256 bytes => 256 / 13 = 19
assert_eq!(v.capacity(), 19); 

v.extend(0..10); // push another 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 19 = 38 => 494 bytes
// allocator bin: 512 bytes => 512 / 13 = 39
assert_eq!(v.capacity(), 39);

// etc.

So that is what I meant with, if your Vec uses Alloc::usable_size, the exact growth-factor that you use is less "relevant". For this particular case, you can actually exploit this knowledge to don't need a growth factor at all.

Your allocator might not have size-classes, might lack an accurate Alloc::usable_size(), or might even be a real allocator like the one FBVector assumes (with accurate Alloc::usable_size() which FBVector uses, power-of-two size classes in a particular range, ... requiring FBVector to use multiple growth factors).

But even if you don't know anything about your allocator, a power-of-two size class assumptions makes sense. Many allocators use power-of-two size classes, and many types have a power-of-two size due to alignment requirements and fit it perfectly. If your vector starts with one such element and grows, it will do so perfectly, and if it doesn't fit a bin perfectly, it will always grow into the next bin (something that 1.5x growth-factor doesn't give you). A growth factor of 2 is also dirty cheap to compute.

If you have an accurate Alloc::usable_size, then the growth factor for small and medium allocations that fall in the allocators size classes doesn't really matter. As you move to large allocations, Alloc::usable_size just rounds up to the next multiple of a page size, and a factor of 1.5 grows less but more often, while a factor of 2 grows less often but more. Which one is best on that range really depends, but whether the arguments in favor of a 1.5 factor hold on that range (e.g. being able to reuse previously freed memory when growing), really depend on your allocator, and it is unclear to me that any allocator does actually anything clever enough to be helpful here. You might as well use a factor of 2 only with the hope of having to grow one time less than if you were using a factor of 1.5.

leventov commented 5 years ago

I should have clarified that I pointed to the lack of sizeof(T) consideration in std::vector implementations solely to make it more evident that those growth policies might be not results of deep thought or super fine tuning for specific allocators, but rather arbitrary decisions that were made decades ago and which nobody was able to or willed to change since. I mean, when you look at the relative complexity of the FBVector's growth policy, you clearly see that authors did think about that. std::vector double growth policy is simple, but it's likely not simplicity with "great wisdom behind it", so the 2.0 growth factor shouldn't receive additional credit because it appears in that std::vector impls.

mzabaluev commented 4 years ago

One case that would benefit if reserve_exact set the capacity to exactly the one requested is https://github.com/tokio-rs/bytes/issues/396

workingjubilee commented 1 year ago

The "use a 1.5 growth strategy" suggestion comes up a lot. However:

In practice, changes like this have been attempted, and yielded nontrivial perf regressions in the compiler:

As there is certainly not a shortage of Vecs in the compiler, it's reasonable to believe that this may not be a worthwhile improvement, as it trades off a potentially significant amount of runtime. However if someone can provide evidence otherwise, e.g. that the compiler is somehow distinct from every other common user of Vec, or if there is some conjoined changeset that makes this into an actual improvement, this remains open to change.

Folly's size growth code seems to be similar, today, to what it was, with no exceptional difference. The last fix was to a typo. I almost find that suspicious, as the past 7 years have certainly yielded a plethora of research on allocations and their efficiency, such as mimalloc.

https://github.com/facebook/folly/blob/1e6579890fe71387c14135080fc093045ab79897/folly/FBVector.h#L1141-L1171

AaronKutch commented 6 months ago

I found this from https://users.rust-lang.org/t/reserve-truly-exact-capacity-for-vec/109751/19 . Some parts of this discussion are saying things about making reserve_exact or some part of reservation actually exact, but this is simply not possible. The Drop impl on Vec calls the drop impl on RawVec, which ultimately needs a Layout that has the same number of bytes as the allocation was created with (allocators are allowed to assume that allocation length information is being preserved, and at the same time they are allowed to return more than requested, in fact some strategies absolutely must do this, If my understanding is correct). Some field of the RawVec needs to store this information, and that ends up being the raw capacity also used for the capacity. It would be possible to do some capacity virtualization by adding another usize, but this will never happen as it impacts the performance of almost everything. Implementations simply must apply their own virtualization as needed, and they must accept that exact real capacity is not possible.

AaronKutch commented 6 months ago

What actually should be done is that reserve_exact should be deprecated, the name itself can mislead people. In fact, it lead me to an exponential capacity blowup problem in one of my crates. All other related functions need to state up front that allocations and reservations can always allocate more than requested.