Open Gankra opened 8 years ago
cc @briansmith
cc me
The question at hand is: is this ever worth it to check usable_capacity, and is it ever undesirable?
I guess you mean s/usable_capacity/malloc_usable_size/.
If I invoke shrink_to_fit or with_capacity, it ought to not reallocate when I try to convert to a Box<[T]>
I think that's a "nice to have" but not very important. Is real-world code doing shrink_to_fit
and then into_boxed_slice
? I searched the uses of into_boxed_slice
and into_boxed_str
in the Rust and Servo repos and it looks like they are rarely used in those repos. I think using those functions is a code smell, actually.
Rust's Vec is basically the community's malloc/calloc,
I disagree with this. Vec
is the data structure you use by default, and it should have good performance by default. For example, its the thing I collect
iterators into by default. Some people might be using Vec
to simulate malloc
, but I think that's much more rare. I use with_capacity(N)
because I know I'm going to be pushing at least N
elements into the vector and I don't want to be wasteful; I'm noever using it because I want to create a boxed slice with N
elements in it. It makes sense to have a separate interface for people that want a "safe" malloc; perhaps that separate interface would just be additional stuff added to Vec
. But I think the behavior of the common (existing) interface should be geared towards more naive uses.
Rust's Vec is basically the community's malloc/calloc
That’s a hack to work around the lack of proper stable API, not how it should be.
Regardless of what we want to do now, we need to remove the "guarantee" in the nightly docs about capacity, which locks us in. We have no reason to promise all kinds of things about Vec internals.
CC @glandium
Maybe instead of actually using usable_size (which has a non-zero cost as you mentioned), the better thing to do would be to just round up to the nearest power-of-two (in bytes) for small allocations and to the nearest MiB for large ones, like mozillas nsTArray does.
I have also looked a bit at the code, and at least VecDeque rounds up to the nearest pot (in elements), probably because it does a + 1
which is like the most wasteful thing to do related to this issue.
The docs for VecDeque btw say it reserves space for "at least" that many elements, not exactly.
Btw, just grepping through the rustc code for with_capacity
reveals a few cases where it does a + 1
as well.
VecDeque needs to maintain its power of two capacity in order to support efficient modular arithmetic (just a bit mask). HashMap does the same.
FWIW, one can use constant-division tricks to do (relatively) efficient modular arithmetic, if the bit pattern is fixed (e.g. if the capacities are guaranteed to be either 1 << n
or 3 << n
). Of course, branching and then doing the multiplications/shifts for the latter case definitely isn't free.
(I believe x % (3 << n)
can be computed via something like:
x - ((x * C) >> (n + 1)) * (3 << n)
for some constant C
.)
Oh and they both keep some slack capacity for similar perf reasons. Hashmap will only use 91% of its cap because it keeps access times down. VecDeque keeps an empty slot so that the "two indices" representation can differentiate between the empty and full state without us maintaining some flag that constantly needs to be checked/maintained.
I would also like to slightly dissent on the point of Vec being a good malloc. It's an excellent memory-safe and exception-safe interface for acquiring, initializing, deinitializing, and releasing an allocation. It's a bad choice for lower-level uses (all the other collections), of course.
It's important to note that jemalloc isn't the only allocation that exhibits this behavior (having some slack), actually many do.
I assumed we had the trio: reserve, reserve_exact and shrink_to_fit; exactly in order to allow reserve to be highly optimized and do smart reallocation, like in this case.
I think you've got the wrong larsberg. I believe you want @larsbergstrom
hi,
I am learning rust and when I run the attached programs with parameter 100 million in Point size, Rust takes much more memory in vector. Rust compiler generated program (-O) takes 4.5 GB RAM whereas clang++ generated program (-O) takes 3GB.
I tried a swift program as well which tool 3 GB around (same as C++).
The time command output for rust and c++ respectively is:
[Rust 1.5]
MacBook-Pro:Pointrust kajal$ time cargo run --release 100000000
Running target/release/pointrust 100000000
-13091672652, 50607436937
real 0m25.189s user 0m15.511s sys 0m9.144s
[C++ compiled using clang++ with -O flag] MacBook-Pro:somedir kajal$ time ./a.out 100000000 Average is 1073876718, 1073753699
real 0m21.816s user 0m18.231s sys 0m3.167s
Queries:
@kajalsinha You might want to specify the capacity when you create the vector (see with_capacity
).
The memory consumption remains 1.5 times even with_capacity
@kajalsinha your C++ program is using long, Rust is using f64, right?
According to google, a long may be 32 bits, I expect f64 is 64 bits. If this is the case, you should be careful with any comparisons, integers are usually faster than floating point even for same size.
Caveat: I was never a C++ programmer, and google is occasionally wrong.
The C/C++ standards define long as a number with no less capacity than an int and no more capacity than a long long. That's it. On Mac OS X (which @kajalsinha is using, based on the console paste), it's one word long: 32 bits on 32-bit OSX and 64 bits on 64-bit OSX. For a fair comparison with Rust, you should use an isize
.
The fact that you're using integers in C and floating point numbers in Rust also accounts for the performance difference.
What is the point of even having reserve_exact(), if it provides no additional guarentees over reserve()?
@Storyyeller reserve
guarantees that cap
is greater than or equal to the specified amount. reserve_exact
guarantees that cap
is equal to the specified amount, that however does not mean the allocator can't give more memory (usable_size
) back.
that however does not mean the allocator can't give more memory (usable_size) back.
@ticki what does that mean if it doesn't mean that the capacity can be greater than requested?
I believe that reserve
and reserve_exact
make exactly the same guarantee, namely that you can push the specified number of elements onto the Vec
without any reallocation. The only difference is the strategy for reallocation. reserve
might try to reserve more space than requested to avoid reallocations if you push more elements than you requested. reserve_exact
will only try to reserve enough space for the extra elements. This is useful to reduce memory usage if you know exactly how many more elements you need to push or at least have an upper bound.
what does that mean if it doesn't mean that the capacity can be greater than requested?
To make bookkeeping of the memory more smooth, most allocators has some fixed sizes you can allocate from. If the given value is not one of these, it will round up to the nearest fitting size.
Right, but that doesn't mean that "reserve_exact
guarantees that cap is equal to the specified amount". In fact the docs say that that is not the case:
Note that the allocator may give the collection more space than it requests. Therefore capacity can not be relied upon to be precisely minimal.
@ollie27 No, the point is that the capacity is updated with the (uncanonicalised) memory returned by the allocator.
Closing. If we want to pursue this, a discussion (potentially leading to PRs) on internals.rust-lang.org would be best.
Actually, I'm going to reopen. This is something that has potential performance impacts and we should probably keep tracking.
I don't see the growth factor being discussed much here.
@Gankro mentions FBVector
and its growth factor of 1.5. A factor of 1.5 allows reusing previously freed memory after 4 reallocations, 1.45 allows memory reuse after 3 reallocations, and 1.3 allows reuse after 2 reallocations. Which one is better is application dependent, although some claim that 1.5 is close to optimal.
What seems clear is that a growth factor of 2, that is, doubling the size of the vector on reallocation, is the worst possible growth factor that can be chosen because it never allows the vector to reuse any previously freed memory. That is, for a vector that reallocates multiple times, a growth factor of 2 means that the memory freed from these reallocations will never be usable for a future allocation of the vector.
Since Rust currently uses a growth factor of 2, a small incremental step could be to try a smaller growth factor and see if that improves memory usage. Any one will do, although 1.3, 1.45 or 1.5 make sense to me.
From the fbvector
docs:
Of course, the above makes a number of simplifying assumptions about how the memory allocator works, but definitely you don't want to choose the theoretically absolute worst growth factor (2). fbvector uses a growth factor of 1.5. That does not impede good performance at small sizes because of the way fbvector cooperates with jemalloc (below).
Is this even something we can change at this point?
Also, I don't know if the code of fbvector.h has changed much, but the part cited by @Gankro now looks like this:
size_type computePushBackCapacity() const {
if (capacity() == 0) {
return std::max(64 / sizeof(T), size_type(1));
}
if (capacity() < folly::jemallocMinInPlaceExpandable / sizeof(T)) {
return capacity() * 2;
}
if (capacity() > 4096 * 32 / sizeof(T)) {
return capacity() * 2;
}
return (capacity() * 3 + 1) / 2;
}
It hasn't changed much, but note that this function only computes the "increased capacity on push back". This is then used in:
size_type computeInsertCapacity(size_type n) {
size_type nc = std::max(computePushBackCapacity(), size() + n);
size_type ac = folly::goodMallocSize(nc * sizeof(T)) / sizeof(T);
return ac;
}
and in emplace_back
:
template <typename T, typename Allocator>
template <class... Args>
void fbvector<T, Allocator>::emplace_back_aux(Args&&... args) {
size_type byte_sz = folly::goodMallocSize(
computePushBackCapacity() * sizeof(T));
// ...
}
to predict the future capacity, but goodMallocSize
has the last say on this. ~I can't find the source of ~ EDIT, found the source of goodMallocSize
, so I don't know if it calls nallocx
, usable_capacity
, or does something else.goodMallocSize
:
inline size_t goodMallocSize(size_t minSize) noexcept {
if (minSize == 0) {
return 0;
}
if (!usingJEMalloc()) {
// Not using jemalloc - no smarts
return minSize;
}
return nallocx(minSize, 0);
}
// We always request "good" sizes for allocation, so jemalloc can
// never grow in place small blocks; they're already occupied to the
// brim. Blocks larger than or equal to 4096 bytes can in fact be
// expanded in place, and this constant reflects that.
static const size_t jemallocMinInPlaceExpandable = 4096;
So currently, fbvector
always calls nallocx
. @Gankro thoughts?
@steveklabnik neither RawVec
nor RawVec::double
are stabilized, and std::Vec
documentation says:
Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. The current strategy is basic and it may prove desirable to use a non-constant growth factor. Whatever strategy is used will of course guarantee O(1) amortized push.
So as long as we still guarantee O(1) amortized push we can continue to improve this.
EDIT: IMO it makes sense to rename double
to grow
to indicate that we grow the RawVec
, and not necessarily double it, and then to switch the default capacity to 1.5, measure, use a better heuristic for small and large allocations while leaving the 1.5 heurisitc for medium sized allocations, measure, pass alignment and capacity information to the allocator to check how much memory we actually get and try to reuse most of that, measure.
Yes this will always be freely changeable. Someone just needs to put in the legwork to prove it's actually profitable. Keeping in mind many platforms don't use jemalloc, and that number will probably only be going down with time.
Would certainly be more plausible to do it on a per-platform basis. Figure out optimal, in some sense, growth factors for T1 platforms and just decide on a sane default for the rest.
On Sep 9, 2017 11:57 PM, "Alexis Beingessner" notifications@github.com wrote:
Yes this will always be freely changeable. Someone just needs to put in the legwork to prove it's actually profitable. Keeping in mind many platforms don't use jemalloc, and that number will probably only be going down with time.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/29931#issuecomment-328303020, or mute the thread https://github.com/notifications/unsubscribe-auth/AApc0qYjqjODqQy5dKzJQa6NpvGnTF5fks5sgvvEgaJpZM4GlqS9 .
I am working on this. If anyone interested and with experience would like to mentor me please let me know.
@gnzlbg The optimal growth factor is probably the golden ratio Φ
(1.618...) as discussed here https://github.com/facebook/folly/issues/543 and in depth in this blog post.
Optimal in what Sense?
Note that one property of a good growth factor is that you are able to reuse some previously freed memory.
With 1.45 this happens after 3 reallocations, with 1.5 after 4, and in this PR we have something slightly higher than 1.5. If you don't reallocate 4 times a growth factor of 1.5 doesn't allow you to reuse any memory.
The golden ratio is indeed optimal in a particular sense, but the most efficient growth ratio is always going to be application dependent. I (and many others) think 3 / 2 is a good trade off between easy to compute (no floating point arithmetic necessary), low number of reallocations necessary to be profitable, and close enough to the golden ratio.
Still, it can't be the best for all applications.
On Fri 10. Nov 2017 at 19:20, Mamy Ratsimbazafy notifications@github.com wrote:
@gnzlbg https://github.com/gnzlbg The optimal growth factor is probably the golden ratio Φ (1.618...) as discussed here facebook/folly#543 https://github.com/facebook/folly/issues/543 and in depth in this blog post https://crntaylor.wordpress.com/2011/07/15/optimal-memory-reallocation-and-the-golden-ratio/ .
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/29931#issuecomment-343547935, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3NprUhVqXf1bAc10r5W81vn3RBTGP2ks5s1JPdgaJpZM4GlqS9 .
I strongly advocate for a doubling strategy, as opposed to 1.5x, or the golden ratio, or any other number.
As I understand them, the arguments in favour of smaller values are based on a mistaken assumption. Let's assume we use doubling and start at 8 bytes, and then go 16, 32, 64, 128, 256, 512, 1024. For the 8..512 allocations we will have used 1016 bytes, and so we can't reuse those bytes for the 1024 bytes. With a 1.5x growth factor we would theoretically be able to reuse and combine multiple previous freed allocations to satisfy subsequent requests.
But in practice we can't, because the above argument assumes in the 1.5x case that these smaller allocations may be laid out in such a way that they can be combined into larger allocations, which is almost certainly not true. jemalloc rounds up requests to various size classes, and allocations of different size classes are put into different runs, where runs are measured in pages; so it is guaranteed that not a single one of those allocations in the 8..512 sequence will be adjacent, and there will be no chance of combining them for such reuse. Other allocators may behave differently, but it would be a very unusual allocator that would lay out such sequences contiguously.
So with that mistaken argument out of the way, we can see the arguments in favour of 2x: it's super simple and it's super fast to compute.
But in practice we can't, because the above argument assumes in the 1.5x case that these smaller allocations may be laid out in such a way that they can be combined into larger allocations, which is almost certainly not true. jemalloc rounds up requests to various size classes, and allocations of different size classes are put into different runs, where runs are measured in pages; so it is guaranteed that not a single one of those allocations in the 8..512 sequence will be adjacent,
Which is why the proposed strategy doubles up the memory up to the page size [*], and then switches to 1.5x.
we can see the arguments in favour of 2x: it's super simple and it's super fast to compute.
A better growth-factor can 1) reduce the number of jemalloc calls, 2) reduce the time spent in these calls, and 3) reduce memory usage. These all can have a measurable performance impact.
But whether computing the growth factor takes 1 cycle or 20 is irrelevant compared to the cost of a single jemalloc call: nallocx
alone is already over 50 cycles.
[*] Which is arguably too simple. We could probably do better than this by rounding to the next power of two, so that instead of copying the memory on growth within the same memory pool (such that the older memory can't be reused) jemalloc would instead copy the memory into the next pool freeing the old memory.
Above the page size, the assumption is still bogus, at least in the case of jemalloc: If you have a 4K allocation and realloc() it to 8K, and there are 8K available directly next to the original 4K, jemalloc is not going to allocate the 8K, copy the data and free the original 4K: it's just going to expand the original 4K to 8K, because there's 4K free right after it. But this assumes you're using realloc, which I guess Vec does. Now, the big difference with C++ vectors is that C++ vectors may semantically need an actual malloc/copy/free sequence, where copy uses the copy constructor for the type stored in the vector. And in that case, the assumption does apply. Now, the real argument is that past a certain size, doubling can be considered too much memory overhead.
Above the page size, the assumption is still bogus, at least in the case of jemalloc:
Yes, this is why the proposed strategy only uses 1.5x for medium sized vectors and switches the strategy again for "large" vectors (>= 32 * 4096
bytes), currently back to doubling.
But this assumes you're using realloc, which I guess Vec does.
Yes, Vec
currently uses realloc
but with a growth factor of 2 which means that this never happens. There is a PR that uses 1.5x
for medium sized vectors and switches from realloc
to realloc_excess
, which in practice means that medium sized vectors will always be using a multiple of the page size.
Now, the real argument is that past a certain size, doubling can be considered too much memory overhead.
I think that we might be able to do better than doubling for "large" vectors by tweaking the growth factor, but not much better. The PR switches back to doubling here because this is what we are currently doing. Is there any literature or recommendations from jemalloc / malloc about this?
Anyways my plan for large vectors is not to tweak the growth factor between choosing either 1.5 or 2. The main problem with "very large" vectors is the cost of allocating new memory, copying memory (they are very large), and freeing the old memory.
On any of the major 64 bit platforms (Linux, Windows, MacOs, ...) we can reduce the cost of this operations to ~0 for very large vectors by, e.g., once a vector becomes "very large" allocating 4 Gb of virtual memory (e.g. using VirtualAlloc
on Windows, malloc
or mmap /dev/zero
on Linux, ...). That way the vector can keep growing in place, using only as much memory as it needs (as long as it doesn't touch a memory page it will not use it), without ever reallocating, copying memory, invalidating pointers, ...
I think that this will have a much larger impact on performance than tweaking the growth factor.
Actually, C++ vectors always need an actual malloc/copy/free sequence, because
Yes, Vec currently uses realloc but with a growth factor of 2 which means that this never happens.
There is no reason for this not to happen with jemalloc, even with larger factors, for sizes between one page and the maximum size for large allocations (chunk size minus a few pages). Except, obviously, if the memory following the current buffer is not free, in which case the original assumption doesn't apply anyways.
Actually, C++ vectors always need an actual malloc/copy/free sequence, because
I don't follow the point you are trying to make. Yes, C++ vectors need to do this, but Rust's Vec
does not, or what am I missing?
There is no reason for this not to happen with jemalloc, even with larger factors, for sizes between one page and the maximum size for large allocations (chunk size minus a few pages).
Maybe we are talking past each other and you meant something different, but:
Suppose you have a contiguous chunk of the address space that is free (before and after this chunk the memory is not free), and that fits the initial content of a vector plus 4x vector regrowths.
Now suppose that after the 4th regrowth, the first part of the buffer is still free.
The difference between a growth-factor of 1.5x and 2x is that the vector with a growth-factor of 1.5x can grow a fifth time inside that buffer while the one with a growth-factor of 2x never can. This is because, for a growth-factor of 1.5, the size of the previously freed allocations in the buffer is exactly 1.5x the size of the 4th vector, while for a growth-factor of 2x, the size of the previously-freed allocations is always smaller than 2x the size of the 4th vector.
Except, obviously, if the memory following the current buffer is not free, in which case the original assumption doesn't apply anyways.
So if the memory following the current buffer is not free, anything better than the worst theoretically-possible growth-factor allows you to grow the vector one extra time before having to go look for a larger buffer.
But even if the memory following the current buffer is free, reusing the previously-freed memory reduces fragmentation, and can potentially reduce the number of jemalloc calls as well (if the free size after the 4th vector is returned as an Excess
).
Or where you talking about a completely different situation?
Now that we are moving to the System allocator by default, I think it makes sense to at least take a look at what the C++ vector implementations do w.r.t the growth factor on the different platforms, since it could make sense that the platforms tune their growth factors for their allocators:
System implementation on MacOSX and some *BSDs, code:
// Precondition: __new_size > capacity()
template <class _Tp, class _Allocator>
inline _LIBCPP_INLINE_VISIBILITY
typename vector<_Tp, _Allocator>::size_type
vector<_Tp, _Allocator>::__recommend(size_type __new_size) const
{
const size_type __ms = max_size();
if (__new_size > __ms)
this->__throw_length_error();
const size_type __cap = capacity();
if (__cap >= __ms / 2)
return __ms;
return _VSTD::max<size_type>(2*__cap, __new_size);
}
The growth factor is 2. Note that on *BSDs jemalloc is the default system allocator, although FBVector is tuned for jemalloc and does a couple of things differently as mentioned above (2 for tiny and large vectors, and 1.5 for the cases in which the allocation hits jemalloc pools).
System implementation on Linux. The code is here:
// Called by _M_fill_insert, _M_insert_aux etc.
size_type
_M_check_len(size_type __n, const char* __s) const
{
if (max_size() - size() < __n)
__throw_length_error(__N(__s));
const size_type __len = size() + std::max(size(), __n);
return (__len < size() || __len > max_size()) ? max_size() : __len;
}
The growth factor is 2 (size + size).
Windows. I couldn't find the code only anywhere, but all internet sources point to a factor of 1.5. This factor is also used by FBVector, Boost, etc.
Ok so what is going on here?
Well, Windows is the outlier, and that makes sense! Windows does not have overcommit, so all the memory Vec
reserves is physical memory. When STL, a std lib maintainer at Microsoft, tried to changed it people complained that they preferred the lower memory consumption.
On the other hand, for Linux and MacOSX, it might well be that 2
is more tuned for the pools that their allocators use internally. For large allocations (larger than a couple of page sizes), however, it doesn't really matter because the pages won't be committed to physical memory until they are touched. Using the larger 2x factor is a problem on 32-bit systems, but it is basically free on 64-bit systems and maybe that's what libstdc++ and libc++ are tuned for?
Hell, if one wouldn't be able to disable overcommit in these systems we could just have allocations of large enough vector on 64-bit systems just jump to reserving XGb of RAM and continue with a 2x factor after that. That basically allows realloc to growth the vector in place without any memcpy
s.
So IMO, if we are going to change this, I'll say change this to whatever each system does. That is, 1.5 on Windows, and 2x on Linux and MacOSX. For targets that have jemalloc
as a system allocator (newer Androids, *BSDs, etc.) we could consider doing what FBVector does, but IMO we should discuss that on a different issue that is specific for those systems.
Windows does not have overcommit, so all the memory Vec reserves is physical memory.
I think these are two unrelated things. Windows might not allow to overcommit, but virtual memory is still virtual. According to this article there have been some substantial improvements to how does it manage memory in Windows 8.1 and Windows Server 2012 R2, both released in 2013. The comment by STL:
Yes - but the last time I mumbled about possibly changing this, many people said they preferred the more conservative space consumption (it's not a big difference, but it's there). I'm not terribly inclined to mess with it.
Was made on Aug 30 2014 and the "last time" mentioned could well be several years before that, so likely prior to the improvements.
On the other hand, for Linux and MacOSX, it might well be that 2 is more tuned for the pools that their allocators use internally. For large allocations (larger than a couple of page sizes), however, it doesn't really matter because the pages won't be committed to physical memory until they are touched. Using the larger 2x factor is a problem on 32-bit systems, but it is basically free on 64-bit systems and maybe that's what libstdc++ and libc++ are tuned for?
The factor itself (is it 2 or a different number) couldn't be tuned for allocator alone because it doesn't take the size of the element into consideration. The capacities could always be multiples of two, but if the size of the element is 12 bytes, it's always off. Compare with FBVector's initial capacity of max(CACHE_LINE_SIZE / size_of_elem, 1)
.
I think it's as important not to overestimate the amount of thought put into old systems code as not to underestimate it. I might be wrong but I have an impression that it's pretty much impossible to push any changes to those policies such as vector's growth factor, hash table load factor, etc. in C++ standard library implementations because some users may depend on the existing behaviour and the priority is avoiding regressions rather than improving the average. That's probably why companies like Facebook and Google, albeit having significant power in GCC / Clang development, come along with their independent implementations. Rust still has the luxury (but probably not for long) to reconsider the old defaults and I think it should take advantage of it.
That is, I think Rust should pick up the FBVector's strategy (or something similar) on all platforms.
The capacities could always be multiples of two, but if the size of the element is 12 bytes, it's always off.
This doesn't really matter much if you always extend your vector to the usable_size
.
The factor itself (is it 2 or a different number) couldn't be tuned for allocator alone because it doesn't take the size of the element into consideration.
If your allocator only has even-sized or power-of-two-sized bins, a growth factor of two for a vector could make sense, even if this growth factor doesn't take the element size into account.
If your allocator only has even-sized or power-of-two-sized bins, a growth factor of two for a vector could make sense, even if this growth factor doesn't take the element size into account.
If it starts with something like FBVector's 64 / sizeof(T)
, yes. If no (libc++ and libstdc++ implementations that you pointed to don't, as far as I can tell, their capacity chain is always 1 -> 2 -> 4 -> ...), I don't see how does it make sense, if sizeof(T)
is not a power of two.
If it starts with something like FBVector's 64 / sizeof(T), yes [...] I don't see how does it make sense, if sizeof(T) is not a power of two.
It's maybe a bit counter intuitive. Alloc::usable_size
will always set the capacity of the vector at BIN_SIZE / sizeof(T)
, so that the vector always uses at least BIN_SIZE - sizeof(T) + 1
bytes of the allocation (if BIN_SIZE % sizeof(T) == 0
then it uses BIN_SIZE
bytes). As the vector grows, BIN_SIZE
grows as a function of the capacity, but sizeof(T) + 1
remains constantan
That is, a Vec
that uses Alloc::usable_size()
grows with factor * capacity ~= factor * BIN_SIZE
, so sizeof(T)
does not really play a role here.
Maybe it is a bit clearer with an example, let's suppose sizeof(T) == 13
(prime) and that our allocator only has power-of-two sizes. I'll directly jump to the cases that make this obvious:
let mut v = Vec::with_capacity(8);
// requests 8 * 13 = 104 bytes, allocator bin: 128 bytes
// 128 / 13 = 9
assert_eq!(v.capacity(), 9);
v.extend(0..10); // push 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 9 = 18 => 234 bytes
// allocator bin: 256 bytes => 256 / 13 = 19
assert_eq!(v.capacity(), 19);
v.extend(0..10); // push another 10 elements to grow the vector
// requested capacity: 2 * capacity() = 2 * 19 = 38 => 494 bytes
// allocator bin: 512 bytes => 512 / 13 = 39
assert_eq!(v.capacity(), 39);
// etc.
So that is what I meant with, if your Vec
uses Alloc::usable_size
, the exact growth-factor that you use is less "relevant". For this particular case, you can actually exploit this knowledge to don't need a growth factor at all.
Your allocator might not have size-classes, might lack an accurate Alloc::usable_size()
, or might even be a real allocator like the one FBVector
assumes (with accurate Alloc::usable_size()
which FBVector
uses, power-of-two size classes in a particular range, ... requiring FBVector
to use multiple growth factors).
But even if you don't know anything about your allocator, a power-of-two size class assumptions makes sense. Many allocators use power-of-two size classes, and many types have a power-of-two size due to alignment requirements and fit it perfectly. If your vector starts with one such element and grows, it will do so perfectly, and if it doesn't fit a bin perfectly, it will always grow into the next bin (something that 1.5x growth-factor doesn't give you). A growth factor of 2 is also dirty cheap to compute.
If you have an accurate Alloc::usable_size
, then the growth factor for small and medium allocations that fall in the allocators size classes doesn't really matter. As you move to large allocations, Alloc::usable_size
just rounds up to the next multiple of a page size, and a factor of 1.5 grows less but more often, while a factor of 2 grows less often but more. Which one is best on that range really depends, but whether the arguments in favor of a 1.5 factor hold on that range (e.g. being able to reuse previously freed memory when growing), really depend on your allocator, and it is unclear to me that any allocator does actually anything clever enough to be helpful here. You might as well use a factor of 2
only with the hope of having to grow one time less than if you were using a factor of 1.5
.
I should have clarified that I pointed to the lack of sizeof(T)
consideration in std::vector
implementations solely to make it more evident that those growth policies might be not results of deep thought or super fine tuning for specific allocators, but rather arbitrary decisions that were made decades ago and which nobody was able to or willed to change since. I mean, when you look at the relative complexity of the FBVector
's growth policy, you clearly see that authors did think about that. std::vector
double growth policy is simple, but it's likely not simplicity with "great wisdom behind it", so the 2.0 growth factor shouldn't receive additional credit because it appears in that std::vector
impls.
One case that would benefit if reserve_exact
set the capacity to exactly the one requested is https://github.com/tokio-rs/bytes/issues/396
The "use a 1.5 growth strategy" suggestion comes up a lot. However:
In practice, changes like this have been attempted, and yielded nontrivial perf regressions in the compiler:
As there is certainly not a shortage of Vecs in the compiler, it's reasonable to believe that this may not be a worthwhile improvement, as it trades off a potentially significant amount of runtime. However if someone can provide evidence otherwise, e.g. that the compiler is somehow distinct from every other common user of Vec, or if there is some conjoined changeset that makes this into an actual improvement, this remains open to change.
Folly's size growth code seems to be similar, today, to what it was, with no exceptional difference. The last fix was to a typo. I almost find that suspicious, as the past 7 years have certainly yielded a plethora of research on allocations and their efficiency, such as mimalloc.
I found this from https://users.rust-lang.org/t/reserve-truly-exact-capacity-for-vec/109751/19 . Some parts of this discussion are saying things about making reserve_exact
or some part of reservation actually exact, but this is simply not possible. The Drop
impl on Vec
calls the drop impl on RawVec
, which ultimately needs a Layout
that has the same number of bytes as the allocation was created with (allocators are allowed to assume that allocation length information is being preserved, and at the same time they are allowed to return more than requested, in fact some strategies absolutely must do this, If my understanding is correct). Some field of the RawVec
needs to store this information, and that ends up being the raw capacity also used for the capacity. It would be possible to do some capacity virtualization by adding another usize
, but this will never happen as it impacts the performance of almost everything. Implementations simply must apply their own virtualization as needed, and they must accept that exact real capacity is not possible.
What actually should be done is that reserve_exact
should be deprecated, the name itself can mislead people. In fact, it lead me to an exponential capacity blowup problem in one of my crates. All other related functions need to state up front that allocations and reservations can always allocate more than requested.
Background
Currently, Vec's documentation frequently contains the caveat that "more space may be reserved than requested". This is primarily in response to the fact that jemalloc (or any other allocator) can actually reserve more space than you requested because it relies on fixed size-classes to more effeciently dole out memory. (see the table here)
Jemalloc itself exposes a
malloc_usable_size
function which can be used to determine how much capacity was actually allocated for a pointer, as well asnallocx
which can be used to determine how much capacity will be allocated for a given size and alignment. Vec can in principle query one of these methods and update its capacity field with the result.The question at hand is: _is this ever worth it to check
usable_capacity
, and is it ever undesirable?_This issue was kicked off by Firefox devs who have experience doing exactly this optimization, and claim it being profitable. Facebook has also claimed excellent dividends in making its allocation strategy more jemalloc friendly, but to my knowledge do not actually query usable_capacity.
Currently our alloction strategy is almost completely naive. One can see the source here, which boils down to:
To the best of my knowledge, this is all the capacity logic for FBVector, which boils down to:
The only major deviation from Rust today being a 1.5 growth factor for "moderate" sized allocations. Note that there is a path in their small_vector type that queries malloc_usable_size.
Unfortunately I've been unable to find good information on what Firefox does here. The best I could find is this discussion which demonstrates big wins -- 6.5% less calls to malloc and 11% less memory usage. However the effort seems to peter out when it is asserted that this is obsoleted by Gecko just using power-of-two capacities. mxr and dxr seem to suggest it's only used for statistics.
Hopefully Firefox and Facebook devs can chime in on any experiences.
Going Forward
I have several outstanding concerns before we push further on this matter.
Rust is not C++, so I am partially suspicious of any benchmarks that demonstrate value in C++. In particular, the proliferation of
size_hint
andextend
could completely blast away any need to be clever with capacities for most programs. I would also expect a large Rust application to frob the allocator less in general just because we default to move semantics, and don't have implicit assignment/copy constructors. Also, Rust programs are much more "reckless" with just passing around pointers into buffers because the borrow checker will always catch misuse. Slices in particular make this incredibly ergonomic. We need Rust-specific benchmarks. I'm sure the Hyper, Servo, Glium, Serde, and Rust devs all have some interesting workloads to look at.Rust's Vec is basically the community's
malloc
/calloc
, because actual malloc is unstable and unsafe. We explicitly support extracting the Vec's pointer and taking control of the allocation. In that regard I believe it is desirable for it to be maximally well-behaved and performant for "good" cases (knowing exactly the capacity you want, having no use for extra capacity). There's a certain value in requesting a certain capacity, and getting the capacity. Bothnallocx
andmalloc_usable_size
are virtual function calls with non-trivial logic, and may have unacceptable overhead for responsible users of Vec.Note that anything we do must enable
Vec
andBox<[T]>
to reliably roundtrip through each other without reallocating in certain cases. If I invokeshrink_to_fit
orwith_capacity
, it ought to not reallocate when I try to convert to aBox<[T]>
. As far as I can tell, this should be possible to uphold even when usingmalloc_usable_size
because jemalloc is "fuzzy" and only requires that the given size is somewhere between the requested one and the usable one.Anything we do must also work with allocators that aren't jemalloc. This may be as simple as setting
usable_size = requested_size
for every allocator that isn't jemalloc.CC @pnkfelix @erickt @reem @seanmonstar @tomaka @SimonSapin @larsberg @pcwalton @Swatinem @nnethercote
CC #29848 #29847 #27627
CC @rust-lang/libs
EDIT(workingjubilee, 2023-05-06): Originally one of the links touched on the master branch of folly instead of a permalink to a specific commit. Linkrot claimed the one to jemalloc's docs. These have been updated with a permalink and a Wayback Machine link, respectively.