Currently, creating a new `Slab` allocates a shards array of `Shard`
structs. The `Shard` struct itself owns two boxed arrays of local and
shared metadata for each page on that shard. Even though we don't
allocate the actual storage arrays for those pages until they are
needed, allocating the shard metadata eagerly means that a completely
empty slab results in a fairly large memory allocation up front. This is
especially the case when used with the default `Config`, which (on
64-bit machines) allows up to 4096 threads. On a 64-bit machine, the
`Shared` page metadata is 4 words, so 32 bytes, and the `Local` metadata
is another word. 33 bytes * 32 pages per shard = 1056 bytes, which is a
little over 1 KB per shard. This means that the default config eagerly
allocates 4096 shards * 1056 bytes, about 4 MB of metadata, even when
the program only has one or two threads in it and the remaining
4000-some possible threads will never allocate their shards.
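The arithmetic above can be sketched as a small program. It simply
restates the figures already given in the text (33 bytes of per-page
metadata, 32 pages per shard, 4096 shards under the default config); it
is not derived from the crate's actual struct definitions:

```rust
fn main() {
    // Figures from the text above: Shared page metadata plus Local
    // metadata, as counted there, comes to 33 bytes per page.
    let bytes_per_page: usize = 33;
    let pages_per_shard: usize = 32;
    let per_shard = bytes_per_page * pages_per_shard; // 1056 bytes
    assert_eq!(per_shard, 1056);

    // The default Config on 64-bit machines allows up to 4096 threads,
    // so 4096 shards' worth of metadata is allocated eagerly.
    let max_threads: usize = 4096;
    let total = per_shard * max_threads;
    println!(
        "per shard: {} bytes; total: {} bytes (~{} MB)",
        per_shard,
        total,
        total / (1024 * 1024)
    );
}
```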
When most of the shards are empty because there are very few threads in
the program, most of this allocated memory is not resident and gets
paged out by the operating system, but it results in a surprisingly
large amount of allocated virtual memory. This is the cause of issues
like tokio-rs/tracing#1005.
Furthermore, allocating all of this means that actually constructing a
slab takes a pretty long time. In tracing-subscriber, this is normally
not a major issue, since subscribers tend to be created on startup and
live for the entire lifetime of the program. However, in some use-cases,
like creating a separate subscriber for each test, the performance
impact of allocating all that metadata is quite significant. See, for
example:
https://github.com/rust-analyzer/rust-analyzer/issues/5792#issuecomment-676798623
This branch fixes this by allocating the shard metadata only when a new
shard is actually needed by a new thread. The shard array is now an
array of `AtomicPtr`s to shards, and shards are only allocated the first
time they are `insert`ed to. Since each thread can only insert to its
own shard, the synchronization logic for this is fairly simple. However,
since the shards are morally, although not actually, owned by these
`AtomicPtr`s, there is the potential for leaks when a slab is dropped if
we don't ensure that all the shards it created are also dropped.
Therefore, we use `loom::alloc::Track` for leak detection in tests.
Fortunately, the logic for ensuring these are deallocated is not too
complex.
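A minimal sketch of this lazy-allocation scheme, not the crate's actual
code: the `Shard` contents here are placeholders, and since in the real
slab only the owning thread writes its own slot, the `compare_exchange`
below is more defensive than strictly required. The `Drop` impl shows
why the shards must be freed explicitly: the `AtomicPtr` array owns them
only "morally", so nothing deallocates them automatically.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Placeholder for the real per-shard metadata.
struct Shard {
    id: usize,
}

struct LazyShards {
    // One slot per possible thread; null until that thread's shard
    // is actually needed.
    shards: Box<[AtomicPtr<Shard>]>,
}

impl LazyShards {
    fn new(max_threads: usize) -> Self {
        let shards = (0..max_threads)
            .map(|_| AtomicPtr::new(ptr::null_mut()))
            .collect::<Vec<_>>()
            .into_boxed_slice();
        Self { shards }
    }

    /// Get the shard at `idx`, allocating it on first use.
    fn get_or_alloc(&self, idx: usize) -> &Shard {
        let slot = &self.shards[idx];
        let ptr = slot.load(Ordering::Acquire);
        if !ptr.is_null() {
            // Already allocated by a previous call.
            return unsafe { &*ptr };
        }
        // Allocate a new shard and try to publish it.
        let new = Box::into_raw(Box::new(Shard { id: idx }));
        match slot.compare_exchange(
            ptr::null_mut(),
            new,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => unsafe { &*new },
            // Lost a (hypothetical) race: free our allocation and
            // use the pointer that won.
            Err(existing) => {
                unsafe { drop(Box::from_raw(new)) };
                unsafe { &*existing }
            }
        }
    }
}

impl Drop for LazyShards {
    fn drop(&mut self) {
        // Walk the slots and free every shard that was actually
        // allocated, so dropping the slab cannot leak them.
        for slot in self.shards.iter_mut() {
            let ptr = *slot.get_mut();
            if !ptr.is_null() {
                unsafe { drop(Box::from_raw(ptr)) };
            }
        }
    }
}

fn main() {
    let slab = LazyShards::new(4096);
    // Only the shard we touch gets allocated; the rest stay null.
    assert_eq!(slab.get_or_alloc(0).id, 0);
    assert!(slab.shards[1].load(Ordering::Acquire).is_null());
    println!("only 1 of {} shards allocated", slab.shards.len());
}
```

Under this scheme, the leak-detection tests mentioned above would wrap
the shard allocations in `loom::alloc::Track` so that loom can report
any shard that is not freed when the slab is dropped.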