Open overlookmotel opened 5 months ago
It's also possible to get Vec
down to 16 bytes without aligning all allocator chunks on 4 GiB, if can accept:
Vec
.Vec
.As follows:
capacity
field (5 bits can store a number 0 - 31).Vec
is limited to 134 million (1 << 27).Allocator
as separate field, but deduce it from data pointer + the top 5 bits of capacity
.Allocator
in its footer metadata block.struct AllocatorChunkFooter {
allocator: *const Allocator,
/* other fields */
}
struct Vec<T> {
ptr: NonNull<T>,
len: u32,
capacity: u32,
}
const CHUNK_ALIGNMENT_BITS: u32 = 5;
const CAPACITY_MASK: u32 = u32::MAX >> CHUNK_ALIGNMENT_BITS;
const CHUNK_ALIGNMENT_SHIFT: u32 = 32 - CHUNK_ALIGNMENT_BITS;
const ALLOCATOR_FIELD_MASK: usize = usize::MAX - std::mem::size_of::<AllocatorChunkFooter>() + 1
+ std::mem::offset_of!(AllocatorChunkFooter, allocator);
impl<T> Vec<T> {
pub fn len(&self) -> u32 { self.len }
pub fn capacity(&self) -> u32 { self.capacity & CAPACITY_MASK }
pub fn push(&mut self, value: T) {
if self.len() == self.capacity() { self.grow_for_push(); }
self.ptr.as_ptr().add(len).write(value);
self.len += 1;
}
#[cold]
fn grow_for_push(&mut self) {
let ptr = self.ptr.as_ptr() as *const u8;
let power = (self.capacity >> CHUNK_ALIGNMENT_SHIFT) as usize;
let chunk_end_minus_one_addr = ptr as usize | ((1 << power) - 1);
let allocator_ptr_addr = chunk_end_minus_one_addr & ALLOCATOR_FIELD_MASK;
let allocator_ptr = ptr.add(allocator_ptr_addr - ptr as usize) as *const Allocator;
let allocator: &Allocator = &*allocator_ptr;
// Reallocate `Vec` using `allocator`
}
}
Notes:
Vec::push
and other potentially-resizing operations.Allocator
is a bit complex, but it's in a cold path.Allocator
ptr is 100% cheap bitwise ops.32 - power
in top 5 bits of capacity
. Then chunk_end_minus_one_addr = ptr as usize | (usize::MAX >> inverted_power)
. This removes 1 operation: https://godbolt.org/z/Mx5WYv86rpush()
is more common than len()
, could store power
in top 5 bits of len
field too. Then "is capacity full?" check in push()
becomes just self.len == self.capacity
(remove an AND operation), at the cost of moving that AND operation into len()
instead.Alternatively, could record power
in bottom 3 bits of ptr
(if all allocations are aligned at least on 8).
This would allow 8 sizes of allocator chunks: 16 MiB, 32 MiB, 64 MiB, 128 MiB, 256 MiB, 512 MiB, 1 GiB, 2 GiB.
That's probably sufficient. However, I think using bits in capacity
is preferable, as capacity
is less frequently accessed than ptr
, which is read on every read/write of the Vec
's elements. Clearing bottom 3 bits is only a single bitwise AND operation, but it's an extremely hot path.
If using bottom 3 bits of ptr
to store power
, could also multiply the stored power by 2, which would give a wider range of chunk sizes:
128KiB, 512 KiB, 2 MiB, 8 MiB, 32 MiB, 128 MiB, 512 MiB, 2 GiB
I still think it's better to use spare bits in capacity
, but advantage of packing it into ptr
is that then Box
could also contain an implicit reference to its Allocator
while remaining 8 bytes. This would allow us to make Box
Clone
, which in turn means all AST types could also be Clone
.
There is also the option to use high bits of ptr
. But these bits can possibly be put to better use: #91
Reduce to 24 bytes Make length and capacity fields u32.
I forked https://github.com/oxc-project/oxc-bumpalo, changing usize
to u32
is hard than it seems :-/
Yes, I realised we'd need to combine all the fields into a single struct:
// 24 bytes
struct ArenaVec<'a, T> {
ptr: NonNull<T>,
len: u32,
capacity: u32,
allocator: &'a Allocator,
}
Splitting it into Vec
and RawVec
(the way std
, bumpalo
and allocator-api2
do) won't work, because RawVec {ptr: NonNull<T>, cap: u32, allocator: &'a Allocator}
is 24 bytes (inc 4 bytes padding). We need to fit len
into those spare 4 bytes, but you can't do that with nested structs.
I don't think combining all the methods into 1 struct is so terrible - it just means moving a lot of the code about, rather than a fundamental change. And we can still break up the implementation into multiple files to make it manageable.
But yeah, it's not trivial.
Related to this, Bumpalo is planning to remove the Vec types in favor of std types as soon as the API is stable. So it would be wise to start using our custom vectors if we wish to keep control over our types.
Here I tried to offer to implement a feature-gated Vec32 type https://github.com/fitzgen/bumpalo/issues/256 and found out that they are not too keen on keeping their vector types(Which makes sense, In general, you want to use std::vec
if it supports custom allocators)
// 24 bytes struct ArenaVec<'a, T> { ptr: NonNull<T>, len: u32, capacity: u32, allocator: &'a Allocator, }
What about this? So we keep it dry.
// type Numeric = either Num trait or our own trait implemented for both u32 and u64
struct PrivateArenaVecNameHere<'a, T, N: Numeric> {
ptr: NonNull<T>,
len: N,
capacity: N,
allocator: &'a Allocator,
}
pub struct ArenaVec<'a, T> {
raw: PrivateArenaVecNameHere<'a, T, usize>,
}
// 24 bytes
#[cfg(or(target_pointer_width = "64", target_pointer_width = "32"))]
pub struct ArenaVec32<'a, T> {
raw: PrivateArenaVecNameHere<'a, T, u32>,
}
Do we still need usize arena allocated vectors?
I've missed the most important part😆We still have to copy methods this way, Unless we do it via a trait that comes with default implementations. Now I hope we don't need both types😅
We still can do this though I'm not sure if it would be nice to use or not, RA tends to ignore type aliases and treat them as the actual type.
// type Numeric = either Num trait or our own trait implemented for both u32 and u64
pub struct ArenaVec<'a, T, N: Numeric> {
ptr: NonNull<T>,
len: N,
capacity: N,
allocator: &'a Allocator,
}
pub type Vec<'a, T> = ArenaVec<'a, T, usize>; pub type Vec32<'a, T> = ArenaVec<'a, T, u32>;
I doubt Allocator
trait will be stabilized any time soon. It's such a large feature that it has it's own working group. With 72 open issues! https://github.com/rust-lang/wg-allocators/issues
I don't actually think it'll be too hard to copy std
/bumpalo
/allocator_api2
's implementation and chop it about. We can probably skip implementing some of the more obscure methods and iterators to reduce the size of the task.
Also relevant:
Vec
is currently 32 bytes. It consists of 4 x 8-bytes fields:Vec
's contents (NonNull<u8>
)usize
)usize
)*const Allocator
)Reduce to 24 bytes
Make length and capacity fields
u32
.This is pretty easy.
Reduce to 16 bytes
If arena allocator always created chunks aligned on 4 GiB boundaries, and stored chunk metadata in a header at start of chunks, the pointer field could do double-duty.
Pointer field would be a pointer to the
Vec
's contents. But also if you zero out the bottom 32 bits it would give you a pointer to the chunk metadata, with which you can grow/reallocate theVec
etc.This would require replacing bumpalo with a custom allocator.