rust-lang / wg-allocators

Home of the Allocators working group: Paving a path for a standard set of allocator traits to be used in collections!
http://bit.ly/hello-wg-allocators
203 stars 9 forks source link

Avoiding passing size and alignment to dealloc when allocator doesn't need them #120

Open kornelski opened 5 months ago

kornelski commented 5 months ago

Currently Rust requires passing Layout to the deallocator. However, Rust is frequently used with the System allocator that wraps libc::free, which doesn't need this information. This means that often all the extra size and layout tracking on the Rust side is wasted effort. This is a tiny code bloat, but deallocation code paths are nearly everywhere, so it adds up.

https://rust.godbolt.org/z/K3xTd9MEr

I wonder if this could be somehow made optional in the Allocator trait?

Amanieu commented 5 months ago

The counter-argument is that since this information is always available at compile-time, why shouldn't we pass it to the allocator? This is an important optimization for allocators since the size of an allocation may not be stored near the allocation itself.

C++ actually added sized deallocation functions in C++14 (https://open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3536.html) for this reason, but they have the problem of dealing with backwards-compatibility.

zakarumych commented 5 months ago

It's only available in compile-time for things like Box and Vec where type and count can be used to reconstruct same Layout. This is not universally true though, sometimes data structure has to carry Layout. Typically size is still required for some other operations, but alignment is only for deallocation.

But IMO it's counting pennies. I can't imagine a situation where data structure is required to keep track of allocation alignment causing problems.

kornelski commented 5 months ago

I know it's very useful for some allocators, but not for all of them. Specifically, for the very common case of executables using the System allocator based on libc malloc/free it's not an optimization, and only redundant unused information that makes Rust executables larger.

Memory overhead for collections isn't a problem (they have to track sizes anyway), but it's a problem of code bloat. Deallocation calls in Rust are ubiquitous due to unwinding and use of ?, and it adds up. Even when the layout is a compile-time constant, it doesn't get optimized out. It still needs to be materialized and passed as an argument at run-time. Freeing a Box<T> takes twice as many instructions as free().


I'd expect dyn Allocator to always require Layout. However, how about borrowing idea from std::mem::needs_drop, and allowing allocator users to generate smaller drop code where possible. It could be something like an associated const A::NEEDS_LAYOUT (or const fn where Self: Sized? Or maybe just specialization on System?) that would allow code such as:

fn drop(&mut self) {
    if A::NEEDS_LAYOUT { 
        A::dealloc(self.ptr, Layout::new(self.size, etc)) 
    } else { 
        A::layoutless_dealloc(self.ptr) 
    }
}
zakarumych commented 5 months ago

But your example passes layout into __rust_dealloc call. It would require Global to be aware of NEEDS_LAYOUT of whatever allocator is set as global one.

Amanieu commented 4 months ago

I'd expect dyn Allocator to always require Layout. However, how about borrowing idea from std::mem::needs_drop, and allowing allocator users to generate smaller drop code where possible. It could be something like an associated const A::NEEDS_LAYOUT (or const fn where Self: Sized? Or maybe just specialization on System?) that would allow code such as:

Doesn't this just boil down to inlining A::dealloc? That would allow the compiler to see that the layout is unused.

Regarding the system allocator we actually do make use of the alignment information for deallocation in some cases, for example on Windows: https://github.com/rust-lang/rust/blob/9aa232ecc7bb006a1fad404f437b049482021a3a/library/std/src/sys/pal/windows/alloc.rs#L210.

I still think that we should keep the layout for deallocation, but I could be convinced otherwise if you can show a significant size saving in a real program (e.g. ripgrep).

ChrisDenton commented 4 months ago

Regarding the system allocator we actually do make use of the alignment information for deallocation in some cases, for example on Windows: https://github.com/rust-lang/rust/blob/9aa232ecc7bb006a1fad404f437b049482021a3a/library/std/src/sys/pal/windows/alloc.rs#L210.

To be precise, it doesn't need alignment info per se. The issue is that it's like two allocators in one. The first simply wraps HeapAlloc/HeapFree the second additionally adds inline metadata. So it's only necessary to store one bit to tell it which dealloc to use (i.e. with or without the inline header). But admittedly there's no good place to store that bit, other than testing the alignment again.

kornelski commented 4 months ago

layout.alloc is almost always a compile-time constant, so even the Windows' two-allocators-in-one could potentially be optimized, if only the allocator could have some inlineable method at the deallocation call site.

Unfortunately, the special __rust_dealloc indirection makes it hard ;(

  1. The implementation of #[global_allocator] relies on linking global symbols at the linker level, so the crates using Global allocator don't know what they're calling.

  2. __rust_dealloc doesn't seem to be inlined, even with LTO. I've found an LLVM discussion on the same issue. They would also like to optimize allocator calls for the specific allocator implementation, but it seems the problem is that function attributes that make alloc/dealloc functions special are lost when inlining.

So I think for this to happen one of these needs to change. For example,

  1. What if the custom global allocator wasn't an attribute you set anywhere, but an explicit crate dependency, where liballoc and everything else depends on user-provider allocator-implementing crate? I think that would enable normal monomorphisation and cross-crate inlining. Probably tricky from ABI perspective to avoid recompiling liballoc?

  2. I guess there could be an extra LLVM pass that happens after allocation-aware passes, which removes the alloc-family attributes from __rust_alloc functions and redoes inlining and constant propagation. That'd be expensive, but maybe ok if reserved for opt-level=3?

the8472 commented 3 months ago

Tangentially, I think it would be useful if allocators had an associated constant of flags describing their properties, such as whether they support alignment-changing realloc (posix realloc doesn't, jemalloc does), whether their realloc avoids memcpy (doing virtual memory schenanigans), whether deallocation does anything (vs. an arena that drops all allocations at once) and similar things.

Containers can then choose to optimize their behavior based on those properties.