Closed Wodann closed 4 years ago
Thanks for summing this up!
NonZeroLayout
will then be defined as in alloc-wg
+ from_layout_unchecked
?
Looks great overal. Some small remarks/bikeshedding:
alloc
should return a Result
(or do something with the errors)alloc_zero_sized_type
naming should be consistent in some way with alloc_sized
, i.e. both should have type or none should have type.alloc_zero_sized_type
a Layout too for consistency between the different methods. alloc_zero_sized_type
could be made safe if we add a ZeroLayout
type or an Alignment
type (which is always power of 2). If we don't do this, it's probably best to suffix it with _unchecked
for consistency with Layout
and the possibility to add a checked version.NonNull::new_unchecked(ptr)
should be wrapped in Ok(...)
Thanks for summing this up!
NonZeroLayout
will then be defined as inalloc-wg
+from_layout_unchecked
?
Indeed
@JelteF thanks for pointing those out. I fixed 1, 2, and 5.
- alloc_zero_sized_type naming should be consistent in some way with alloc_sized, i.e. both should have type or none should have type.
It seems redundant to pass unnecessary information. I'd go for the minimal API.
- alloc_zero_sized_type could be made safe if we add a ZeroLayout type or an Alignment type (which is always power of 2). If we don't do this, it's probably best to suffix it with _unchecked for consistency with Layout and the possibility to add a checked version.
We could propose a change to Layout
to store the power of two k
, but then you'd always need to calculate the alignment on retrieval (i.e. 2^k
)
After considering some more names of the methods I think I prefer alloc_zero
and alloc_non_zero
. Since there's no type in the API, and a ZST is a Sized
type the size is simply 0.
It seems redundant to pass unnecessary information. I'd go for the minimal API.
Yeah that makes sense indeed.
We could propose a change to Layout to store the power of two k, but then you'd always need to calculate the alignment on retrieval (i.e. 2^k)
I think that's not worth it indeed. So in that case I think it makes sense to rename the method to alloc_zero_unchecked
, to stay consistent with layout. And possibly add alloc_zero
, which checks align.is_power_of_two()
before calling alloc_zero_unchecked
.
I would prefer if the "default" alloc
method supports every type of input, and you have to specifically opt-in to an unsafe alloc_nonempty
variant when you are sure that you are not allocating a ZST.
Also for the naming, I feel that anything with zero
in the name is a bad idea, since it is likely to be confused with zero-initialized allocation. Hence my choice of nonempty
.
Finally, I'm still a bit hesitant on this change. While this does help allocators that don't support ZST allocation by providing a default implementation of that methods, it doubles the work for allocators that do support ZST allocation: now they will need to implement both alloc
and alloc_nonempty
and have one call the other.
I would prefer if the "default"
alloc
method supports every type of input, and you have to specifically opt-in to an unsafealloc_nonempty
variant when you are sure that you are not allocating a ZST.
This is the case in this proposal "alloc" supports both. The only difference in what you describe is that alloc_nonempty is also safe.
Also for the naming, I feel that anything with
zero
in the name is a bad idea, since it is likely to be confused with zero-initialized allocation. Hence my choice ofnonempty
.
Good catch, completely agree.
Finally, I'm still a bit hesitant on this change. While this does help allocators that don't support ZST allocation by providing a default implementation of that methods, it doubles the work for allocators that do support ZST allocation: now they will need to implement both
alloc
andalloc_nonempty
and have one call the other.
I think "doubling" effort is a bit of exaggeration when talking about forwarding a call. I understand what you mean though. However, I don't really see a way around it, which also avoids the problem of requiring a branch when allocating using an allocator that does not support it empty allocations.
I have one other question. Do we really need the alloc_zero_sized_type method. I don't really see a reason why anyone would call that instead of the normal alloc. If so, we can simply inline it in the default alloc implementation.
Also for the naming, I feel that anything with zero in the name is a bad idea, since it is likely to be confused with zero-initialized allocation. Hence my choice of nonempty.
That's a good point. I am still thinking of other potential names, because empty types also exist in Rust..
Finally, I'm still a bit hesitant on this change. While this does help allocators that don't support ZST allocation by providing a default implementation of that methods, it doubles the work for allocators that do support ZST allocation: now they will need to implement both alloc and alloc_nonempty and have one call the other.
I see what you are getting at. I'm not sure at present whether a better solution that would minimize workload in both cases is possible. As I mentioned, in Rust there is only really one requirement for allocating a ZST (i.e. it must be aligned). As such, it is unlikely that someone would want to provide an alternative implementation for alloc_zst
, even if the allocator does support allocation of ZSTs. The only cases where this might be, is for external APIs (e.g. a system allocator). If we cannot provide a cleaner solution, then the question becomes "which use case is more common/takes priority?"
I have one other question. Do we really need the alloc_zero_sized_type method. I don't really see a reason why anyone would call that instead of the normal alloc. If so, we can simply inline it in the default alloc implementation.
Yes, I think it could actually be incorporated into alloc
, as you said.
How would you implement realloc
and dealloc
, if the allocator is not supporting ZSTs? align_of<T>()
may return a valid address to an allocated memory block.
I have another potential proposal:
alloc
handles any Layout
input (including zero sized allocations). But rather than be the default implemented method that forwards to alloc_
[non
]empty
, those methods are defaulted to forward to alloc
.
This means that the minimum work for allocators that do inherently support zero sized allocation is just to implement alloc
. For handles to allocators that don't transparently handle zero sized allocation, they do need to handle the zero sized case (but this is to be expected). We can even provide, say, Layout::dangling<T>() -> ptr::NonNull<T>
to make this just a call to that function after the check for zero size.
For allocators that have an extra cost associated with checking for zero sized layouts, we can still provide a version that doesn't allow zero size to omit that check. It would just default to calling the version that allows zero sized layouts, but allow for the more optimizable path to be used.
So to be somewhat concrete, my proposal would be roughly
unsafe trait AllocRef {
type Error;
fn alloc(self, layout: Layout) -> Result<ptr::NonNull<u8>, Self::Error>;
fn alloc_nonzero(self, layout: NonZeroLayout) -> Result<ptr::NonNull<u8>, Self::Error> {
self.alloc(layout.into())
}
// all the other provided methods
}
(Modulo exact naming ofc)
Though I definitely am sympathetic to the need to make allocation as cheap/predictable as possible, I think we can do so while still providing a simpler edge-case-handling default API for when it's not needed. The pattern of providing specialized methods that default are impled as the generally applicable one, but then optimized implementations impl the generally applicable one by delegating to the specialized ones is already a decently common pattern that we can use here.
I definitely need to take some time to catch up on wg-alloc though because I have developed some Opinions that apparently aren't represented yet.
@TimDiekmann as for r
/dealloc
, remember that these take Layout
arguments. So you'd do the same thing there as for alloc
: check to see if the size is 0 and handle it if the underlying allocator doesn't. We'd probably want to provide the same NonZeroLayout
variants of these methods to allow cases to avoid the cost of that check on those methods as well.
The latter design decision is grounded in Rust's philosophy of providing zero-cost abstractions. The
if layout.size() == 0
branch inalloc
would impose a performance penalty for runtime variable allocations.
Making abstractions “zero-cost” is a nice goal, but it’s not the only goal. I think it should definitely not be taken as an absolute imperative. Is avoiding a single branch really worth significantly increasing the API surface? Is this supposed performance penalty measurable at all in real-world programs? Modern CPUs are often pretty good at branch prediction.
(Also, using foo_sized
in API naming to mean non-zero-size goes against the precedent of the existing Sized
trait which means statically-sized (the size is known at compile-time, and might well be zero) as opposed to dynamically-sized.)
@TimDiekmann
How would you implement realloc and dealloc, if the allocator is not supporting ZSTs? align_of
() may return a valid address to an allocated memory block.
It's entirely acceptable to provide default implementations of some of the methods and then specify what the default does and then say (very clearly) "Also as part of implementing this trait you must override the defaults if the defaults would interact badly with the rest of your implementation."
It's an unsafe
trait, so the burden is on the writer of each implementation to ensure that they wrote something which is correct on both a method by method level and on an overall implementation level.
Sorry for my delayed response; I was on vacation.
I performed a benchmark to give us some real-world data about the performance penalty of having an additional branch in the code.
I had two variants:
a) the type of compiler: bump allocator (low allocation overhead) vs global allocator (high allocation overhead)
b) two scenarios that would cover exclusive usage of the alloc_zst
and alloc_non_zst
functions:
Scenario 1: sized between 1-1024: RNG using uniform distribution (time taken over 10_000_000 allocations) Scenario 2: all 0-sized: used RNG to avoid compile-time optimizations (time taken over 10_000_000 allocations)
I ran every scenario three times for both methods of allocating
================
Global allocator
================
Scenario 1
~~~~~~~~~~
branched direct call
-------- -----------
1: 3206616 3108893
2: 3034585 3201266
3: 3197143 2924017
min: 3034585 2924017 (-3.6%)
max: 3206616 3201266 (-0.2%)
avg: 3146115 3078059 (-2.2%)
Scenario 2
~~~~~~~~~~
branched direct call
-------- -----------
1: 68249 67079
2: 67957 67670
3: 68447 66891
min: 67957 66891 (-1.6%)
max: 68447 67670 (-1.1%)
avg: 68218 67213 (-1.5%)
==============
Bump allocator
==============
Scenario 1
~~~~~~~~~~
branched direct call
-------- -----------
1: 70766 69952
2: 77149 72792
3: 77113 68381
min: 70766 68381 (-3.4%)
max: 77149 72792 (-5.6%)
avg: 75009 70375 (-6.2%)
Scenario 2
~~~~~~~~~~
branched direct call
-------- -----------
1: 67150 67069
2: 62554 66874
3: 63523 67374
min: 62554 66874 (+6.9%)
max: 67150 67374 (+0.3%)
avg: 64409 67106 (+4.2%)
As expected, a scratch/bump allocator stands a lot to gain from optimizations like this. Other allocators have less to gain (as most time is spent on the allocation's (system) call), but are still slightly faster - albeit not significant.
Just for transparency: what was your benchmarking code and platform? I'd be happy to run it on my machine.
The fact that scenario 2 for the bump allocator was actually slower for the direct call seems fishy to me. There shouldn't be any way that's possible, as the direct call should objectively have less work to do in the abstract machine.
@CAD97 I cleaned up and pushed the benchmark code here: https://github.com/Wodann/alloc-wg-benchmark. When cleaning up I realised that I hadn't included the inline(always)
marker, so that might have resulted in a slight performance deviation.
The main.rs
contains an explanation on how to run it from CLI. I also re-ran the benchmark with some more runs, to see if results had something to do with the inline(always)
marker or variability. Like last time, I disconnected from the internet, turned off the virus scanner, etc. etc. These are the results:
OK, after looking at your code, let me try to interpret the data.
First of all, let's assume that we don't really care about the performance of ZST allocation: the vast majority of allocation will be non-ZST. Let's ignore the "zero" results for now.
Secondly, it makes no sense to branch for Bump
since it should be able to handle ZSTs natively with no code changes: just bump the pointer by 0 bytes and return it. Since the implementation of Bump
will be the same whether we allow ZSTs or not for the general allocation function, let's ignore the Bump
results.
So we're basically left with one question: with a global allocator that doesn't support ZSTs (i.e. jemalloc), what is the overhead of including two additional zero-check (one in alloc
, one in dealloc
)? From the results it seems like the cost of the branch is insignificant compared to the allocation itself.
As such, I would like to reiterate my previous position: there should be just one method (alloc
) which must be able to handle both ZSTs and non-ZSTs. If the underlying allocator doesn't support ZSTs (i.e. jemalloc, GlobalAllocator
) then the implementation must handle ZSTs specially in alloc
and dealloc
.
I am against adding methods considering the fact that AllocRef
already has a lot of methods and that adding new variants of alloc
makes this complexity exponential. Consider that if we start adding special handling for ZSTs in the API, we would also have to do so for alloc_zeroed
, realloc
, realloc_zeroed
, shrink_in_place
, grow_in_place
, grow_in_place_zeroed
, etc.
That analysis makes sense to me.
The bump allocator merely served as an edge case for allocators that have low overhead. It can indeed trivially implement allocation of ZSTs. I don't know that there are any other allocators with low overhead that would benefit from a direct approach.
The only thing that could still be added - but in a different form - is a convenience function for creating NonNull<u8>::dangling(align: NonZeroUsize)
from an alignment
. That'd basically take away the extra workload for people implementing the default approach to ZSTs.
If only we had PowerOf2Usize
NonNull<u8>::dangling(align: NonZeroUsize)
What about those:
NonNull<u8>::aligned(align: NonZeroUsize)
NonNull<u8>::aligned_dangling(align: NonZeroUsize)
NonNull<u8>::from_align(align: NonZeroUsize)
All could be unstable under #![feature(allocator_api)]
in the first place, as this is very related.
Can we fix, that we don't want to add dedicated functions for ZST, but we want to add ZST support in all methods? This would also clean up RawVec
a lot.
I like aligned_dangling
. Would we make a specialized implementation, like such:
impl NonNull<u8> {
pub const fn aligned_dangling(align: NonZeroUsize) -> Self {
assert!(align.is_power_of_two());
unsafe {
let ptr = align as *mut u8;
NonNull::new_unchecked(ptr)
}
}
}
Can we fix, that we don't want to add dedicated functions for ZST, but we want to add ZST support in all methods? This would also clean up RawVec a lot.
The only thing I feel we can do at that point, is requiring that any alloc
implementation supports ZSTs. If they implement an allocator dependency that itself doesn't support allocation of ZSTs, they'd have to add it in their AllocRef
trait implementation using aligned_dangling
.
So if there are no concerns, I will:
NonNull<u8>::dangling_with_align
~ Layout::dangling()
to return a valid, aligned NonNull<u8>
: https://github.com/rust-lang/rust/pull/69794AllocRef
to explicitly allow ZSTs in all methods and adjust impl AllocRef for Global/System
to return Layout::dangling()
on ZSTs as both structs calls GlobalAlloc
internally: https://github.com/rust-lang/rust/pull/69799RawVec
implementation: ~https://github.com/rust-lang/rust/pull/69824~ https://github.com/rust-lang/rust/pull/69889 Bikeshed: dangling_with_align
. aligned_dangling
is misleading since it implies that dangling
doesn't produce an aligned pointer, which is false.
Do we want an _unchecked
variant? I don't think this is needed, as alignment is usually nothing you don't know at compile time.
Personally, I like Layout::dangling
; this also takes care of making sure the alignment is aligned to a power-of-two.
There's no reason a ptr::NonNull::dangling_with_align
even has to check that the alignment is a power-of-two; there's no harm in having a weirdly-aligned pointer. (The only case that would be problematic would be NonNull::dangling_with_align(0)
, as without interception it would produce a null pointer. Thus taking NonZeroUsize
. Honestly, given the current ptr::NonNull
API surface area, I'd expect fn NonNull::dangling_with_align(usize)
and unsafe fn NonNull::dangling_with_align_unchecked(usize)
, with the unchecked variant only being nonzero. Not to say that those are ideal, just what I'd expect with the current API.)
Just to clarify: you want Layout::dangling()
to return a NonNull<u8>
?
Yes, that's what I'd probably expect; layout.dangling()
to provide a dangling ptr::NonNull
that "could" be to the given layout (but instead it's dangling). (Though it could be Layout::dangling<T: Sized>(&self) -> ptr::NonNull<T>
, but that's probably unnecessary.)
I really like the idea, as the requirements are implicitly met: A Layout
requires align
to be non-zero and a power of two. Additionally, AllocRef
implementors have a valid Layout anyway. Just calling layout.dangling()
is pretty straight forward.
I don't think either, that Layout::dangling<T>
is a good idea.
After discussion in issue #16, I propose:
1) Adding a default implementation for allocation of ZSTs 2) Splitting
alloc
into two separate functions for allocating ZSTs and sized types.The former design decision is inspired by a well-defined approach to allocating ZSTs, that reduces the workload for trait implementors.
The latter design decision is grounded in Rust's philosophy of providing zero-cost abstractions. The
if layout.size() == 0
branch inalloc
would impose a performance penalty for runtime variable allocations. Hence, we offer the ability to make bare metal calls toalloc_sized
andalloc_zero_sized_type
for performant use cases.Design questions If we want to make a data-driven decision, we could run benchmarks to determine the actual cost of the additional branch on runtime variable allocations.