rust-lang / wg-allocators

Home of the Allocators working group: Paving a path for a standard set of allocator traits to be used in collections!
http://bit.ly/hello-wg-allocators
205 stars 9 forks source link

Add default implementation for allocating zero-sized types (or ZSTs) #38

Closed Wodann closed 4 years ago

Wodann commented 4 years ago

After discussion in issue #16, I propose:

1) Adding a default implementation for allocation of ZSTs 2) Splitting alloc into two separate functions for allocating ZSTs and sized types.

The former design decision is inspired by a well-defined approach to allocating ZSTs, that reduces the workload for trait implementors.

The latter design decision is grounded in Rust's philosophy of providing zero-cost abstractions. The if layout.size() == 0 branch in alloc would impose a performance penalty for runtime variable allocations. Hence, we offer the ability to make bare metal calls to alloc_sized and alloc_zero_sized_type for performant use cases.

trait AllocRef {
    fn alloc_sized_type(self, layout: NonZeroLayout) -> Result<NonNull<u8>, AllocErr>;

    #[inline(always)]
    fn alloc(self, layout: Layout) -> Result<NonNull<u8>, AllocErr> {
        if layout.size() == 0 {
             // We want to use NonNull::dangling here, but that function uses mem::align_of::<T>
             // internally. For our use-case we cannot call dangling::<T>, since we are not generic
             // over T; we only have access to the Layout of T. Instead we re-implement the
             // functionality here.
             //
             // See https://github.com/rust-lang/rust/blob/9966af3/src/libcore/ptr/non_null.rs#L70
             // for the reference implementation.
             let ptr = align as *mut u8;
             Ok(unsafe { NonNull::new_unchecked(ptr) })
        } else {
             self.alloc_sized_type(unsafe { NonZeroLayout::from_layout_unchecked(layout) })
        }
    }
}

Design questions If we want to make a data-driven decision, we could run benchmarks to determine the actual cost of the additional branch on runtime variable allocations.

TimDiekmann commented 4 years ago

Thanks for summing this up! NonZeroLayout will then be defined as in alloc-wg + from_layout_unchecked?

JelteF commented 4 years ago

Looks great overal. Some small remarks/bikeshedding:

  1. alloc should return a Result (or do something with the errors)
  2. alloc_zero_sized_type naming should be consistent in some way with alloc_sized, i.e. both should have type or none should have type.
  3. Maybe it makes more sense to give alloc_zero_sized_type a Layout too for consistency between the different methods.
  4. alloc_zero_sized_type could be made safe if we add a ZeroLayout type or an Alignment type (which is always power of 2). If we don't do this, it's probably best to suffix it with _unchecked for consistency with Layout and the possibility to add a checked version.
  5. NonNull::new_unchecked(ptr) should be wrapped in Ok(...)
Wodann commented 4 years ago

Thanks for summing this up! NonZeroLayout will then be defined as in alloc-wg + from_layout_unchecked?

Indeed

Wodann commented 4 years ago

@JelteF thanks for pointing those out. I fixed 1, 2, and 5.

  1. alloc_zero_sized_type naming should be consistent in some way with alloc_sized, i.e. both should have type or none should have type.

It seems redundant to pass unnecessary information. I'd go for the minimal API.

  1. alloc_zero_sized_type could be made safe if we add a ZeroLayout type or an Alignment type (which is always power of 2). If we don't do this, it's probably best to suffix it with _unchecked for consistency with Layout and the possibility to add a checked version.

We could propose a change to Layout to store the power of two k, but then you'd always need to calculate the alignment on retrieval (i.e. 2^k)

JelteF commented 4 years ago

After considering some more names of the methods I think I prefer alloc_zero and alloc_non_zero. Since there's no type in the API, and a ZST is a Sized type the size is simply 0.

It seems redundant to pass unnecessary information. I'd go for the minimal API.

Yeah that makes sense indeed.

We could propose a change to Layout to store the power of two k, but then you'd always need to calculate the alignment on retrieval (i.e. 2^k)

I think that's not worth it indeed. So in that case I think it makes sense to rename the method to alloc_zero_unchecked, to stay consistent with layout. And possibly add alloc_zero, which checks align.is_power_of_two() before calling alloc_zero_unchecked.

Amanieu commented 4 years ago

I would prefer if the "default" alloc method supports every type of input, and you have to specifically opt-in to an unsafe alloc_nonempty variant when you are sure that you are not allocating a ZST.

Also for the naming, I feel that anything with zero in the name is a bad idea, since it is likely to be confused with zero-initialized allocation. Hence my choice of nonempty.

Finally, I'm still a bit hesitant on this change. While this does help allocators that don't support ZST allocation by providing a default implementation of that methods, it doubles the work for allocators that do support ZST allocation: now they will need to implement both alloc and alloc_nonempty and have one call the other.

JelteF commented 4 years ago

I would prefer if the "default" alloc method supports every type of input, and you have to specifically opt-in to an unsafe alloc_nonempty variant when you are sure that you are not allocating a ZST.

This is the case in this proposal "alloc" supports both. The only difference in what you describe is that alloc_nonempty is also safe.

Also for the naming, I feel that anything with zero in the name is a bad idea, since it is likely to be confused with zero-initialized allocation. Hence my choice of nonempty.

Good catch, completely agree.

Finally, I'm still a bit hesitant on this change. While this does help allocators that don't support ZST allocation by providing a default implementation of that methods, it doubles the work for allocators that do support ZST allocation: now they will need to implement both alloc and alloc_nonempty and have one call the other.

I think "doubling" effort is a bit of exaggeration when talking about forwarding a call. I understand what you mean though. However, I don't really see a way around it, which also avoids the problem of requiring a branch when allocating using an allocator that does not support it empty allocations.

JelteF commented 4 years ago

I have one other question. Do we really need the alloc_zero_sized_type method. I don't really see a reason why anyone would call that instead of the normal alloc. If so, we can simply inline it in the default alloc implementation.

Wodann commented 4 years ago

Also for the naming, I feel that anything with zero in the name is a bad idea, since it is likely to be confused with zero-initialized allocation. Hence my choice of nonempty.

That's a good point. I am still thinking of other potential names, because empty types also exist in Rust..

Finally, I'm still a bit hesitant on this change. While this does help allocators that don't support ZST allocation by providing a default implementation of that methods, it doubles the work for allocators that do support ZST allocation: now they will need to implement both alloc and alloc_nonempty and have one call the other.

I see what you are getting at. I'm not sure at present whether a better solution that would minimize workload in both cases is possible. As I mentioned, in Rust there is only really one requirement for allocating a ZST (i.e. it must be aligned). As such, it is unlikely that someone would want to provide an alternative implementation for alloc_zst, even if the allocator does support allocation of ZSTs. The only cases where this might be, is for external APIs (e.g. a system allocator). If we cannot provide a cleaner solution, then the question becomes "which use case is more common/takes priority?"

I have one other question. Do we really need the alloc_zero_sized_type method. I don't really see a reason why anyone would call that instead of the normal alloc. If so, we can simply inline it in the default alloc implementation.

Yes, I think it could actually be incorporated into alloc, as you said.

TimDiekmann commented 4 years ago

How would you implement realloc and dealloc, if the allocator is not supporting ZSTs? align_of<T>() may return a valid address to an allocated memory block.

CAD97 commented 4 years ago

I have another potential proposal:

alloc handles any Layout input (including zero sized allocations). But rather than be the default implemented method that forwards to alloc_[non]empty, those methods are defaulted to forward to alloc.

This means that the minimum work for allocators that do inherently support zero sized allocation is just to implement alloc. For handles to allocators that don't transparently handle zero sized allocation, they do need to handle the zero sized case (but this is to be expected). We can even provide, say, Layout::dangling<T>() -> ptr::NonNull<T> to make this just a call to that function after the check for zero size.

For allocators that have an extra cost associated with checking for zero sized layouts, we can still provide a version that doesn't allow zero size to omit that check. It would just default to calling the version that allows zero sized layouts, but allow for the more optimizable path to be used.

So to be somewhat concrete, my proposal would be roughly

unsafe trait AllocRef {
    type Error;
    fn alloc(self, layout: Layout) -> Result<ptr::NonNull<u8>, Self::Error>;

    fn alloc_nonzero(self, layout: NonZeroLayout) -> Result<ptr::NonNull<u8>, Self::Error> {
        self.alloc(layout.into())
    }

    // all the other provided methods
}

(Modulo exact naming ofc)

Though I definitely am sympathetic to the need to make allocation as cheap/predictable as possible, I think we can do so while still providing a simpler edge-case-handling default API for when it's not needed. The pattern of providing specialized methods that default are impled as the generally applicable one, but then optimized implementations impl the generally applicable one by delegating to the specialized ones is already a decently common pattern that we can use here.

I definitely need to take some time to catch up on wg-alloc though because I have developed some Opinions that apparently aren't represented yet.

CAD97 commented 4 years ago

@TimDiekmann as for r/dealloc, remember that these take Layout arguments. So you'd do the same thing there as for alloc: check to see if the size is 0 and handle it if the underlying allocator doesn't. We'd probably want to provide the same NonZeroLayout variants of these methods to allow cases to avoid the cost of that check on those methods as well.

SimonSapin commented 4 years ago

The latter design decision is grounded in Rust's philosophy of providing zero-cost abstractions. The if layout.size() == 0 branch in alloc would impose a performance penalty for runtime variable allocations.

Making abstractions “zero-cost” is a nice goal, but it’s not the only goal. I think it should definitely not be taken as an absolute imperative. Is avoiding a single branch really worth significantly increasing the API surface? Is this supposed performance penalty measurable at all in real-world programs? Modern CPUs are often pretty good at branch prediction.

(Also, using foo_sized in API naming to mean non-zero-size goes against the precedent of the existing Sized trait which means statically-sized (the size is known at compile-time, and might well be zero) as opposed to dynamically-sized.)

Lokathor commented 4 years ago

@TimDiekmann

How would you implement realloc and dealloc, if the allocator is not supporting ZSTs? align_of() may return a valid address to an allocated memory block.

It's entirely acceptable to provide default implementations of some of the methods and then specify what the default does and then say (very clearly) "Also as part of implementing this trait you must override the defaults if the defaults would interact badly with the rest of your implementation."

It's an unsafe trait, so the burden is on the writer of each implementation to ensure that they wrote something which is correct on both a method by method level and on an overall implementation level.

Wodann commented 4 years ago

Sorry for my delayed response; I was on vacation.

I performed a benchmark to give us some real-world data about the performance penalty of having an additional branch in the code.

I had two variants: a) the type of compiler: bump allocator (low allocation overhead) vs global allocator (high allocation overhead) b) two scenarios that would cover exclusive usage of the alloc_zst and alloc_non_zst functions:

Scenario 1: sized between 1-1024: RNG using uniform distribution (time taken over 10_000_000 allocations) Scenario 2: all 0-sized: used RNG to avoid compile-time optimizations (time taken over 10_000_000 allocations)

I ran every scenario three times for both methods of allocating

================
Global allocator
================

Scenario 1
~~~~~~~~~~

branched        direct call
--------        -----------
1: 3206616      3108893
2: 3034585      3201266
3: 3197143      2924017

min: 3034585    2924017 (-3.6%)
max: 3206616    3201266 (-0.2%)
avg: 3146115    3078059 (-2.2%)

Scenario 2
~~~~~~~~~~

branched        direct call
--------        -----------
1: 68249        67079
2: 67957        67670
3: 68447        66891

min: 67957      66891 (-1.6%)
max: 68447      67670 (-1.1%)
avg: 68218      67213 (-1.5%)

==============
Bump allocator
==============

Scenario 1
~~~~~~~~~~

branched        direct call
--------        -----------
1: 70766        69952
2: 77149        72792
3: 77113        68381

min: 70766      68381 (-3.4%)
max: 77149      72792 (-5.6%)
avg: 75009      ‭70375‬ (-6.2%)

Scenario 2
~~~~~~~~~~

branched        direct call
--------        -----------
1: 67150        67069
2: 62554        66874
3: 63523        67374

min: 62554      66874 (+6.9%)
max: 67150      67374 (+0.3%)
avg: ‭64409‬      67106 (+4.2%)
Wodann commented 4 years ago

As expected, a scratch/bump allocator stands a lot to gain from optimizations like this. Other allocators have less to gain (as most time is spent on the allocation's (system) call), but are still slightly faster - albeit not significant.

CAD97 commented 4 years ago

Just for transparency: what was your benchmarking code and platform? I'd be happy to run it on my machine.

The fact that scenario 2 for the bump allocator was actually slower for the direct call seems fishy to me. There shouldn't be any way that's possible, as the direct call should objectively have less work to do in the abstract machine.

Wodann commented 4 years ago

@CAD97 I cleaned up and pushed the benchmark code here: https://github.com/Wodann/alloc-wg-benchmark. When cleaning up I realised that I hadn't included the inline(always) marker, so that might have resulted in a slight performance deviation.

The main.rs contains an explanation on how to run it from CLI. I also re-ran the benchmark with some more runs, to see if results had something to do with the inline(always) marker or variability. Like last time, I disconnected from the internet, turned off the virus scanner, etc. etc. These are the results:

image

Amanieu commented 4 years ago

OK, after looking at your code, let me try to interpret the data.

First of all, let's assume that we don't really care about the performance of ZST allocation: the vast majority of allocation will be non-ZST. Let's ignore the "zero" results for now.

Secondly, it makes no sense to branch for Bump since it should be able to handle ZSTs natively with no code changes: just bump the pointer by 0 bytes and return it. Since the implementation of Bump will be the same whether we allow ZSTs or not for the general allocation function, let's ignore the Bump results.

So we're basically left with one question: with a global allocator that doesn't support ZSTs (i.e. jemalloc), what is the overhead of including two additional zero-check (one in alloc, one in dealloc)? From the results it seems like the cost of the branch is insignificant compared to the allocation itself.

As such, I would like to reiterate my previous position: there should be just one method (alloc) which must be able to handle both ZSTs and non-ZSTs. If the underlying allocator doesn't support ZSTs (i.e. jemalloc, GlobalAllocator) then the implementation must handle ZSTs specially in alloc and dealloc.

I am against adding methods considering the fact that AllocRef already has a lot of methods and that adding new variants of alloc makes this complexity exponential. Consider that if we start adding special handling for ZSTs in the API, we would also have to do so for alloc_zeroed, realloc, realloc_zeroed, shrink_in_place, grow_in_place, grow_in_place_zeroed, etc.

Wodann commented 4 years ago

That analysis makes sense to me.

The bump allocator merely served as an edge case for allocators that have low overhead. It can indeed trivially implement allocation of ZSTs. I don't know that there are any other allocators with low overhead that would benefit from a direct approach.

The only thing that could still be added - but in a different form - is a convenience function for creating NonNull<u8>::dangling(align: NonZeroUsize) from an alignment. That'd basically take away the extra workload for people implementing the default approach to ZSTs.

Lokathor commented 4 years ago

If only we had PowerOf2Usize

TimDiekmann commented 4 years ago

NonNull<u8>::dangling(align: NonZeroUsize)

What about those:

All could be unstable under #![feature(allocator_api)] in the first place, as this is very related.


Can we fix, that we don't want to add dedicated functions for ZST, but we want to add ZST support in all methods? This would also clean up RawVec a lot.

Wodann commented 4 years ago

I like aligned_dangling. Would we make a specialized implementation, like such:

impl NonNull<u8> {
    pub const fn aligned_dangling(align: NonZeroUsize) -> Self {
        assert!(align.is_power_of_two());

        unsafe {
            let ptr = align as *mut u8;
            NonNull::new_unchecked(ptr)
        }
    }
}
Wodann commented 4 years ago

Can we fix, that we don't want to add dedicated functions for ZST, but we want to add ZST support in all methods? This would also clean up RawVec a lot.

The only thing I feel we can do at that point, is requiring that any alloc implementation supports ZSTs. If they implement an allocator dependency that itself doesn't support allocation of ZSTs, they'd have to add it in their AllocRef trait implementation using aligned_dangling.

TimDiekmann commented 4 years ago

So if there are no concerns, I will:

Amanieu commented 4 years ago

Bikeshed: dangling_with_align. aligned_dangling is misleading since it implies that dangling doesn't produce an aligned pointer, which is false.

TimDiekmann commented 4 years ago

Do we want an _unchecked variant? I don't think this is needed, as alignment is usually nothing you don't know at compile time.

CAD97 commented 4 years ago

Personally, I like Layout::dangling; this also takes care of making sure the alignment is aligned to a power-of-two.

There's no reason a ptr::NonNull::dangling_with_align even has to check that the alignment is a power-of-two; there's no harm in having a weirdly-aligned pointer. (The only case that would be problematic would be NonNull::dangling_with_align(0), as without interception it would produce a null pointer. Thus taking NonZeroUsize. Honestly, given the current ptr::NonNull API surface area, I'd expect fn NonNull::dangling_with_align(usize) and unsafe fn NonNull::dangling_with_align_unchecked(usize), with the unchecked variant only being nonzero. Not to say that those are ideal, just what I'd expect with the current API.)

TimDiekmann commented 4 years ago

Just to clarify: you want Layout::dangling() to return a NonNull<u8>?

CAD97 commented 4 years ago

Yes, that's what I'd probably expect; layout.dangling() to provide a dangling ptr::NonNull that "could" be to the given layout (but instead it's dangling). (Though it could be Layout::dangling<T: Sized>(&self) -> ptr::NonNull<T>, but that's probably unnecessary.)

TimDiekmann commented 4 years ago

I really like the idea, as the requirements are implicitly met: A Layout requires align to be non-zero and a power of two. Additionally, AllocRef implementors have a valid Layout anyway. Just calling layout.dangling() is pretty straight forward.

I don't think either, that Layout::dangling<T> is a good idea.