rust-lang / wg-allocators

Home of the Allocators working group: Paving a path for a standard set of allocator traits to be used in collections!
http://bit.ly/hello-wg-allocators

Idea: Allocation Splitting #105

Open infomorphic-matti opened 1 year ago

infomorphic-matti commented 1 year ago

Right now, Allocator provides methods to grow and shrink an allocation, as well as to allocate and deallocate. This lets us do things like efficiently grow a Vec without copying if there's more memory available to allocate in the same place, and return that memory to the allocator if it's no longer needed.

However, there is currently no way to divide an allocation in two such that deallocating one half does not deallocate the other.

This means that if you want to, say, split a vector into two halves, a new allocation and a copy are required so that one vector does not accidentally deallocate parts of the other.

Even if you `shrink_to_fit` the vector before splitting, the allocator must have at least n more bytes available (where n is the number of bytes required to store the split-off data), and you still have to copy that data.
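For concreteness, this is what `Vec::split_off` does today: the tail half always gets a fresh allocation plus a copy, even though the bytes already exist contiguously (the sizes below are arbitrary):

```rust
fn main() {
    let mut v: Vec<u8> = vec![0u8; 1024];
    // split_off allocates a fresh 512-byte buffer and copies the tail into
    // it, even though those bytes already sit contiguously in `v`'s buffer.
    let tail = v.split_off(512);
    assert_eq!((v.len(), tail.len()), (512, 512));
}
```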

The idea, then, is to add a method to the Allocator trait that looks something like the following:

```rust
pub trait Allocator {
    // ...existing methods: allocate, deallocate, grow, shrink...

    /// Splits the allocation at `ptr` (described by `old_layout`) into two
    /// allocations described by `new_start_layout` and `new_end_layout`,
    /// each of which can then be deallocated independently.
    unsafe fn split(
        &self,
        ptr: NonNull<u8>,
        old_layout: Layout,
        new_start_layout: Layout,
        new_end_layout: Layout,
    ) -> Result<(NonNull<[u8]>, NonNull<[u8]>), AllocError>;
}
```

Where the new layout halves have the same alignment as the old layout, and each returned pointer can be deallocated separately.
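For instance, a caller splitting a 1 KiB allocation in half might derive the two halves' layouts like this (a hypothetical sketch; the sizes and alignment are arbitrary):

```rust
use std::alloc::Layout;

fn main() {
    // Both halves keep the original alignment, the sizes sum to the old
    // size, and the split offset (512 here) must itself satisfy the second
    // half's alignment.
    let old_layout = Layout::from_size_align(1024, 8).unwrap();
    let new_start_layout = Layout::from_size_align(512, old_layout.align()).unwrap();
    let new_end_layout = Layout::from_size_align(512, old_layout.align()).unwrap();
    assert_eq!(new_start_layout.size() + new_end_layout.size(), old_layout.size());
}
```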

This would let you develop something like:

```rust
fn split_at<T>(parent: Box<[T]>, idx: usize) -> (Box<[T]>, Box<[T]>);
```

which splits a slice into two slices at the given point, each owned independently.

Or perhaps

```rust
fn split_at<T>(parent: Vec<T>, idx: usize) -> (Vec<T>, Vec<T>);
```

which divides the underlying slice at the index, leaving the first vector with no free capacity and giving the rest of the underlying slice (and its capacity) to the second vector. That second vector can then do all the standard vector things: grow, shrink, be deallocated, have references taken, and so on. With a well-suited allocator, this means that no copying or new allocation has to occur whatsoever.
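A rough sketch of how the `Box` version might look on top of the proposed method. Everything here is hypothetical: `SplitAlloc` is a made-up stand-in trait for the proposal, the sketch assumes both returned halves keep their original contents, and it ignores ZST and zero-length edge cases:

```rust
#![feature(allocator_api)]
use std::alloc::{AllocError, Allocator, Layout};
use std::ptr::{self, NonNull};

// Hypothetical stand-in for the proposed method (not a real API).
unsafe trait SplitAlloc: Allocator {
    unsafe fn split(
        &self,
        ptr: NonNull<u8>,
        old_layout: Layout,
        new_start_layout: Layout,
        new_end_layout: Layout,
    ) -> Result<(NonNull<[u8]>, NonNull<[u8]>), AllocError>;
}

// Splits a boxed slice in place, without copying the elements.
fn split_at<T, A: SplitAlloc + Clone>(
    parent: Box<[T], A>,
    idx: usize,
) -> (Box<[T], A>, Box<[T], A>) {
    let len = parent.len();
    assert!(idx <= len && std::mem::size_of::<T>() != 0);
    let (raw, alloc) = Box::into_raw_with_allocator(parent);
    let old_layout = Layout::array::<T>(len).unwrap();
    let start_layout = Layout::array::<T>(idx).unwrap();
    let end_layout = Layout::array::<T>(len - idx).unwrap();
    unsafe {
        // Ask the allocator to divide the block; each half is now an
        // independently deallocatable allocation.
        let (start, end) = alloc
            .split(
                NonNull::new_unchecked(raw.cast()),
                old_layout,
                start_layout,
                end_layout,
            )
            .expect("allocator could not split");
        (
            Box::from_raw_in(
                ptr::slice_from_raw_parts_mut(start.cast::<T>().as_ptr(), idx),
                alloc.clone(),
            ),
            Box::from_raw_in(
                ptr::slice_from_raw_parts_mut(end.cast::<T>().as_ptr(), len - idx),
                alloc,
            ),
        )
    }
}
```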

Especially where an allocator calls out to a systemwide allocator, this might provide significant performance benefits when chunking up large pieces of data to pass around, where only small amounts of that data are likely to be modified afterwards, even if the original large buffer might disappear later (which rules out the typical slice-splitting methods due to lifetime complexity).

This could also make conversion of owned types like String into tree-ish types like string ropes far more efficient, significantly reducing copies by allowing owned chunks of data to be split in much the same way as we can split slices today.

Thoughts?

CAD97 commented 1 year ago

I don't know if any allocator actually supports this.

thomcc commented 1 year ago

One that's directly backed by mmap/VirtualAlloc sort of does, at page granularity, but I don't think that alone makes this worth exposing.

infomorphic-matti commented 1 year ago

When I came up with the idea I was thinking more about allocators that use other storage as backing: say, a big slice of static bytes in an embedded environment, or a restricted pool of memory handed to some sub-process function.

For these sorts of allocators, being able to reduce copies and avoid even temporary allocations is quite valuable (at least in my opinion), as it reduces the risk of an inner process needlessly exceeding a memory-usage restriction that it would otherwise stay within.

For cases where an allocator can't split an allocation block like that, you could have a default fallback that allocates and copies (essentially what Vec::split_off does today) and then shrinks the original allocation. In that case it's essentially the same as what has to be done now anyway.
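A minimal sketch of what such a fallback body might look like, written as a free function against the unstable Allocator trait (error handling simplified; for example, the freshly allocated block leaks if the shrink fails):

```rust
#![feature(allocator_api)]
use std::alloc::{AllocError, Allocator, Layout};
use std::ptr::NonNull;

// Standalone stand-in for a provided default body of the proposed `split`.
// Safety contract mirrors `grow`/`shrink`: `ptr` must denote a live block
// fitting `old_layout`, and the two new sizes must not exceed the old size.
unsafe fn split_fallback<A: Allocator>(
    alloc: &A,
    ptr: NonNull<u8>,
    old_layout: Layout,
    new_start_layout: Layout,
    new_end_layout: Layout,
) -> Result<(NonNull<[u8]>, NonNull<[u8]>), AllocError> {
    // Copy the tail out *before* shrinking, since shrink may move the block.
    let end = alloc.allocate(new_end_layout)?;
    std::ptr::copy_nonoverlapping(
        ptr.as_ptr().add(new_start_layout.size()),
        end.cast::<u8>().as_ptr(),
        new_end_layout.size(),
    );
    // Shrink the original allocation down to just the first half.
    let start = alloc.shrink(ptr, old_layout, new_start_layout)?;
    Ok((start, end))
}
```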

In terms of allocator chunk size: if you work with allocators specialised for specific types, or with backing storage chunked into multiples of a smaller type's alignment, I think it becomes more relevant. For example, I think this allocator would probably be able to split its allocations into smaller pieces whenever the type in the layout has the size and alignment of the chunk size.
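To illustrate, for a hypothetical chunk-pool allocator where every allocation spans a whole number of independently tracked chunks, a split that lands on a chunk boundary is pure pointer arithmetic (`CHUNK_SIZE` and `split_chunks` are made-up names):

```rust
use std::alloc::Layout;
use std::ptr::NonNull;

// Hypothetical fixed granularity of the chunk-pool allocator.
const CHUNK_SIZE: usize = 64;

// Every allocation is a whole number of chunks, and each chunk can be freed
// independently, so a split on a chunk boundary moves no memory at all.
unsafe fn split_chunks(
    ptr: NonNull<u8>,
    new_start_layout: Layout,
    new_end_layout: Layout,
) -> (NonNull<[u8]>, NonNull<[u8]>) {
    debug_assert_eq!(new_start_layout.size() % CHUNK_SIZE, 0);
    let end = NonNull::new_unchecked(ptr.as_ptr().add(new_start_layout.size()));
    (
        NonNull::slice_from_raw_parts(ptr, new_start_layout.size()),
        NonNull::slice_from_raw_parts(end, new_end_layout.size()),
    )
}
```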

Lokathor commented 1 year ago

Any individual allocator is free to provide its own methods if it can do something special (such as forcibly resetting a bump allocator). However, if it's not something that can generally be provided by most or all allocators then it's not a good fit for the Allocator trait in general.

infomorphic-matti commented 1 year ago

The issue with that - at least IMO - is that it means the liballoc collections are essentially pessimistic in how they use memory even when the option to avoid copies is available, which forces anyone who wants to take advantage of a feature like this in an allocator to write their own data structures.

Especially in the case of larger Vecs being chunked off into smaller pieces, such an ability in an allocator could provide a significant performance advantage when available, by avoiding unnecessary linear copies.

We already have grow and shrink, with appropriately defined fallbacks for when an allocator doesn't support them, so I think adding another allocator primitive with a fallback like that would be worth the potentially significant performance gains in some cases, and would avoid requiring people to reimplement Vec just to get a single function like this.

Of course, there is the potential issue of people being careless with this kind of splitting when an allocator lacks the capability, such that the vectors in question just constantly allocate; but that situation is no worse than today, where you have to copy the contents to a new location anyway.

thomcc commented 1 year ago

> For these sorts of allocators, being able to reduce copies and avoid even temporary allocations is quite valuable (at least in my opinion), as it reduces the risk of an inner process needlessly exceeding a memory-usage restriction that it would otherwise stay within.

I mean it would be nice, but I've never seen an allocator that provides this functionality. It seems difficult to implement in a way that's very memory efficient, which seems important for those cases. Regardless, even if it is possible to implement efficiently (not really on topic for this discussion), there are plenty of niche allocator APIs that aren't worth providing on the Allocator trait, even though that means the standard collections can't use them.

CAD97 commented 1 year ago

The existence of Vec::split_off is the best argument for providing this functionality as a customization point for allocators. It is enticing to let the allocator implement this directly when it has the capability.

Note though that adding new allocator methods isn't (nearly) free the way adding defaulted methods to other traits is.

Adding a method to a trait adds an entry to the vtable to reference the implementation function. For a normal trait, this cost is fairly negligible. For Allocator, though, a new method means a new magic __rust_alloc_split symbol that needs to be provided by the runtime.
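For context, the existing magic symbols look roughly like this (simplified; the real declarations live in liballoc, and a split method would add a fifth alongside them):

```rust
// Simplified view of liballoc's runtime allocator shims; the symbols are
// provided by the #[global_allocator] machinery at link time.
extern "Rust" {
    fn __rust_alloc(size: usize, align: usize) -> *mut u8;
    fn __rust_dealloc(ptr: *mut u8, size: usize, align: usize);
    fn __rust_realloc(ptr: *mut u8, old_size: usize, align: usize, new_size: usize) -> *mut u8;
    fn __rust_alloc_zeroed(size: usize, align: usize) -> *mut u8;
}
```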

For something like split, which is so rarely overridable[^1], it seems to make more sense to keep the implementation on this side of the (abstract machine[^2]) allocation barrier. This isn't without benefits, either; if split is on this side, the copy/allocation for the split-off part can be elided in code paths that don't use it. If the allocator does this work on the other side of the barrier, there's no way to elide that without also eliding the shrinking of the still-used prefix.

**Aside about trait `Storage`:** I'm also perhaps a little bit biased here, given my current work to propose replacing `A: Allocator` with [`S: Storage`](https://cad97.github.io/storages-api/storage_api/index.html). The biggest argument against the `Storage` API is the relative complexity of the API surface, so a big part of what I'm doing is paring it down to just the fundamental complexity and proving that the remaining complexity is fundamental to the problem domain and not incidental to the design. Perhaps split could live on [`MultipleStorage`](https://cad97.github.io/storages-api/storage_api/trait.MultipleStorage.html), with a signature like `unsafe fn split(&mut self, handle: Self::Handle, old_layout: Layout, start_layout: Layout, end_layout: Layout) -> Result<(Self::Handle, Self::Handle), AllocError>`? But supporting more operations just makes `Storage` a harder pill to swallow.

[^1]: It's IIUC basically limited to allocators which treat all allocations uniformly (i.e. don't do size bucketing to combat fragmentation) and track allocation metadata in a side table (i.e. don't prefix the metadata to the allocation like most major allocators do). This kinda feels like linked lists (which also provide an O(1) split!): in theory it seems useful, but in practice allocation/access patterns make it a poor fit outside niche usecases.

[^2]: Where exactly the abstract machine allocation barrier sits (the point where allocations turn into fresh Rust Allocated Objects) is both undecided and contentious. Many people in wg-alloc and wg-UCG think the barrier should be at #[global_allocator] / __rust_[de]alloc. (And I'm not impartial to the benefits of this! It's much simpler than any other alternative.) At least I, on the other hand, think that it could make sense to place the allocation barrier on Allocator (né GlobalAlloc), primarily so that e.g. specifying System as your allocator doesn't pessimistically prevent[^3] elision of allocations, but also to benefit the specification of how provenance works across the allocation boundary. (e.g. Windows's System allocator overallocates by an alignment unit in order to offset to the requested alignment; if you use System directly, is writing to that slack space UB? Does the call to dealloc need to have the exact same provenance as returned from alloc, or is provenance over the deallocated fitting Layout sufficient? Without an AM barrier, the answers are necessarily no (it is allowed to read/write the slack space) and yes (it is required to dealloc with the alloc-provided provenance).)

[^3]: The optimizer probably also recognizes the system allocation functions as removable / not observable, so they'll probably get elided anyway. The validity of this where we only launder new RAOs on __rust_alloc is potentially questionable, since this optimization will change the observable state of the AM (that being the sequence of FFI calls). The targeted solution is just to use the same wording for System as is done for Global, but the same caveat applies to any other explicit non-#[global_allocator] allocator.

infomorphic-matti commented 1 year ago

Ahhh, I wasn't aware of some of that special runtime magic necessary for this.

My main interest in this potential allocation primitive comes from combining zero-copy parsing with partial/interruptible parsing, where buffers may be extended by further input. There, being able to peel pieces off owning buffers efficiently, without copying or extra space, is highly useful: it lets you detach the lifetimes of the two halves of a buffer so the second can be modified while something else holds onto the first. I also realised this might significantly benefit the case where people construct tree structures out of sequential backing buffers that are primarily read.

I think I may have been underestimating how specialised this concept is, especially given the runtime magic it would require. Still, I will probably add this as a specialised trait in my own library instead, since there it provides very significant performance gains.

And the optimiser point is a good one, though I think in that case it comes down to a trade-off: whether you are meta-optimising for the case where people always or often use the split-off piece and extend it, or for the case where they only rarely use it (and hence the extra allocations could have been elided), in combination with how likely people are to write to the first half, forcing an allocation. [^1][^2]

**Thoughts on storage:** I'm definitely interested in the idea of a storage API (I'm fairly ambivalent on the question of storages vs allocators for containers), and I can certainly see how putting this sort of function on a multi-storage might be more viable than on allocators, where it's a fairly niche feature, unlike storages, where splitting things in half is a fairly standard capability. I think there'd be some issues with splitting at dynamic offsets, though, if we start talking about inline stack-based storages. It works better with the use of references in that case, but that's an implementation detail of storages more than anything and could be easily fixed with a slightly different custom storage....

I might close the issue in another day or two if people don't think this is viable or don't have other thoughts on it. To be honest, I don't know what the policy for closing and opening issues is.

[^1]: I believe you're right here, though it would also work with e.g. bump allocators. Furthermore, there is a more general optimisation available here: an allocator may only need to extend an allocation by a small amount and shift even poorly aligned splits along a little to pad out to alignment, instead of having to make a full-length temporary allocation. It would essentially allow for more contiguous memory allocation.

[^2]: On the question of the abstract machine, I wonder if it would ever be possible for the compiler to look across that boundary. This is a more general idea and kinda offtopic, but especially in terms of composable allocators more generally it might be quite useful to be able to elide away parts of the code that call out to inner allocators. Just food for thought; I'm not so familiar with how simple implementing the machinery for this would actually be, whether it'd be worth the extra complexity, or how much of this is already done.