RalfJung opened 1 year ago
maybe the scope of the return-position noalias only lasts until free is called, not until the end of the program

In fact that's a problem already for a #[global_allocator] that reuses memory: the moment a pointer comes out of __rust_alloc, that memory must never ever be used again with any other pointer. But the allocator will of course keep a "root pointer" around somewhere, and if this memory range gets freed and reused -- we are technically violating return-position noalias.
A possible spec (in prose) for __rust_alloc/alloc::alloc::alloc (that allows elision) is:

Non-deterministically calls the alloc function of the currently installed global allocator, or returns some suitable storage that is provided by another existing allocation. If the allocation returned by this function is accessed, undefined behaviour occurs if the storage for that allocation is provided by deallocated memory.

And then __rust_dealloc/alloc::alloc::dealloc:

If the allocation pointed to by p has its storage provided by another allocation, undefined behaviour occurs if that allocation has been deallocated. Otherwise, calls the dealloc function of the currently installed global allocator.
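The prose spec for the allocation side can be sketched as executable code, with the AM's non-deterministic choice made explicit as a parameter. All names here (Storage, model_alloc, access_is_ub) are invented for illustration; this is not a real API:

```rust
// Toy model of the prose spec above. The AM's non-deterministic choice is
// made explicit as `nondet`; everything here is illustrative.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Storage {
    // The `alloc` function of the currently installed global allocator ran.
    GlobalAllocator,
    // Suitable storage provided by another existing allocation; per the spec,
    // accessing it after that backing allocation is deallocated would be UB.
    OtherAllocation { backing_deallocated: bool },
}

// `__rust_alloc`: non-deterministically pick a provider.
fn model_alloc(nondet: bool) -> Storage {
    if nondet {
        Storage::GlobalAllocator
    } else {
        Storage::OtherAllocation { backing_deallocated: false }
    }
}

// Accessing the returned allocation: UB iff the storage is provided by
// memory that has since been deallocated.
fn access_is_ub(s: Storage) -> bool {
    matches!(s, Storage::OtherAllocation { backing_deallocated: true })
}
```

The point of making the choice a parameter is that nothing in user code gets to observe which branch was taken, which is exactly what makes elision justifiable.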
cc @rust-lang/wg-allocators
The compiler is allowed to optimize away calls to them even if these calls have side-effects. Is there some operational way to rationalize this?
The permissive model I've been using is that global allocations are serviced by the AM, and that separately the AM is permitted to use whatever sound (based on the trait interface) combination of calls into the #[global_allocator] as part of evaluating the code. This isn't particularly operational, though.

An operational version of this model would be roughly that before each AM step there might nondeterministically additionally be a call to a method of the #[global_allocator]. This is extremely permissive and makes guaranteeing nonreentrancy impossible, but something like this is needed if we want inserting spurious allocations to be possible, i.e. stack-to-heap promotion, or Box::new(T::new()) doing placement initialization.

Or maybe we at least want a way for an allocator to opt-in to such magic?
The default behavior of directly using an allocator implementation should do the obvious thing and run the code that's called. This is of course necessary in order to permit code to rely on stronger guarantees provided by a custom allocator.
This becomes even more interesting when code has direct visibility of the static annotated with #[global_allocator].
But it would certainly be unfortunate if using a custom allocator in some scope were to block optimizations (e.g. if instrumenting allocations prevented removing otherwise-dead ones).
The usual model has been to have some sort of Magic<A> wrapper which nondeterministically either actually uses the inner allocator or the magic AM-provided allocation. (AIUI, this roughly corresponds to annotating the functions with LLVM alloc-family/alloc-kind.) Global would then be able to use this magic around calling into the #[global_allocator].
This also leads to a less permissive operational model that still permits allocation elision: whenever an allocation is requested, nondeterministically choose between calling into the #[global_allocator] and directly servicing the request with some other method. Daemonic nondet means that code must only rely on the basic guarantees of the allocation interface, whereas angelic nondet would allow code to rely on stronger guarantees.

I've wondered if it's possible to permit code to make global reasoning based on the global allocator (e.g. that it always returns an allocation minimally aligned to eight bytes, thus pointer tagging in low bits is free real estate), e.g. saying that if an allocation actually occurs, it is served by the #[global_allocator]. Angelic nondeterminism could do so while still allowing some allocations to be moved to the stack if it's obvious that no guarantees about the allocation are used. Unfortunately, I don't think it's sufficient to ever actually do so, because potentially observable effects aren't limited to just refining the output range.
It's worth noting that this loses another potential transformation which the freer model permits: allocation merging. Merging (Box<T>, Box<U>) into Box<(T, U)> is theoretically justifiable, but I'm actually talking about removing a deallocate; allocate pair, e.g. making Box::new(Box::into_inner(b)) into a no-op.
Something exciting happens with provenance of pointers that move in and out of the allocator. But what?
The effect of what happens is essentially the creation of a new Rust Allocated Object, as described by the allocation request (and return value, if extra slack space is provided). This is certainly exciting, because if the allocation is serviced by slicing up some parent RAO, we now have aliasing RAOs. And not the nice kind of address aliasing where they exist on different planes (thus accesses are still disjoint) used to justify resource nonexhaustion.
An overly restrictive attempt at modeling the behavior would be to say that upon returning an allocation, that memory region is effectively deallocated to make space for the new RAO given to the caller. Upon deallocating the child RAO, the memory region is restored to the allocator's RAO (likely uninitialized) and it's given a pointer with the same provenance as the one it originally returned out.
I say it's overly restrictive mainly because it causes trouble for scoped allocation; a scoped allocator wants to be able to bulk free everything at the end of the scope, but if a child allocation hasn't been returned under that scheme, doing so would be UB. However, a similar scheme without this flaw could fairly straightforwardly be constructed using some meta-tag for RAOs on top of the existing borrow tag for references; it doesn't need the complexity of SB/TB, just a simple stack tracking which RAO has access currently.
This probably should only be done when going through the magic allocator barrier; a normal call to a normal function (that just so happens to be an allocator) shouldn't result in extra provenance mucking. wg-alloc has previously said that dealloc requires the same (or at least equally capable) provenance to be given as was returned from alloc, so everything should work out fine (at least given TB's &Header fix over SB) without needing magic to make things work that otherwise wouldn't.
@chorman0773 that spec seems to allow a compiler to just entirely ignore #[global_allocator] and always call some other allocator that can provide "suitable storage"? That's way too permissive, we need to give the programmer stronger guarantees than that. (Also, "suitable storage" is way too weak of a guarantee; as a programmer I need a guarantee that until I deallocate this, I have exclusive ownership of that memory -- no reads or writes from anyone that I didn't share the pointer with.)

We can say it non-deterministically chooses between the actually registered global allocator and the AM-native allocator; that would solve some of these issues -- but not the one where it always picks the AM-native one...
@CAD97
An operational version of this model would be roughly that before each AM step there might nondeterministically additionally be a call to a method of the #[global_allocator].
That doesn't explain why I might call Box::new(0) and then ask the allocator how many times it has been invoked and it says "0".
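The observation in question is easy to set up today with a counting #[global_allocator] (a standard instrumentation pattern); the question is whether the count is allowed to stay at 0 when the Box::new(0) allocation is elided or substituted:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// A counting global allocator: exactly the kind of instrumentation whose
// observations are at stake in this discussion.
struct Counting;

static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: Counting = Counting;

fn alloc_calls() -> usize {
    ALLOC_CALLS.load(Ordering::Relaxed)
}
```

Today rustc does not elide the heap allocation behind Box::new, so the counter observably increments; under the models being discussed, a conforming implementation could leave it at 0.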
The effect of what happens is essentially the creation of a new Rust Allocated Object, as described by the allocation request (and return value, if extra slack space is provided). This is certainly exciting, because if the allocation is serviced by slicing up some parent RAO, we now have aliasing RAOs. And not the nice kind of address aliasing where they exist on different planes (thus accesses are still disjoint) used to justify resource nonexhaustion.
There's also the alternative where this is not a new Rust Allocated Object, but just a new sub-tree of pointer IDs in the old object (a la Tree Borrows).
This probably should only be done when going through the magic allocator barrier
My question you quoted above was also about whether this should also be done when invoking a custom allocator. Currently that's also just a regular function call, after all.
It can't call another allocator, it has to be provided from the storage of an existing allocation. The actual manner the storage is provided would be unspecified, but the intention is that it's piggy-backed on the existing memory. For example if a global_allocator call is "provided" by a local variable, the actual allocation would be put onto the stack. It could also be put into static storage, or into another dynamic allocation that is no longer being used.
That doesn't explain why [...] ask the allocator how many times it has been invoked and it says "0".
That version of the model is relying on Global always being "the AM provides memory", independent of the fact that the way the AM provides memory is via whatever #[global_allocator] is provided.
There's also the alternative where this is not a new Rust Allocated Object, but just a new sub-tree of pointer IDs in the old object (a la Tree Borrows).
This works for specific known allocators where no magic alloc substitution is done. This doesn't work for the global replaceable allocation, at least not while we have operations like GEP inbounds which care about the RAO bounds rather than the provenance bounds.
Having a subtree of pointer tags/IDs which e.g. pointer offsets look at instead of RAO bounds is essentially isomorphic to making an aliasing RAO subtag, which I did mention as an alternative to the more heavyweight deallocation model. This is because everything which currently references RAO would now instead reference the new concept of RAO tag.
So yes, having a tree of RAO (sub)views is the model which I was (attempting to) lay out.
This holds as necessary even in the face of switching pointer offsets to GEP nowrap, btw: if I know my concrete allocator always provides addresses within some region, I'm permitted to do an offset, knowing it won't wrap, which would wrap were the allocation to be substituted for magic AM-provided memory in a different memory region.
My question you quoted above was also about whether this should also be done when invoking a custom allocator. Currently that's also just a regular function call, after all.
And the answer I attempted to make prior in the post is that directly calling into an allocator (even the #[global_allocator] static item) should just be a regular function call, and that a magic allocator wrapper be used to add on the replaceable semantics.
that spec seems to allow a compiler to just entirely ignore #[global_allocator] and always call some other allocator that can provide "suitable storage"? That's way too permissive, we need to give the programmer stronger guarantees than that. (Also "suitable storage" is way too weak of a guarantee,
I don't see how it could be; suitable storage means satisfying all guaranteed properties of the allocator API. FWIW, wg-alloc did IIRC decide that relying on any property of the configured global allocator when using the global allocation functions is always unsound, even when you know what the global allocator is.

To provide a stronger guarantee than "suitable storage" out of the Global allocator is to forbid substitution of the allocator. Which, I suppose, is a question we do still need to answer: do we permit substitution of global general-purpose allocations, or exclusively just removal of allocations? If the program observes that an allocation actually exists (e.g. observes the address), does that mean that the allocation must exist in the #[global_allocator]? Is the following rewrite permissible?
```rust
// original
fn main() {
    let mut global = Box::new(GlobalData::new());
    real_main(&mut *global);
    drop(global);
}

// transformed to:
#[start] // guaranteed nonreentrancy
unsafe fn _start() {
    static mut GLOBAL: MaybeUninit<GlobalData> = MaybeUninit::uninit();
    let global = GLOBAL.write(GlobalData::new());
    real_main(&mut *global);
    GLOBAL.assume_init_drop();
}
// and provide the same `main`, if it might be called
```
This replaces a source allocation that's known to exist for the life of the program with a static allocation that doesn't require using the dynamic allocator.
Both my and Connor's specs are written assuming that this is permitted and wanted to be permitted. In theory, yes, the compiler could serve all global allocation by some substitution instead of using the #[global_allocator], and I don't know of any way to allow allocation elision that doesn't also permit allocation substitution.
The compiler not substituting allocations arbitrarily is a QOI issue. Rustc+LLVM doesn't seem to ever do heap-to-stack alloc substitution, but AIUI it should be permitted to.
That version of the model is relying on Global always being "the AM provides memory", independent from that the way the AM provides memory is via whatever #[global_allocator] is provided.
Please use less confusing terminology. Global is literally defined to be whatever #[global_allocator] provides (plus magic wrappers).
This holds as necessary even in the face of switching pointer offsets to GEP nowrap btw: if I know my concrete allocator always provides addresses within some region, I'm permitted to do an offset, knowing it won't wrap, which would wrap were the allocation to be substituted for magic AM provided memory in a different memory region.
This just goes to show that you wouldn't be permitted to do the offset, since you cannot rely on your allocator actually being called. So with GEP nowrap I don't think there would be any barrier to reusing the aliasing tree/stack.
do we permit substitution of global general-purpose allocations, or exclusively just removal of allocations? If the program observes that an allocation actually exists (e.g. observes the address), does that mean that the allocation must exist in the #[global_allocator]?
I would have expected the answer to be "no we do not permit substitution; yes it must be in the global allocator."
There are two separate models I've considered:

1. std::alloc::alloc, std::alloc::dealloc, std::alloc::realloc, std::alloc::Global as GlobalAlloc, and std::alloc::Global as Allocator provide memory satisfying the allocation request as described, via an unspecified manner. The AM may make calls into the #[global_allocator] that are sound per the API contract. That the memory is typically provided by the #[global_allocator] is a quality-of-implementation concern. However, access to system allocation is also via #[global_allocator]. E.g. miri could ignore #[global_allocator] and always use its builtin VM allocation, but rustc could not just ignore #[global_allocator] and always use the system allocator.

2. There is a magic allocator combinator Replaceable<A: Allocator> which services allocation by non-deterministically either servicing the allocation request itself (via AM builtin magic) or by forwarding the request to the wrapped allocator. std::alloc::Global is logically implemented as using Replaceable<GA>, where GA is a "ZST reference" to the registered #[global_allocator] static item.

This just goes to show that you wouldn't be permitted to do the offset, since you cannot rely on your allocator actually being called. So with GEP nowrap I don't think there would be any barrier to reusing the aliasing tree/stack.

I would have expected the answer to be "no we do not permit substitution; yes [observably performed allocations] must be in the global allocator."
The main underlying issue is that I have no idea how to justify these two at the same time. Using "angelic nondet substitution" will never be able to justify removal, because the #[global_allocator] side can have arbitrary effects which removal does not preserve. Using "daemonic nondet substitution" still permits every allocation to be serviced by some other AM-managed static and never use the #[global_allocator].
If you want nondet to be forced to use the #[global_allocator], you need some definition for when the allocation is "observed to actually exist," and you need to restrict that to observations exclusively on this side of the allocator bridge if you want to be able to ever determine that it isn't. The point about GEP is that if GEP is ever allowed to offset out of the requested allocation size, then it can serve as an observation of whether the concrete allocator was used. I believe the allowable operations in a "magic" allocation are read, write, and GEP inbounds of the magic allocation; any other operation on the pointer (or a derived pointer) is a potential observation of whether the allocation is "real" and in the #[global_allocator].
A concrete example: is it allowed to remove the heap allocation in this example? rustc+LLVM currently does not remove it (by default). Returning a value not derived from the pointer value does remove the allocation.
```rust
use std::alloc::{alloc, dealloc, Layout};
use std::mem::transmute;

pub unsafe fn alloc_weirdness(layout: Layout) -> usize {
    let p = alloc(layout);
    if !p.is_null() {
        dealloc(p, layout);
    }
    transmute(p)
}
```
A smaller secondary issue is that it's been my impression that we do want to permit heap-to-stack substitution, at least. I'd also not be super comfortable committing to a model that forbids substitution without LLVM documenting what the optimization semantics of "alloc-family"-annotated functions are.
A concrete example: is it allowed to remove the heap allocation in this example? LLVM currently does not remove it. Returning a value not derived from the pointer value does remove the allocation.
If you take your example and make the layout constant, remove the null check, and add -C passes=attributor, then LLVM will remove it (edit: by which I mean it will replace the heap allocation with an actual stack allocation).
Not sure how relevant that is… there are probably much dodgier things you can get LLVM to do by enabling non-default passes… but, well, I saw a reference to it in a mailing list post and was curious.
I'll agree that adding in LLVM passes that aren't enabled by default by the rustc codegen backend is at best dodgy, but the fact that LLVM does have a pass to optimize "alloc-family" functions to allocas does indicate fairly strongly that LLVM considers such substitution valid (even if not desirable by default). If we want to forbid "real substitution" of our global allocation functions, we'd need the LLVM codegen backend to either:

- stop emitting "alloc-family"="FAMILY" annotations if there's any chance of the heap2stack pass being enabled, or
- emit allockind("KIND") annotations to indicate what removal/substitution/unification/spurious/reordering semantics are permitted for the alloc family.

I see good reason for rustc to provide a QOI-style guarantee that there isn't a bonus general-purpose allocator that can get used instead of the #[global_allocator]. I don't see a good reason that the spec for Replaceable<A> (the allocator magic wrapper) should restrict the semantics any further than a nondeterministic choice per allocated object whether to use the wrapped allocator or some other source. Especially when (AIUI) replacing/substituting allocations is the typical reasonably well understood justification path for removing allocations. (It may just be a gap in my knowledge, but repeating, I don't know of any model which justifies removal but not heap2stack substitution; even the simple "allocation isn't an observable event" does.)
But one guarantee I think we can and definitely should provide is that a replaced allocation will succeed, and that the only way allocation fails (returns null) is by calling the underlying allocator. Replacing allocations with a failed allocation is both degenerate and undesired. If a sanitizer wants to inject failed allocations it can do so by instrumenting the global allocator instead of the standard allocation replacement.
I agree that heap-to-stack transformation is a reasonable transformation that we shouldn't close the door on. So I guess you are right, it is allowed to use the AM-native allocator instead of calling the declared global one. To me this clearly sounds like demonic non-det, I don't understand why you consider that problematic?
I am less convinced that we want to allow spurious heap allocations. Having the compiler invoke the allocator arbitrarily any time seems really hard to reason about. Some code really cares about not invoking the allocator in certain sections (such as in signal handlers, interrupt handlers, or the implementation of the allocator itself) -- how would that be guaranteed?
For what happens with the allocations, one model I was considering (assuming that we have non-contiguous allocations) is that when memory is returned from a custom allocator, we "punch a hole" into the allocation it comes from and generate a fresh AllocId for that same address range. This behaves like a regular fresh AM allocation until it is passed to the corresponding deallocation operation, which ends the new AllocId and puts that memory range back into the allocation it came from. On both of these "copies", the contents of that memory get reset to Uninit.
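The "punch a hole" model can be sketched as a toy interpreter state: handing out a range mints a fresh AllocId, deallocating ends it, and both transitions reset the bytes to Uninit (modeled as None). All names here are invented for illustration; this is not a real opsem implementation:

```rust
// Toy model of the "punch a hole" scheme described above.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct AllocId(u64);

struct Memory {
    next_id: u64,
    // (id, start, len) for each live allocation.
    live: Vec<(AllocId, usize, usize)>,
    bytes: Vec<Option<u8>>, // None models Uninit
}

impl Memory {
    fn new(size: usize) -> Self {
        // AllocId(0) is the allocator's own backing allocation.
        Memory { next_id: 1, live: vec![(AllocId(0), 0, size)], bytes: vec![None; size] }
    }

    // The allocator returns [start, start+len): that range now belongs to a
    // brand-new AllocId, and its contents start out Uninit.
    fn punch_hole(&mut self, start: usize, len: usize) -> AllocId {
        let id = AllocId(self.next_id);
        self.next_id += 1;
        self.live.push((id, start, len));
        for b in &mut self.bytes[start..start + len] {
            *b = None;
        }
        id
    }

    // Deallocation ends the child AllocId and returns the range to the
    // allocation it came from, again as Uninit.
    fn restore(&mut self, id: AllocId) {
        let idx = self.live.iter().position(|&(i, _, _)| i == id).expect("double free");
        let (_, start, len) = self.live.remove(idx);
        for b in &mut self.bytes[start..start + len] {
            *b = None;
        }
    }

    fn is_live(&self, id: AllocId) -> bool {
        self.live.iter().any(|&(i, _, _)| i == id)
    }
}
```

The scoped-allocator problem CAD97 raises shows up here directly: bulk-freeing AllocId(0) while a punched-out child is still live has no defined transition in this model.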
But one guarantee I think we can and definitely should provide is that a replaced allocation will succeed, and that the only way allocation fails (returns null) is by calling the underlying allocator. Replacing allocations with a failed allocation is both degenerate and nondesired. If a sanitizer wants to inject failed allocations it can do so by instrumenting the global allocator instead of the standard allocation replacement.
If we do heap-to-stack transform then allocation can fail with a stack overflow. So I think we do have to allow the allocation to fail without actually calling the underlying allocator. But the native AM allocator will never return null, that much we can indeed guarantee.
If we do heap-to-stack transform then allocation can fail with a stack overflow. So I think we do have to allow the allocation to fail without actually calling the underlying allocator. But the native AM allocator will never return null, that much we can indeed guarantee.
To be explicit, this is what I meant by replaced allocation being infallible — that it won't ever return a null pointer to user code. Stack overflow is "usual" resource exhaustion which can occur essentially arbitrarily.
I am less convinced that we want to allow spurious heap allocations. Having the compiler invoke the allocator arbitrarily any time seems really hard to reason about. Some code really cares about not invoking the allocator in certain sections (such as in signal handlers, interrupt handlers, or the implementation of the allocator itself) -- how would that be guaranteed?
It's very heavy-handed, but it's necessarily the case that no_std code without alloc doesn't contain any allocation. It would be possible, if not quite desirable, to have access to spurious allocation as a compiler flag and/or a (attribute controllable) target feature.
Another less principled but useful to actual code motion option is to say that spurious allocation may only be introduced when allocation is potentially reachable by at least one potential branch in the scope. The result is that moving allocation across branch points is allowed, but introducing truly spurious allocation isn't.
However, that every performed allocation corresponds to an actual evaluated source allocation is a useful property to maintain.
To me this clearly sounds like demonic non-det, I don't understand why you consider that problematic?
It's not, the seen problem is with trying to combine that with any amount of guarantee that the concrete allocator is ever actually used, and that it isn't entirely replaced with a bundled static allocator. I'm comfortable leaving that to QOI so long as others are.
Another less principled but useful to actual code motion option is to say that spurious allocation may only be introduced when allocation is potentially reachable by at least one potential branch in the scope. The result is that moving allocation across branch points is allowed, but introducing truly spurious allocation isn't.
Making transformations depend on dead code is a doomed approach IMO. It makes it a lot harder to reason about optimization correctness since the usual correctness notion used formally (contextual refinement) does not suffice any more. So I am rather fundamentally opposed to any approach of this sort.
Another optimization case which came up on IRLO is merging alloc+realloc to just a single alloc [source] and merging dealloc+alloc into a reuse when the layout matches [source]. If the opsem of optimizable allocators only does ndet choosing between "magic" and "real" allocation providers, I don't think either of these are justified.
I don't see a way that can justify these optimizations without decoupling source allocator calls from evaluated allocator calls. But we could still constrain the actually performed allocation somewhat: specify it roughly as

Each call to an Allocator method of Magic<A> non-deterministically performs zero or one arbitrary sound (per the trait definition) method calls to an Allocator method of A. This call will often be pass-through, but doesn't need to be; for example, a call to realloc may correspond to a call to alloc instead, if a prior call to alloc was optimized out.
This prevents code motion like reordering Box::new(Data::new()) to Box::new_uninit().init_with(Data::new), but permits a whole class of optimizations that mere ndet choice between doing the call or using an alternative memory source can't.
Your spec doesn't allow these optimizations either though, or does it? E.g. if I first allocate 8 bytes, and then later realloc to 16 bytes, maybe I want to optimize this to allocate 16 bytes from the start -- but the allocator could then see the request for 16 bytes rather than 8 bytes. So that would violate the property that every performed allocation corresponds to a source allocation. (The realloc might not actually be reached since the program might stop before it gets there.)
Or do you mean that you allow altering argument values at allocator function call sites as well? This feels rather unprincipled, TBH -- I would expect we run into fairly arbitrary-seeming limitations with this approach.
It does mean to allow arbitrary argument values in the calls to the underlying allocator. The only thing controlled for is the timing of calls to the underlying allocator, which must match the timing of a source allocator call; everything else is allowed to be arbitrary.
I believe this is necessary in order to allow merging realloc(alloc(layout::<[u8; 16]>()), layout::<[u8; 32]>()) to alloc(layout::<[u8; 32]>()) but still prohibit adding allocation calls where none existed prior. This timing-only restriction may result in some seemingly arbitrary limitations, but I think this is almost[^more] the weakest the spec can be, and removing the timing restriction means allocations can be introduced spuriously, which I think we want to avoid.
[^more]: The given spec permits "zero or one" calls to the underlying allocator at each source call. Weaker would also allow more calls. Then the only restriction is the one-to-many timing restriction.
The restriction is in the abstract a principled one, as it is basically stating that how <Magic<A> as Allocator>'s implementation calls into <A as Allocator> is unspecified. It might be arbitrarily nondeterministic, but it will satisfy the Allocator trait, and it will only use the underlying allocator in a sound manner (so long as it is used in a sound manner).
Interesting side observation: whatever the choice is here will interact with allocations in const context if/when those are allowed, as those certainly aren't using the registered #[global_allocator].
Does the following (atop @CAD97's existing wording) make sense as a route towards restricting argument values on calls to an Allocator method of A?

The argument values in the call to a method on A must be taken from a call to a method on Magic<A>.

This has the possible extension of:

Magic<A> can use each argument from a call to its methods at most once in all calls to the Allocator methods on A.
This then allows the merging of a realloc and an alloc, since the newly produced alloc will take its size argument from the realloc's size and the original alloc's smaller request size will be left unused, but doesn't allow the optimizer to ask for a surprisingly large amount of memory, since Magic<A> can't give you a call with a surprisingly large allocation to forward to the "real" allocator unless there is a call in the source that it's been able to elide.
The restriction is in the abstract a principled one, as it is basically stating that how <Magic<A> as Allocator>'s implementation calls into <A as Allocator> is unspecified. It might be arbitrarily nondeterministic, but it will satisfy the Allocator trait, and it will only use the underlying allocator in a sound manner (so long as it is used in a sound manner).
Thanks for that, this helps. I haven't seen anything like this before and the consequences seem hard to gauge, but it doesn't feel completely arbitrary any more.
The given spec permits "zero or one" calls to the underlying allocator at each source call. Weaker would also allow more calls. Then the only restriction is the one-to-many timing restriction.
FWIW, the restriction to "at most one" does feel somewhat arbitrary, so I feel like we might as well remove it.
An interesting question came up here: if deallocate is written to not actually free the underlying memory, is it UB to access that memory after calling deallocate?

Given the fact that these functions are "magic", I think I would expect that the AllocId that is synthesized when allocate returns becomes invalid in deallocate no matter what.
Custom allocators (#[global_allocator]) come with all sorts of exciting interactions with the operational semantics:

- The global allocation functions are magic: the compiler may optimize away calls to them, and their return value is annotated with noalias to reflect this magic. That attribute is super strong though.
- Calls to custom allocators (via the GlobalAlloc trait or the Allocator trait) are just regular function calls. But maybe we want those to be magic as well for more optimizations? Or maybe we at least want a way for an allocator to opt-in to such magic? An Allocator that returns memory from some previously-allocated buffer certainly violates LLVM return-position noalias so we can't give them the full treatment (at least not without first working with LLVM to improve their semantics -- maybe the scope of the return-position noalias only lasts until free is called, not until the end of the program).