rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.26k stars 12.71k forks source link

Tracking issue for oom=panic (RFC 2116) #43596

Open Gankra opened 7 years ago

Gankra commented 7 years ago

Update: This now the tracking issue for a mechanism to make OOM / memory allocation errors panic (and so by default unwind the thread) instead of aborting the process. This feature was accepted in RFC https://github.com/rust-lang/rfcs/pull/2116.

It was separated from https://github.com/rust-lang/rust/issues/48043, which tracks another feature of the same RFC, since the two features will likely be stabilized at different times.

Blockers

This should be blocked until we can audit all places in the code that allocate to ensure they safely handle unwinding on OOM.

Steps:

Original feature request:


Several users who are invested in the "thread is a task" model that Rust was originally designed around have expressed a desire to unwind in the case of OOM. Specifically each task has large and unpredictable memory usage, and tearing down the first task that encounters OOM is considered a reasonable strategy.

Aborting on OOM is considered unacceptable because it would be expensive to discard all the work that the other tasks have managed to do.

We already (accidentally?) specified that all allocators can panic on OOM, including the global one, with the introduction of generic allocators on collections. So in theory no one "should" be relying on the default oom=abort semantics. This knob would only affect the default global allocator. Presumably there would just be a function somewhere in the stdlib which is cfg'd to be a panic or abort based on this flag?

This is not a replacement for proper fallible allocation handling routines requested in #29802 because most of the potential users of that API build with panic=abort. Also the requesters of this API are unwilling to go through the effort to ensure all their dependents are using fallible allocations everywhere.

I am dubious on this API but I'm filling an issue so that it's written down and everyone who wants it has a place to go and assert that it is in fact good and desirable.

Gankra commented 7 years ago

Note that in previous topics I stated that libunwind doesn't handle OOM well, but this appears to be outdated or incorrect information. They preallocate their memory before an unwind is even requested. In the limit they might run out of memory and give up but it's not like we've lost anything for trying.

alexcrichton commented 7 years ago

I figured I'd comment on this with respect to https://github.com/rust-lang/rust/issues/42808 and the outcome there along with the @rust-lang/libs team discussion. It was discovered in https://github.com/rust-lang/rust/issues/42808 that the integration of the Alloc trait caused a number of code size related regressions which when diagnosed were almost exclusively related to unwind tables and landing pads. This investigation then caused https://github.com/rust-lang/rust/pull/44049 as the only two stable implementations of allocators don't panic.

Note, though, that this is not necessarily directly related to OOM but just allocator implementations themselves. Should we forbid, in general, panics from allocators? Should we only allow the oom method to panic?

Unsure!

retep998 commented 7 years ago

I personally think it is reasonable to forbid the allocator itself from panicking and instead have the oom method be responsible for handling allocation failures and panicking or aborting as necessary. I also think it would be useful if oom was told the size of the allocation that failed so it could heuristically determine whether it is an actual memory exhaustion event or just someone accidentally specifying a silly large number.

pnkfelix commented 7 years ago

@retep998 fn oom does receive the AllocErr, which for AllocErr::Exhausted will carry the input Layout that was requested. Does that not suffice for what you describe?

(I personally wanted to try to stuff the Layout onto both AllocErr variants, but I worried about bloating the size of the error type after it was no longer an associated item for the allocator...)

retep998 commented 7 years ago

@pnkfelix Yep, that would do exactly what I want. It's my fault for not looking at the existing API and seeing that we already had that 😛

SimonSapin commented 6 years ago

RFC https://github.com/rust-lang/rfcs/pull/2116 was accepted, I’ve repurposed this feature request into a tracking issue. See more in this thread’s original message.

SimonSapin commented 6 years ago

We now have std::alloc::set_alloc_error_hook (tracked at https://github.com/rust-lang/rust/issues/51245) which is document as not allowed to unwind. If that restriction could be lifted, this hook could be the mechanism to implement this feature (by having the program register at start-up a hook that panics).

bstrie commented 6 years ago

Minor question: if someone sets oom=panic and panic=abort, should that function identically to oom=abort, or should it be a compilation error with the rationale that such a configuration is likely a mistake on the user's part?

SimonSapin commented 6 years ago

I expect it would be mostly identical, though the error message that’s printed might not be exactly the same. I don’t know if it’s worth forbidding explicitly even if it’s not particularly useful. An observable effect (though again not particularly useful since std::alloc::set_alloc_error_hook also exists) would be that a hook set through std::panic::set_hook is run on OOM.

ehsanul commented 4 years ago

Making allocation errors default to panic has undergone FCP successfully: https://github.com/rust-lang/rust/issues/66741

I wonder if there's a point to this cargo-level feature anymore?

SimonSapin commented 4 years ago

https://github.com/rust-lang/rust/issues/66741 is only relevant in programs where libstd is not used at all. This issue is about the behavior of libstd.

Ekleog commented 4 years ago

I had sent my ideas to https://github.com/rust-lang/wg-allocators/issues/62 ; but as it looks like there already is a tracking issue, let's move the comments here.

Assume a rust server running multiple things on behalf of clients. It uses regular Rust code, that is not written to care specifically about this particular use case.

The server wants to limit the memory consumption clients could force upon it. It does so by having each client's allocations be in an arena allocator, and then wants to abort just that client's connection in case of “OOM.”

Unfortunately, this means that OOM must not abort the whole process, and AFAIU that is what the current behavior is, even with set_alloc_error_hook -- or at least so do the docs read like.

I'm not sure oom=panic as a global setting could work properly in all cases, as potentially panicking requires memory allocation? But I think that already allowing set_alloc_error_hook to panic would allow the user to swap the error hook at the same time as the allocator, and in case of OOM revert to the global allocator and then, once it's possible to allocate again, perform the panic.

Does that make sense?

Lokathor commented 4 years ago

It's probably the case that (a) the amount of memory needed to alloc panic data is fairly low and (b) you'd hit oom from asking for some large amount that can't be offered. In this situation, you'd probably be able to allocate the panic data.

However, if you fail to allocate while panicing you'll just get a panic, and a panic during panic is an abort. So at least you're not trapped in a loop ;P

But also: you can't carelessly change the global allocator while allocated things are present in the world or they'll get potentially freed on the wrong allocator, which is obviously bad.

Basically the entire scenario you want can't be done with rust's current conventions regarding memory allocation. You'd need specialized local-allocator using code, or waiting for alloc-wg to make the entire ecosystem more alloc aware over time.

Ekleog commented 4 years ago

Just for clarification: I was assuming the existence of a global allocator that is able to use different memory pools for allocation depending on some thread-local. Entering a task's context would set the thread-local to the task's memory pool, and either exiting it or hitting an OOM during it would reset it to the global memory pool. It knows which memory pool to free from without the memory pool being necessarily the same as the one that was configured during allocation, thanks to knowing the address of the memory block to free.

By doing this, I think that what I want to do can be achieved with the current conventions regarding memory allocations, but iff set_alloc_error_hook can be set not to abort, but to panic instead :)

Ekleog commented 4 years ago

Oh, and I forgot to answer your initial points: unfortunately, a client able to make the server abort is a DoS vulnerability, so… that's not a good thing :( And I can totally see the client being able to fill the pool by making the server allocate one byte at a time -- the server could try to prevent that by keeping its own counter of how much memory each task has used, but it requires potentially large changes in every library used by the server, while the solution based on allocators would work without requiring cooperation from every piece of the code (including those I don't control) :)

SimonSapin commented 4 years ago

It knows which memory pool to free from without the memory pool being necessarily the same as the one that was configured during allocation, thanks to knowing the address of the memory block to free.

Yes, this aspect is critical to the soundness of this kind of scheme.

iff set_alloc_error_hook can be set not to abort, but to panic instead :)

Yes, see https://github.com/rust-lang/rust/issues/43596#issuecomment-400659080. I don’t know why this restriction exists, though.

Ekleog commented 4 years ago

For why this restriction exists, I don't know why either, but (for other people who may be reading this, I know you've already read it) the three messages starting from https://github.com/rust-lang/rust/issues/51245#issuecomment-393503805 explain that it's indeed here :)

My first guess would be: panicking might need to allocate itself on some platforms… maybe? And potentially this allocation would not be handled by rust, and thus not be covered by rust's allocation handling, and then Things would happen.

Now, for use cases like the one I describe, where there still is memory available for the program and it's just an arena that got full, allowing the replacement would make sense. But maybe it'd require set_alloc_error_hook to be unsafe?

SimonSapin commented 4 years ago

panicking might need to allocate itself on some platforms… maybe?

It does, on all platforms.

First if you use panic! with a formatting string and additional arguments, the formatted string is allocated:

https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/panicking.rs#L362-L363

Then that string is boxed to make panic payload pointer:

https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/panicking.rs#L375

If instead you use panic! with a single argument, that value is the payload which is also boxed:

https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/panicking.rs#L424

panic! with zero argument has a &'static str payload that goes in the same code path as with single argument.

https://github.com/rust-lang/rust/blob/31add7e60709445617ab54a69f6f21cfcb2e3122/src/libstd/macros.rs#L11-L12

potentially this allocation would not be handled by rust, and thus not be covered by rust's allocation handling, and then Things would happen.

No, these allocations do go through Rust’s global allocator.

If they fail and call a alloc error hook that panics, this is another case of panicking while panicking that can already happen with Drop impls. This causes the process to abort. This happening may be undesirable, but it’s not Undefined Behavior.

But maybe it'd require set_alloc_error_hook to be unsafe?

Why?

Amanieu commented 4 years ago

For why this restriction exists, I don't know why either, but (for other people who may be reading this, I know you've already read it) the three messages starting from #51245 (comment) explain that it's indeed here :)

Because we mark global allocator functions as nounwind to allow the compiler to optimize better.

SimonSapin commented 4 years ago

@Amanieu do you mean functions like std::alloc::alloc? Those return null on failure though. This is about unwinding from std::alloc::handle_alloc_error, which is already in an error path so optimizing it might be less important?

Ekleog commented 4 years ago

But maybe it'd require set_alloc_error_hook to be unsafe?

Why?

This was written under the assumption that possibly some platforms would have unwinding happen outside of the scope of what rustc controls, which might end up doing wrong things if in OOM conditions. It's very possible that this assumption is wrong, though -- I would even hope it is, as it would mean that this unsafe would not be deserved :)

jgallagher-cs commented 3 years ago

It's a little hard to tell from the comments both here and on https://github.com/rust-lang/rust/issues/51245 what is happening and/or needs to happen to make progress on this. My company has a use case very similar to the one described (don't want OOM on one task to tear down the whole process), and we might be able to contribute to moving this forward, but would need some guidance on how to help do that. Could someone point me in the right direction?

sfanxiang commented 3 years ago

Looks like panicking on OOM will never correctly handle all OOM cases, because panic itself can OOM. What about throwing an object that does not require allocation? Instead of

panic!("oom")

we can do

resume_unwind(Box::new(AllocError)); // AllocError is ZST

This has the added benefit that the catcher can distinguish this from other panics (which usually are unrecoverable programming errors).

@alexcrichton @pnkfelix Would the Rust team be interested in reviewing a PR that does this? There are a few other places in libstd that do not allow catching an OOM error - I'm looking into these and so far all of them are fixable.

bjorn3 commented 3 years ago

resume_unwind doesn't exist for #![no_std]. In addition it doesn't actually print a panic message. For that you need panic_any which doesn't exist for #![no_std] either. #![no_std] only allows panicking with an &str or core::fmt::Arguments. It doesn't support arbitrary panic payloads.

sfanxiang commented 3 years ago

resume_unwind doesn't exist for #![no_std]

I take it that oom=panic is for users with std? See rust-lang/rfcs#2116:

  • For users with unwinding, an oom=panic configuration is added to make global allocators panic on oom.
  • For users without unwinding, a try_reserve() -> Result<(), CollectionAllocErr> method is added.

Global allocators and unwinding are std-only. If someday no_std gains unwinding support, we'll probably have resume_unwind for no_std as well.

roblabla commented 3 years ago

Global allocators are available in no_std+alloc scenarios, while resume_unwind is std only I think. Disallowing oom=panic in no_std+alloc would be fine though - those users would likely want to use fallible APIs instead of panicking on oom.

Lokathor commented 3 years ago

if there isn't a way to deny that dependencies ever use infallible allocation then that's basically an unworkable situation.

Amanieu commented 3 years ago

Practically speaking, -C oom=panic would be implemented by having the compiler-generated main function call set_alloc_error_hook to change the hook to one that panics instead of aborting. The changes needed to allow alloc_error_handler to unwind are in #88098 and #88700.

no_std programs should instead define a custom alloc_error_handler which can unwind or abort as needed.

sfanxiang commented 3 years ago

The issue I'm raising is oom=panic will never be oom-safe as long as panicking allocates. An allocation error handler simply should not allocate. To solve this, I think oom=unwind is a good replacement: the actual reason we wanted oom=panic is to unwind.

Bikeshed: Should this be renamed to alloc-error=unwind after 2b789bd0570983e82533f9ed30c80312ac334694 changed everything oom to alloc_error?

bjorn3 commented 3 years ago

To solve this, I think oom=unwind is a good replacement: the actual reason we wanted oom=panic is to unwind.

That still requires allocating the exception object I think. Also catch_unwind which actually catches the panic returns a boxed panic payload.

The issue I'm raising is oom=panic will never be oom-safe as long as panicking allocates.

I believe this problem can be mitigated by using a bit of memory reserved just for allocations necessary during panicking. Maybe it could even be an eagerly initialized thread local variable.

Lokathor commented 3 years ago

let's also remember that one allocation failure doesn't mean eternal allocation failure. If you try to allocate a billion byte array and fail, maybe it's still possible to allocate 20 bytes for an error to describe what went wrong.

sfanxiang commented 3 years ago

catch_unwind which actually catches the panic returns a boxed panic payload.

A boxed ZST doesn't require allocation.

That still requires allocating the exception object I think.

I believe this problem can be mitigated by using a bit of memory reserved just for allocations necessary during panicking.

I agree. However, there's a distinction between reserving an exception object and reserving a panic payload. Exception objects are an implementation detail and it's easy to make them use reserved memory. Panic payload is already exposed via the resume_unwind API which allows the user to catch the payload. The user can simply drop the payload and now you've lost your reserved memory :(

sfanxiang commented 3 years ago

let's also remember that one allocation failure doesn't mean eternal allocation failure. If you try to allocate a billion byte array and fail, maybe it's still possible to allocate 20 bytes for an error to describe what went wrong.

True. Still, is it really a good idea to rely on "small objects" being available? Most allocators don't request memory in increments of 20 bytes - they might request, say, a 64K block each time. Allocating 20 bytes can still become a 64K memory request and fail.

briansmith commented 2 years ago

Panic payload is already exposed via the resume_unwind API which allows the user to catch the payload. The user can simply drop the payload and now you've lost your reserved memory

If the AllocError is a ZST and if Box of a ZST were implemented such that the inner pointer is NULL and allocators are required to implement freeing a NULL pointer as a no-op then there would be no issue, right?

More importantly, the oom=panic feature will create memory safety hazards in code that currently doesn't have memory-safety hazards. In general I think we'll have to assume by default that crates are not compatible with oom=panic unless they explicitly claim otherwise, and I think the build system should enforce this. I expanded on this at https://github.com/rust-lang/rfcs/pull/2116#issuecomment-1006144661.

roblabla commented 2 years ago

if Box of a ZST were implemented such that the inner pointer is NULL

I'm not sure what you mean here, but it is currently guaranteed that Box<T> is a non-null pointer, no matter the T. This allows Option<Box<T>> to be represented as a single pointer. From https://doc.rust-lang.org/stable/std/boxed/index.html#memory-layout:

For zero-sized values, the Box pointer still has to be valid for reads and writes and sufficiently aligned.

And the docs on pointer validity says:

  • A null pointer is never valid, not even for accesses of size zero.
Aaron1011 commented 2 years ago

Box doesn't actually allocate for zero-sized values - it creates a non-null dangling pointer:

fn main() {
    let val = Box::new(());
    println!("{:?}", Box::into_raw(val));
}

prints 0x1

A while back, I experimented with getting a complete zero-alloc panic to work (I used a custom GlobalAlloc that enforced that no allocations happened after a certain point). I believe I got it working for ZST payloads - I can try to dig up the code.

joshtriplett commented 2 years ago

Tagging as needs-summary; it'd help to know the current state of this so we can understand what it would take to stabilize this.

Amanieu commented 2 years ago

It's not implemented yet. The implementation PR is #88098 and is waiting for review.

yaahc commented 2 years ago

@Amanieu what's the current status of this tracking issue? It looks like you implemented the feature in https://github.com/rust-lang/rust/pull/88098, is this just waiting on an eventual push for stabilization?

Amanieu commented 2 years ago

I think this just needs to be stabilized. We might need to audit the standard library to check that we stay sound in the face of unwinding from allocations. Also I'm not sure what the state of the documentation is.

kornelski commented 2 years ago

What are the chances of stabilizing this soon? I've written workaround for it #84612 a year ago, and I wonder if I should push for the workaround still.

nbdd0121 commented 2 years ago

I don't think std is ready for oom=panic yet. The unwinding path is not allocation-free, so likely currently oom=panic only works when a large allocation fail (but the allocator can still satisfy a few small ones).

Lokathor commented 2 years ago

that still seems like a strict improvement over the current situation.

kornelski commented 2 years ago

@nbdd0121 As long as this behavior is safe (I assume it causes abort like a panic during a panic), I don't think this issue is a blocker to stabilization, because this deficiency doesn't change the public interface of oom=panic. It's a quality-of-implementation issue, but even aborting sometimes is better than aborting every time.

nbdd0121 commented 2 years ago

Currently we panic with error message, and that will require allocation, and I consider that a publicly visible behaviour.

Aaron1011 commented 2 years ago

If the error message is fixed, we should be able to pre-allocate the necessary object (though it might require some hacky code in the panic internals).

nbdd0121 commented 2 years ago

Even for fixed error message we need to allocate a Box<dyn Any> for it, and since &str is not a ZST, there are actual heap allocation required. We couldn't preallocate it because we need to allocate one of them for each thread (not all threads are created by std APIs).

C++ doesn't have this issue because it uses special __cxa_allocate_exception and __cxa_free_exception instead of normal heap allocation, which we can't do because our API is just a plain Box.

We could make the panic payload a ZST, and that'll avoid allocation problem on unwind path entirely.

All other allocations that we currently have can be avoided, I think.

SimonSapin commented 2 years ago

We’re constrained to Box<dyn Any> because it’s part of std::thread::Result, right?

Amanieu commented 2 years ago

There are 2 levels of allocation when panicking, first a Box<String> or Box<&str> must be allocated to hold the panic message. Then (with dwarf unwinding) an exception objects needs to be allocated on the heap, which holds the Box<Any>.

nbdd0121 commented 2 years ago

The exception object doesn't need to be allocated on the heap. It can be #[thread_local].