rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.68k stars 12.63k forks source link

Incompatibility of Rust's stdlib with Coroutines #33368

Closed lhecker closed 4 years ago

lhecker commented 8 years ago

The issue

thread_local! is used in the stdlib, which does not work well with Coroutines. Repeated access to a TLS variable inside a method might be cached and optimized by LLVM. If a coroutine is transferred from one thread to another this can lead to problems, due to a coroutine still using the cached reference to the TLS storage from the previous thread.

What is not the issue

TLS being incompatible with coroutines for the most part (e.g. here) is well known and not an issue per se. You want to use rand::thread_rng() with coroutines? Just use rand::StdRng::new() instead! Most of the time it's just quite easy to circumvent the TLS by simply using something different. This is not true for the stdlib though. One way or the other you're using it somewhere probably.

Possible solutions

I hope we can find a solution for this as this is really a huge problem for using stackful coroutines with Rust and who doesn't want "Go" but with Rust's syntax, eh? :wink:

nagisa commented 8 years ago

You could add a yet another way to panic as per https://github.com/rust-lang/rfcs/pull/1513, in a way which circumvents any catch_unwind.

arielb1 commented 8 years ago

@lhecker

If catch_unwind did not run user code within the with, would it be better? In that case you still could not switch coroutines during a panic (which you can't do currently anyway, because of PANIC_COUNT), but you would not have the reference problem.

lhecker commented 8 years ago

@nagisa Would that work across crates? I'm also not sure if that would be easier than simply splitting up catch_unwind into it's 2 atomic parts.

@arielb1 Yeah it would solve the reference problem, but then it would set the prev to the wrong thread. I think this could be the easiest solution though as long as you make sure that thread::panicking() is false before you call catch_unwind(). If no other alternative gets accepted I'll submit a PR for that.

nagisa commented 8 years ago

Would that work across crates?

The RFC requires only one panicking mechanism to be used in the final binary.

I'm also not sure if that would be easier than simply splitting up catch_unwind into it's 2 atomic parts.

If we want to think about that at all we gotta act fast, because 1.9 with a stable catch_unwind is being released in 2016-05-26.

Make it possible to hook into thread_local!, thus solving the problem entirely. I think that this could be hard to achieve in a performant way though.

Something towards this would be my preferred solution, but keep in mind that thread_local! itself is a wrapper over many layers of abstractions over TLS (any of which could be used anywhere and by anyone).

lhecker commented 8 years ago

I don't think another panic behaviour is the right choice here @nagisa, since changing the "obversable" behaviour of panic isn't really what coroutine implementations need. Rather the API or the implementation has to change. Maybe I misunderstood your idea though...

Hooking into thread_local! would also be my preferred choice if it can be done in a way that doesn't hinder performance. To get a good performance it could be done like https://github.com/rust-lang/rfcs/pull/1513 though, right? (I.e. replacing the TLS lookup function at compile time.)

Apart from that I think that @arielb1's idea is quite good and could be implemented right now without causing any regressions.

zonyitoo commented 8 years ago

I still cannot get the idea behind the PANIC_COUNT, because it just allows people to panic inside Drop::drop while panicking. But why is that?

If we can remove the PANIC_COUNT, and just store a boolean flag in TLS to indicate that whether this thread is panicking, this issue could be solved. In the catch_unwind, you can restore the flag so that you can panic again.

I don't know why we need to accept panic in drop?

sfackler commented 8 years ago

PANIC_COUNT doesn't "allow" people to panic inside drop while panicking - it allows the runtime to determine that has happened so it can abort instead of re-unwinding.

PANIC_COUNT is a number stored in TLS, and I'm not sure how changing that to a bool would materially affect the issues here.

lhecker commented 8 years ago

@sfackler Yeah making it a bool wouldn't solve it probably. But why is this implemented in the "runtime" instead of doing it like C++ with it's noexcept destructors? Is there any technical reason for this? And what about the other solutions I listed in the initial comment?

EDIT: Thanks GitHub for putting the close button right beside the comment button, without adding any confirmation dialogue. UI Design? Anyone? :neutral_face:

zonyitoo commented 8 years ago

@sfackler Well, yes. Changing it to bool won't solve all the problems. But, at least, catch_panic will work:

thread_local! { pub static IS_PANICKING: Cell<bool> = Cell::new(false); }

// Here is a function that will be called when panic! happens
// Just like the one in https://github.com/rust-lang/rust/blob/master/src/libstd/panicking.rs#L198
fn on_panic(obj: &(Any+Send), file: &'static str, line: u32) {
    let is_panicking = IS_PANICKING.with(|s| {
        let orig = s.get();
        s.set(true);
        orig
    });

    if is_panicking {
        // Abort right here
        util::dumb_print(format_args!("thread panicked while processing \
                                       panic. aborting.\n"));
        unsafe { intrinsics::abort() }
    }

    // ...
}

// https://github.com/rust-lang/rust/blob/master/src/libstd/sys/common/unwind/mod.rs#L159
pub fn panicking() -> bool {
    IS_PANICKING.with(|s| s.get())
}

// The catch_panic
// https://github.com/rust-lang/rust/blob/master/src/libstd/sys/common/unwind/mod.rs#L131
unsafe fn inner_try(f: fn(*mut u8), data: *mut u8)
                    -> Result<(), Box<Any + Send>> {
    if panicking() {
        // It should not be allowed (to catch panic while panicking)
        unsafe { intrinsics::abort() }
    }

    let mut payload = imp::payload();
    let r = intrinsics::try(f, data, &mut payload as *mut _ as *mut _);

    // Clear the flag because we already caught the panic
    IS_PANICKING.with(|s| s.set(false));

    if r == 0 {
        Ok(())
    } else {
        Err(imp::cleanup(payload))
    }
}

it allows the runtime to determine that has happened so it can abort instead of re-unwinding.

I think this solution can also fulfill this purpose and catch_panic can be used in coroutines now.

nagisa commented 8 years ago

But why is this implemented in the "runtime" instead of doing it like C++ with it's noexcept destructors?

Because unwind in Rust is always a cold path and we do not want to generate extra drop glue just to accomodate for unwinds from drop glue.

I don't know why we need to accept panic in drop?

Because it has perfectly sensible use-cases.

lhecker commented 8 years ago

Because unwind in Rust is always a cold path and we do not want to generate extra drop glue just to accomodate for unwinds from drop glue.

Hmm... Your statement sounds reasonable, but is that based on actual benchmarking?

Because it has perfectly sensible use-cases.

I kind of don't understand that though, because if your destructor can panic, it will one time correctly unwind the thread and one time it will crash the whole program, because the destructor was called as part of a ongoing unwind. I can't say I like that "wonkyness". If you have time: Would you care to point out a valid use case? It's purely optional and out of interest though.

jonas-schievink commented 8 years ago

@lhecker The compiler's own DiagnosticBuilder can panic in its destructor: https://github.com/rust-lang/rust/blob/3157691f963a86776cb7e6a7842f566032890aba/src/libsyntax/errors/mod.rs#L354-L364

nagisa commented 8 years ago

Hmm... Your statement sounds reasonable, but is that based on actual benchmarking?

Implementing nounwind drop glue would simply double the size of binary code used by drop glues. There’s nothing to benchmark here.

If you have time: Would you care to point out a valid use case?

One case, as pointed out by @jonas-schievink, would be ensuring that all objects go out of scope in a valid state. You cannot drop a DiagnosticBuilder without either emitting or cancelling it, because its a compiler bug otherwise. Similar could apply to something like GuaranteedFlushBufWriter which only allows dropping in case it is properly flushed. There’s many more things one could think of.

You still do not want these to just abort because somewhere down the drop chain there might be a drop which would write some restore state/debug logs/user configuration/etc (but it is fine if the application aborts in case those drops panic, because saving such state when panicking is a last ditch effort anyway).

Basically, IME, disallowing panicking in Drop::drop would be a weird and unpleasant restriction. Whereas, an application panicking twice and then aborting as a result of double-panic won’t make the application any more broken, because it is already broken.

lhecker commented 8 years ago

One case, as pointed out by @jonas-schievink, would be ensuring that all objects go out of scope in a valid state. You cannot drop a DiagnosticBuilder without either emitting or cancelling it, because its a compiler bug otherwise.

Aaaah... That makes sense! That's actually a really good argument for panicking drops. Thanks! :blush:

lhecker commented 8 years ago

I edited the issue because @alexcrichton finally brought me to my senses regarding the widespread use of TLS in the stdlib. There is actually a specific reason why I always thought that it's not an issue and easily solvable but I'm too emberassed to disclose that dumb idea. :neutral_face:

arielb1 commented 8 years ago

Implementing nounwind drop glue would simply double the size of binary code used by drop glues. There’s nothing to benchmark here.

How much code is used by drop glue? It might even be worth it to even do RTTI-based drop glue for unwinding.

nagisa commented 8 years ago

@arielb1 I’m not sure if I did it correctly, but it seems like librustc has about 53591 bytes of drop glue. libsyntax = 33031B.

(command used: objdump -t libblah.rlib | grep '$10drop.' | awk '{ sum += strtonum(gensub(/^0*/,"0x","g",$5)) } END { print sum }')

arielb1 commented 8 years ago

That's small potatoes.

nagisa commented 8 years ago

@arielb1 I wouldn’t call librustc drop-glue intensive, though. Far from it.

lhecker commented 8 years ago

I still do wonder why the binary size would blow up that much though... Since all the "nounwind drop glue" would do is to literally call a single method. Isn't that just one EH entry plus one call in the binary? Compared to the size of drop itself I can't imagine this resulting in a 100% size increase.

Well if the exception handling in Rust would be a bit less "cumbersome" as it is now it would already solve most of the problems with coroutines. All that would be left after that is afaik LOCAL_STDOUT.

I think the only other option is to add a compile time option akin to RFC 1513. But if we did that I might as well write a RFC to add full opt-in coroutine support to Rust because the difference in effort is probably negligible. (In fact coio is much faster than the old libgreen anyways already.)

nagisa commented 8 years ago

I still do wonder why the binary size would blow up that much though...

If the function (e.g. drop glue) marked as nounwind actually begins unwinding, you have undefined behaviour, therefore you must replace all the occurences of panicking in the nounwind glue with some other non-terminating side effect (e.g. abort, exit, unreachable).

This basically means the compiler would have to generate two kinds of drop glue:

This is basically a 100% or close to 100% increase in drop glue size.

lhecker commented 8 years ago

Uhm... Why don't you just call abort in drop()'s landing pad? C++ does that as well and it (or rather "C++ compilers") doesn't replace anything. It just emits call _std_terminate (or whatever the symbol is called) in the landing pad and that's it. That's one instruction plus the exception handling data for the landing pad.

You can see it here as LLVM IR: http://llvm.org/docs/ExceptionHandling.html#new-exception-handling-instructions

arielb1 commented 8 years ago

@lhecker

The problem is that then, any panic in a destructor will cause the program to abort.

OTOH, with the current situation, if there is a DiagnosticEmitter on the stack during a panic, the program will abort anyway, so maybe it's not that bad.

nagisa commented 8 years ago

The problem is that then, any panic in a destructor will cause the program to abort.

I think they mean we should abort on unwind edges from drop glue that is executed as part of unwinding already.

Why don't you just call abort in drop()'s landing pad?

That’s a fair point. I guess there’s a few factors here:

arielb1 commented 8 years ago

I think they mean we should abort on unwind edges from drop glue that is executed as part of unwinding already.

But then won't we need the double drop glue? Either 1) All destructors are nounwind - DiagnosticEmitter aborts when abandoned. 2) Double-unwinding aborts using TLS - this uses TLS which interacts badly with coroutines 3) Unwinding pushes a stack frame that aborts when unwound past - potential for infinite recursion? 4) Double-panicking works somehow - must figure out how

Doubling up all drop-glue will probably require Rust 2.0 - because the destructor for Vec etc. also needs to be doubled to call the right drop glue.

lhecker commented 8 years ago

I think they mean we should abort on unwind edges from drop glue that is executed as part of unwinding already.

I'm unfortunately not that familiar with the terminology in this community so I don't fully understand what you meant with that. (I'm sorry for that.) But I think @arielb1 understood me correctly.

To say it as plain as possible again: I do like C++'s exception rules for destructors more than Rust's complicated system in std::panic (and the unwind module). This means that any exception inside a destructor will unwind the stack but be catched at the destructor's method boundary and lead to an process abort. It's thus impossible for a stack to be unwound twice.

The problem is that then, any panic in a destructor will cause the program to abort.

Yeah... I understand why panicking destructors are a thing that can be quite useful. So it's not like I don't understand it.

But it still just feels so incredibly "wrong" that it's possible in a "safe" language for a program to one time survive a thread unwind (after which it could spawn it again) and one time where the whole process crashes. And all of that is decided on a "whim" over wether a destructor is run as part of the regular flow or as part of a active unwind.

You know? That's just doesn't deterministic and I personally think that determinism should be something very valuable in programming languages.

nagisa commented 8 years ago

@arielb1 Consider this “annotated” MIR CFG for some code I had from a while ago. Circled in red is the drops executed during the unwind, you can easily add edges to abort/terminate/etc in MIR (also in red, but lines this time)

mir_dynamic_drops_3 gv 3 cairo

The only requirements here are that abort is diverging and is nounwind and that it is possible to produce correct code in this case (i.e. landingpad and cleanuppad support the usecase for all platforms). But, that happens to encode a requirement of no double-panic in MIR which does not look like the right place for the invariant to me.

EDIT: made the image a little bit smaller.

arielb1 commented 8 years ago

That's my option 3 - it is equivalent to there being a nounwind frame wrapping each landing pad.

I think we did not implement it that way because a destructor that e.g. calls a logger that panics could cause infinite recursion. That does not sound like a real problem - I think we should go with it if we do not want double unwinding.

zonyitoo commented 8 years ago

@arielb1 Ping.. do we have any solutions about this issue?

zonyitoo commented 8 years ago

Hey guys? It has been 2 weeks! What could we do to solve this problem? Do we have any solutions for this? What could we help to do to solve this problem? Or we should just give up developing coroutine library with Rust?

@nagisa @sfackler @arielb1

nagisa commented 8 years ago

Hey guys? It has been 2 weeks! What could we do to solve this problem? Do we have any solutions for this? What could we help to do to solve this problem?

As been pointed out already, the use of TLS is very widespread in the standard library (its not just PANIC_COUNT). If you have an idea how to get rid of all TLS in the standard library without compromising on what TLS provides us, the fastest way is to submit relevant pull requests against this repository containing said changes.

Alternative options: use libcore instead and develop your own coroutine-compatible libstd on top of that.

lhecker commented 8 years ago

@nagisa I'm pretty sure that @zonyitoo and me would gladly write any necessary code for that. But it'd be cool if any of the possible solutions gets green lit before we start making changes. Since this is a major issue, requiring non-trivial changes. Unfortunately no actual solution was chosen yet, nor where any solutions or recommendation been given by "official" Rust developers.

lhecker commented 8 years ago

BTW: Using libcore is absolutely unacceptable since pretty much every crate uses libstd. And removing TLS is no small feat, since it does require modifying the landing pads of destructors, which has not been endorsed yet (and probably would get rejected at this point), while still very likely requiring a huge amount of time to get into rustc's code base to make that change (which - as I said - I'd gladly do but only if it's merged at the end).

nagisa commented 8 years ago

Unfortunately no actual solution was chosen yet, nor where any solutions or recommendation been given by "official" Rust developers.

Choosing a solution for non-trivial problems is exactly what the RFC process is for.

I’m pretty sure a proposal to reimplement standard library parts which currently use TLS to use something else would be accepted provided thread-safety and speed are not compromised on.

The question is what you’d replace TLS with? I, personally, have no idea provided the requirements.

modifying the landing pads of destructors, which has not been endorsed yet (and probably would get rejected at this point)

The way I see it modification of codegen is the least of the problems (if necessary at all). Removing TLS doesn’t imply doing that either. While the discussion naturally flowed towards the destructors, libstd also uses thread locals in ways which aren’t related to panicking. You’ll need to fix these as well, at which point the PANIC_COUNT could fix itself without doing anything exceptional.

Using libcore is absolutely unacceptable since pretty much every crate uses libstd.

You can have your own libstd with compatible API linked in place of the libstd distributed with rustc without any issues. (--extern std=yourlib.rlib)

lhecker commented 8 years ago

Choosing a solution for non-trivial problems is exactly what the RFC process is for.

I do believe that this is a hard change, but depending on the solution for this it's not a "substantial" one. The only thing right now that might require a RFC would be "abort on panic inside destructors", since it does change semantics (where you can safely panic). (I don't think that this will affect many people though). I do think that I should have opened this issue in the RFC repo though.

I’m pretty sure a proposal to reimplement standard library parts which currently use TLS to use something else would be accepted provided thread-safety and speed are not compromised on. If you're not writing that RFC, I have to and that means it's filled with "Okay boys... Now you tell me what to write here, because I can't possibly decide it. :wink:".

The question is what you’d replace TLS with? I, personally, have no idea provided the requirements.

Which is why it'd be cool if @rust-lang/libs could (finally) say something or at least give some hints about this. :confused:

libstd also uses thread locals in ways which aren’t related to panicking

It's only LOCAL_STDOUT, THREAD_RNG_KEY and THREAD_INFO. Should be doable.

--extern std=yourlib.rlib

Thanks! Didn't knew that.

nagisa commented 8 years ago

cc @rust-lang/libs

retep998 commented 8 years ago

I'm just gonna take a moment to point out that 64-bit Windows has UMS threads, which give you proper OS threads that support TLS and such, but using a user mode scheduler so you effectively have coroutines, and also have other advantages like returning control to your scheduler whenever you block in a syscall. Doesn't solve the problem on other platforms though.

zonyitoo commented 8 years ago

Doesn't solve the problem on other platforms though.

Well, obviously this is not a good solution.

I am happy to make any PRs for this, but what I am asking for is Rust's team to tell me which solution would be acceptable.

brson commented 8 years ago

I'd be in favor of either

Again related to 'compilation scenarios' you could easily imagine a global switch that turned on 'slow' TLS and did the right thing everywhere.

mitsuhiko commented 8 years ago

If you have an idea how to get rid of all TLS in the standard library without compromising on what TLS provides us, the fastest way is to submit relevant pull requests against this repository containing said changes.

Is it really so unlikely that coroutine aware TLS could be achieved? I can see UMS becoming a thing going forward which would make this much easier so maybe it's not the worst idea in the world to investigate this?

gnzlbg commented 8 years ago

What's the "dependency graph" of TLS within the standard lib? That is:

Without this information explicitly stated it is very hard to know which way is the best way forward.

In my opinion the best solution to this problem is not to fork the standard library, but to:

Depending on how big is the dependency graph of TLS in the standard lib, we might be able to split that out in a std::tls module and abstract around it, so that users can avoid it by using your coroutine::cls equivalents instead interfacing with the rest of std. I think we need more info here to asses this properly.

EDIT: the objective should not be to encourage people to fork libstd to implement green threading, but to make it really easy to implement different green threading solutions as a library that can work "as seamlessly as possible" with libstd.

mitsuhiko commented 8 years ago

@gnzlbg TLS is not a global variable, it's very much local. You could think of it like a hidden parameter that is passed to all functions.

gnzlbg commented 8 years ago

Isn't the TLS variable available during the whole lifetime of a thread? (independently of in which function your are)? (in C++ is at least so). Need to read more on Rust thread local variables but I assumed they would work the same.

mitsuhiko commented 8 years ago

@gnzlbg global variables have to deal with concurrency, TLS does not. TLS gets away with a RefCell and without unsafe. A global variable however needs a mutex or something comparable.

gnzlbg commented 8 years ago

Is a closure using a thread local variable Send ? (I suppose so)

EDIT: and if that is the case, does the compiler treat TLS as volatile (i.e. does it reload the address of the TLS variable on every access)?

retep998 commented 8 years ago

@gnzlbg Right now TLS does not have to be treated as volatile because during execution of a function you'll never suddenly be on a different thread. If a closure is sent to a different thread then it'll see the TLS of the thread it was sent to and is running on.

gnzlbg commented 8 years ago

@gnzlbg Right now TLS does not have to be treated as volatile because during execution of a function you'll never suddenly be on a different thread.

@retep998 Right, but if a coroutine yields before finishing, and gets sent to another thread where it resumes execution, when it access TLS it will see the values of the thread it was sent from (which might not longer exist! and is unsafe!).

I can only think of two different situations involving coroutines and TLS, and none of them make sense to me in practice.

In Case A, when execution is resumed, the TLS variables still refer to the storage of the thread the coroutine was migrated from. There are two options, either we update them to refer to the TLS the coroutine was migrated to, or we prevent sending coroutines that access TLS between threads.

I think that updating TLS to refer to completely different storage on resumption is a recipe for disaster since it makes very hard to reason about what is going on. So in my opinion, the best thing would be to forbid Case A completely by forbidding coroutines that access TLS from being migrated between threads. That probably means making thread local storage !Send such that closures/types that refer or contain them also become !Send.

Case B makes no sense to me either, since in this case the variable should not be thread local in the first place. Still, Case B is safe as long as the thread the coroutine was migrated from outlives the thread the coroutine has been migrated to. So this could be allowed if the lifetimes can be enforced.

zonyitoo commented 8 years ago

Well, it is not a wise move to forbid all coroutines from using TLS values. Adding that constraint on TLS value will make users confused and it also becomes an obstacle for users to migrate their programs to use coroutines.

For your first case, there is a good solution: Make TLS to be volatile. If a coroutine is switched out and migrated to another thread, then when execution is resumed in another thread, all subsequence TLS value access will go to the TLS that coroutine was migrated to.

We can do that by adding a flag to compiler.

gnzlbg commented 8 years ago

For your first case, there is a good solution: Make TLS to be volatile. If a coroutine is switched out and migrated to another thread, then when execution is resumed in another thread, all subsequence TLS value access will go to the TLS that coroutine was migrated to.

I suggested adding volatile above as well but I don't like that myself: at every suspension point of a coroutine, every time the corrutine is resumed, all the values of all TLS might have changed. I think this will make it really hard to reason about what a coroutine is going to do since it is impossible to reason about its current state (think about a generator, suspending inside of a loop, where at every loop iteration you might have different TLS values).

I am pretty sure that there are valid use cases for wanting to silently update TLS on coroutine resumption, but I think that forbidding it will prevent hard to debug user errors (the scheduler might migrate coroutines at will), and when the user really wants the coroutine state to change on resumption, there are more explicit alternatives (e.g. get a handler to the thread in which the coroutine is currently running and use that to access some "thread local" state).

Another reason to be against making all TLS volatile, even when implemented as opt in via a compiler flag, is that it pessimizes all the code that access TLS and that is not migrated between threads.

Still, the most important reason is being able to reason about the code. In C, C++, and D, migrating a coroutine/fiber that access thread local storage between threads is undefined behavior, and they cannot catch it at compile time. Is there any low level language that allows migrating coroutines that access thread local storage between threads? What semantics do they chose?

zonyitoo commented 8 years ago

As far as I know, non of them (system programming languages) allows migrating coroutines that access thread local storage between threads. Take Go as an example, the only way you can use TLS is by cgo, and they run FFI calls in a separate thread (can't be sure).

Your case about suspending inside of a loop is very convincing! In that case, making TLS volatile is not a good choice, or say, it may make the result of program wrong completely!

Back to this issue, is there a way to get rid of those TLS usages in libstd? If we want to make TLS !Send, all of them has to be gone.