Closed lhecker closed 4 years ago
You could add a yet another way to panic as per https://github.com/rust-lang/rfcs/pull/1513, in a way which circumvents any catch_unwind
.
@lhecker
If catch_unwind
did not run user code within the with
, would it be better? In that case you still could not switch coroutines during a panic (which you can't do currently anyway, because of PANIC_COUNT
), but you would not have the reference problem.
@nagisa Would that work across crates? I'm also not sure if that would be easier than simply splitting up catch_unwind
into it's 2 atomic parts.
@arielb1 Yeah it would solve the reference problem, but then it would set the prev
to the wrong thread. I think this could be the easiest solution though as long as you make sure that thread::panicking()
is false before you call catch_unwind()
. If no other alternative gets accepted I'll submit a PR for that.
Would that work across crates?
The RFC requires only one panicking mechanism to be used in the final binary.
I'm also not sure if that would be easier than simply splitting up catch_unwind into it's 2 atomic parts.
If we want to think about that at all we gotta act fast, because 1.9 with a stable catch_unwind
is being released in 2016-05-26.
Make it possible to hook into thread_local!, thus solving the problem entirely. I think that this could be hard to achieve in a performant way though.
Something towards this would be my preferred solution, but keep in mind that thread_local!
itself is a wrapper over many layers of abstractions over TLS (any of which could be used anywhere and by anyone).
I don't think another panic behaviour is the right choice here @nagisa, since changing the "obversable" behaviour of panic isn't really what coroutine implementations need. Rather the API or the implementation has to change. Maybe I misunderstood your idea though...
Hooking into thread_local!
would also be my preferred choice if it can be done in a way that doesn't hinder performance. To get a good performance it could be done like https://github.com/rust-lang/rfcs/pull/1513 though, right? (I.e. replacing the TLS lookup function at compile time.)
Apart from that I think that @arielb1's idea is quite good and could be implemented right now without causing any regressions.
I still cannot get the idea behind the PANIC_COUNT
, because it just allows people to panic inside Drop::drop
while panicking. But why is that?
If we can remove the PANIC_COUNT
, and just store a boolean flag in TLS to indicate that whether this thread is panicking, this issue could be solved. In the catch_unwind
, you can restore the flag so that you can panic again.
I don't know why we need to accept panic in drop
?
PANIC_COUNT
doesn't "allow" people to panic inside drop while panicking - it allows the runtime to determine that has happened so it can abort instead of re-unwinding.
PANIC_COUNT
is a number stored in TLS, and I'm not sure how changing that to a bool would materially affect the issues here.
@sfackler Yeah making it a bool wouldn't solve it probably. But why is this implemented in the "runtime" instead of doing it like C++ with it's noexcept
destructors? Is there any technical reason for this? And what about the other solutions I listed in the initial comment?
EDIT: Thanks GitHub for putting the close button right beside the comment button, without adding any confirmation dialogue. UI Design? Anyone? :neutral_face:
@sfackler Well, yes. Changing it to bool won't solve all the problems. But, at least, catch_panic
will work:
thread_local! { pub static IS_PANICKING: Cell<bool> = Cell::new(false); }
// Here is a function that will be called when panic! happens
// Just like the one in https://github.com/rust-lang/rust/blob/master/src/libstd/panicking.rs#L198
fn on_panic(obj: &(Any+Send), file: &'static str, line: u32) {
let is_panicking = IS_PANICKING.with(|s| {
let orig = s.get();
s.set(true);
orig
});
if is_panicking {
// Abort right here
util::dumb_print(format_args!("thread panicked while processing \
panic. aborting.\n"));
unsafe { intrinsics::abort() }
}
// ...
}
// https://github.com/rust-lang/rust/blob/master/src/libstd/sys/common/unwind/mod.rs#L159
pub fn panicking() -> bool {
IS_PANICKING.with(|s| s.get())
}
// The catch_panic
// https://github.com/rust-lang/rust/blob/master/src/libstd/sys/common/unwind/mod.rs#L131
unsafe fn inner_try(f: fn(*mut u8), data: *mut u8)
-> Result<(), Box<Any + Send>> {
if panicking() {
// It should not be allowed (to catch panic while panicking)
unsafe { intrinsics::abort() }
}
let mut payload = imp::payload();
let r = intrinsics::try(f, data, &mut payload as *mut _ as *mut _);
// Clear the flag because we already caught the panic
IS_PANICKING.with(|s| s.set(false));
if r == 0 {
Ok(())
} else {
Err(imp::cleanup(payload))
}
}
it allows the runtime to determine that has happened so it can abort instead of re-unwinding.
I think this solution can also fulfill this purpose and catch_panic
can be used in coroutines now.
But why is this implemented in the "runtime" instead of doing it like C++ with it's noexcept destructors?
Because unwind in Rust is always a cold path and we do not want to generate extra drop glue just to accomodate for unwinds from drop glue.
I don't know why we need to accept panic in drop?
Because it has perfectly sensible use-cases.
Because unwind in Rust is always a cold path and we do not want to generate extra drop glue just to accomodate for unwinds from drop glue.
Hmm... Your statement sounds reasonable, but is that based on actual benchmarking?
Because it has perfectly sensible use-cases.
I kind of don't understand that though, because if your destructor can panic, it will one time correctly unwind the thread and one time it will crash the whole program, because the destructor was called as part of a ongoing unwind. I can't say I like that "wonkyness". If you have time: Would you care to point out a valid use case? It's purely optional and out of interest though.
@lhecker The compiler's own DiagnosticBuilder
can panic in its destructor: https://github.com/rust-lang/rust/blob/3157691f963a86776cb7e6a7842f566032890aba/src/libsyntax/errors/mod.rs#L354-L364
Hmm... Your statement sounds reasonable, but is that based on actual benchmarking?
Implementing nounwind drop glue would simply double the size of binary code used by drop glues. There’s nothing to benchmark here.
If you have time: Would you care to point out a valid use case?
One case, as pointed out by @jonas-schievink, would be ensuring that all objects go out of scope in a valid state. You cannot drop a DiagnosticBuilder
without either emitting or cancelling it, because its a compiler bug otherwise. Similar could apply to something like GuaranteedFlushBufWriter
which only allows dropping in case it is properly flushed. There’s many more things one could think of.
You still do not want these to just abort
because somewhere down the drop chain there might be a drop which would write some restore state/debug logs/user configuration/etc (but it is fine if the application aborts in case those drops panic, because saving such state when panicking is a last ditch effort anyway).
Basically, IME, disallowing panicking in Drop::drop
would be a weird and unpleasant restriction. Whereas, an application panicking twice and then aborting as a result of double-panic won’t make the application any more broken, because it is already broken.
One case, as pointed out by @jonas-schievink, would be ensuring that all objects go out of scope in a valid state. You cannot drop a DiagnosticBuilder without either emitting or cancelling it, because its a compiler bug otherwise.
Aaaah... That makes sense! That's actually a really good argument for panicking drops. Thanks! :blush:
I edited the issue because @alexcrichton finally brought me to my senses regarding the widespread use of TLS in the stdlib. There is actually a specific reason why I always thought that it's not an issue and easily solvable but I'm too emberassed to disclose that dumb idea. :neutral_face:
Implementing nounwind drop glue would simply double the size of binary code used by drop glues. There’s nothing to benchmark here.
How much code is used by drop glue? It might even be worth it to even do RTTI-based drop glue for unwinding.
@arielb1 I’m not sure if I did it correctly, but it seems like librustc has about 53591 bytes of drop glue. libsyntax = 33031B.
(command used: objdump -t libblah.rlib | grep '$10drop.' | awk '{ sum += strtonum(gensub(/^0*/,"0x","g",$5)) } END { print sum }'
)
That's small potatoes.
@arielb1 I wouldn’t call librustc drop-glue intensive, though. Far from it.
I still do wonder why the binary size would blow up that much though... Since all the "nounwind drop glue" would do is to literally call a single method. Isn't that just one EH entry plus one call
in the binary? Compared to the size of drop itself I can't imagine this resulting in a 100% size increase.
Well if the exception handling in Rust would be a bit less "cumbersome" as it is now it would already solve most of the problems with coroutines. All that would be left after that is afaik LOCAL_STDOUT
.
I think the only other option is to add a compile time option akin to RFC 1513. But if we did that I might as well write a RFC to add full opt-in coroutine support to Rust because the difference in effort is probably negligible. (In fact coio
is much faster than the old libgreen anyways already.)
I still do wonder why the binary size would blow up that much though...
If the function (e.g. drop glue) marked as nounwind actually begins unwinding, you have undefined behaviour, therefore you must replace all the occurences of panicking in the nounwind glue with some other non-terminating side effect (e.g. abort
, exit
, unreachable
).
This basically means the compiler would have to generate two kinds of drop glue:
nounwind
which would get called only as a handler for unwinds from other drop glue (i.e. the drops during PANIC_COUNT=1 case).This is basically a 100% or close to 100% increase in drop glue size.
Uhm... Why don't you just call abort in drop()
's landing pad? C++ does that as well and it (or rather "C++ compilers") doesn't replace anything. It just emits call _std_terminate
(or whatever the symbol is called) in the landing pad and that's it. That's one instruction plus the exception handling data for the landing pad.
You can see it here as LLVM IR: http://llvm.org/docs/ExceptionHandling.html#new-exception-handling-instructions
@lhecker
The problem is that then, any panic in a destructor will cause the program to abort
.
OTOH, with the current situation, if there is a DiagnosticEmitter
on the stack during a panic, the program will abort anyway, so maybe it's not that bad.
The problem is that then, any panic in a destructor will cause the program to abort.
I think they mean we should abort on unwind edges from drop glue that is executed as part of unwinding already.
Why don't you just call abort in drop()'s landing pad?
That’s a fair point. I guess there’s a few factors here:
PANIC_COUNT
is/was easier to implement overall and stays all within the library which also provides the unwinding implementation;abort
we use for that a nounwind, which is some extra work.I think they mean we should abort on unwind edges from drop glue that is executed as part of unwinding already.
But then won't we need the double drop glue? Either
1) All destructors are nounwind
- DiagnosticEmitter
aborts when abandoned.
2) Double-unwinding aborts using TLS - this uses TLS which interacts badly with coroutines
3) Unwinding pushes a stack frame that aborts when unwound past - potential for infinite recursion?
4) Double-panicking works somehow - must figure out how
Doubling up all drop-glue will probably require Rust 2.0 - because the destructor for Vec
etc. also needs to be doubled to call the right drop glue.
I think they mean we should abort on unwind edges from drop glue that is executed as part of unwinding already.
I'm unfortunately not that familiar with the terminology in this community so I don't fully understand what you meant with that. (I'm sorry for that.) But I think @arielb1 understood me correctly.
To say it as plain as possible again: I do like C++'s exception rules for destructors more than Rust's complicated system in std::panic
(and the unwind module). This means that any exception inside a destructor will unwind the stack but be catched at the destructor's method boundary and lead to an process abort. It's thus impossible for a stack to be unwound twice.
The problem is that then, any panic in a destructor will cause the program to abort.
Yeah... I understand why panicking destructors are a thing that can be quite useful. So it's not like I don't understand it.
But it still just feels so incredibly "wrong" that it's possible in a "safe" language for a program to one time survive a thread unwind (after which it could spawn it again) and one time where the whole process crashes. And all of that is decided on a "whim" over wether a destructor is run as part of the regular flow or as part of a active unwind.
You know? That's just doesn't deterministic and I personally think that determinism should be something very valuable in programming languages.
@arielb1 Consider this “annotated” MIR CFG for some code I had from a while ago. Circled in red is the drops executed during the unwind, you can easily add edges to abort
/terminate
/etc in MIR (also in red, but lines this time)
The only requirements here are that abort
is diverging and is nounwind and that it is possible to produce correct code in this case (i.e. landingpad
and cleanuppad
support the usecase for all platforms). But, that happens to encode a requirement of no double-panic in MIR which does not look like the right place for the invariant to me.
EDIT: made the image a little bit smaller.
That's my option 3 - it is equivalent to there being a nounwind frame wrapping each landing pad.
I think we did not implement it that way because a destructor that e.g. calls a logger that panics could cause infinite recursion. That does not sound like a real problem - I think we should go with it if we do not want double unwinding.
@arielb1 Ping.. do we have any solutions about this issue?
Hey guys? It has been 2 weeks! What could we do to solve this problem? Do we have any solutions for this? What could we help to do to solve this problem? Or we should just give up developing coroutine library with Rust?
@nagisa @sfackler @arielb1
Hey guys? It has been 2 weeks! What could we do to solve this problem? Do we have any solutions for this? What could we help to do to solve this problem?
As been pointed out already, the use of TLS is very widespread in the standard library (its not just PANIC_COUNT). If you have an idea how to get rid of all TLS in the standard library without compromising on what TLS provides us, the fastest way is to submit relevant pull requests against this repository containing said changes.
Alternative options: use libcore instead and develop your own coroutine-compatible libstd on top of that.
@nagisa I'm pretty sure that @zonyitoo and me would gladly write any necessary code for that. But it'd be cool if any of the possible solutions gets green lit before we start making changes. Since this is a major issue, requiring non-trivial changes. Unfortunately no actual solution was chosen yet, nor where any solutions or recommendation been given by "official" Rust developers.
BTW: Using libcore is absolutely unacceptable since pretty much every crate uses libstd. And removing TLS is no small feat, since it does require modifying the landing pads of destructors, which has not been endorsed yet (and probably would get rejected at this point), while still very likely requiring a huge amount of time to get into rustc's code base to make that change (which - as I said - I'd gladly do but only if it's merged at the end).
Unfortunately no actual solution was chosen yet, nor where any solutions or recommendation been given by "official" Rust developers.
Choosing a solution for non-trivial problems is exactly what the RFC process is for.
I’m pretty sure a proposal to reimplement standard library parts which currently use TLS to use something else would be accepted provided thread-safety and speed are not compromised on.
The question is what you’d replace TLS with? I, personally, have no idea provided the requirements.
modifying the landing pads of destructors, which has not been endorsed yet (and probably would get rejected at this point)
The way I see it modification of codegen is the least of the problems (if necessary at all). Removing TLS doesn’t imply doing that either. While the discussion naturally flowed towards the destructors, libstd also uses thread locals in ways which aren’t related to panicking. You’ll need to fix these as well, at which point the PANIC_COUNT
could fix itself without doing anything exceptional.
Using libcore is absolutely unacceptable since pretty much every crate uses libstd.
You can have your own libstd
with compatible API linked in place of the libstd distributed with rustc without any issues. (--extern std=yourlib.rlib
)
Choosing a solution for non-trivial problems is exactly what the RFC process is for.
I do believe that this is a hard change, but depending on the solution for this it's not a "substantial" one. The only thing right now that might require a RFC would be "abort on panic inside destructors", since it does change semantics (where you can safely panic). (I don't think that this will affect many people though). I do think that I should have opened this issue in the RFC repo though.
I’m pretty sure a proposal to reimplement standard library parts which currently use TLS to use something else would be accepted provided thread-safety and speed are not compromised on. If you're not writing that RFC, I have to and that means it's filled with "Okay boys... Now you tell me what to write here, because I can't possibly decide it. :wink:".
The question is what you’d replace TLS with? I, personally, have no idea provided the requirements.
Which is why it'd be cool if @rust-lang/libs could (finally) say something or at least give some hints about this. :confused:
libstd also uses thread locals in ways which aren’t related to panicking
It's only LOCAL_STDOUT
, THREAD_RNG_KEY
and THREAD_INFO
. Should be doable.
--extern std=yourlib.rlib
Thanks! Didn't knew that.
cc @rust-lang/libs
I'm just gonna take a moment to point out that 64-bit Windows has UMS threads, which give you proper OS threads that support TLS and such, but using a user mode scheduler so you effectively have coroutines, and also have other advantages like returning control to your scheduler whenever you block in a syscall. Doesn't solve the problem on other platforms though.
Doesn't solve the problem on other platforms though.
Well, obviously this is not a good solution.
I am happy to make any PRs for this, but what I am asking for is Rust's team to tell me which solution would be acceptable.
I'd be in favor of either
Again related to 'compilation scenarios' you could easily imagine a global switch that turned on 'slow' TLS and did the right thing everywhere.
If you have an idea how to get rid of all TLS in the standard library without compromising on what TLS provides us, the fastest way is to submit relevant pull requests against this repository containing said changes.
Is it really so unlikely that coroutine aware TLS could be achieved? I can see UMS becoming a thing going forward which would make this much easier so maybe it's not the worst idea in the world to investigate this?
What's the "dependency graph" of TLS within the standard lib? That is:
Without this information explicitly stated it is very hard to know which way is the best way forward.
In my opinion the best solution to this problem is not to fork the standard library, but to:
Send
. Maybe we need a CoroutineSend
trait that detects this. That way you can constrain your library API and prevent users from using TLS. The main issues I see here is that Traits do not apply to functions, only to types, but a function using TLS is no different than a closure (it has some state at the thread level). Depending on how big is the dependency graph of TLS in the standard lib, we might be able to split that out in a std::tls
module and abstract around it, so that users can avoid it by using your coroutine::cls
equivalents instead interfacing with the rest of std
. I think we need more info here to asses this properly.
EDIT: the objective should not be to encourage people to fork libstd to implement green threading, but to make it really easy to implement different green threading solutions as a library that can work "as seamlessly as possible" with libstd.
@gnzlbg TLS is not a global variable, it's very much local. You could think of it like a hidden parameter that is passed to all functions.
Isn't the TLS variable available during the whole lifetime of a thread? (independently of in which function your are)? (in C++ is at least so). Need to read more on Rust thread local variables but I assumed they would work the same.
@gnzlbg global variables have to deal with concurrency, TLS does not. TLS gets away with a RefCell
and without unsafe. A global variable however needs a mutex or something comparable.
Is a closure using a thread local variable Send
? (I suppose so)
EDIT: and if that is the case, does the compiler treat TLS as volatile (i.e. does it reload the address of the TLS variable on every access)?
@gnzlbg Right now TLS does not have to be treated as volatile because during execution of a function you'll never suddenly be on a different thread. If a closure is sent to a different thread then it'll see the TLS of the thread it was sent to and is running on.
@gnzlbg Right now TLS does not have to be treated as volatile because during execution of a function you'll never suddenly be on a different thread.
@retep998 Right, but if a coroutine yields before finishing, and gets sent to another thread where it resumes execution, when it access TLS it will see the values of the thread it was sent from (which might not longer exist! and is unsafe!).
I can only think of two different situations involving coroutines and TLS, and none of them make sense to me in practice.
In Case A, when execution is resumed, the TLS variables still refer to the storage of the thread the coroutine was migrated from. There are two options, either we update them to refer to the TLS the coroutine was migrated to, or we prevent sending coroutines that access TLS between threads.
I think that updating TLS to refer to completely different storage on resumption is a recipe for disaster since it makes very hard to reason about what is going on. So in my opinion, the best thing would be to forbid Case A completely by forbidding coroutines that access TLS from being migrated between threads. That probably means making thread local storage !Send
such that closures/types that refer or contain them also become !Send
.
Case B makes no sense to me either, since in this case the variable should not be thread local in the first place. Still, Case B is safe as long as the thread the coroutine was migrated from outlives the thread the coroutine has been migrated to. So this could be allowed if the lifetimes can be enforced.
Well, it is not a wise move to forbid all coroutines from using TLS values. Adding that constraint on TLS value will make users confused and it also becomes an obstacle for users to migrate their programs to use coroutines.
For your first case, there is a good solution: Make TLS to be volatile. If a coroutine is switched out and migrated to another thread, then when execution is resumed in another thread, all subsequence TLS value access will go to the TLS that coroutine was migrated to.
We can do that by adding a flag to compiler.
For your first case, there is a good solution: Make TLS to be volatile. If a coroutine is switched out and migrated to another thread, then when execution is resumed in another thread, all subsequence TLS value access will go to the TLS that coroutine was migrated to.
I suggested adding volatile above as well but I don't like that myself: at every suspension point of a coroutine, every time the corrutine is resumed, all the values of all TLS might have changed. I think this will make it really hard to reason about what a coroutine is going to do since it is impossible to reason about its current state (think about a generator, suspending inside of a loop, where at every loop iteration you might have different TLS values).
I am pretty sure that there are valid use cases for wanting to silently update TLS on coroutine resumption, but I think that forbidding it will prevent hard to debug user errors (the scheduler might migrate coroutines at will), and when the user really wants the coroutine state to change on resumption, there are more explicit alternatives (e.g. get a handler to the thread in which the coroutine is currently running and use that to access some "thread local" state).
Another reason to be against making all TLS volatile, even when implemented as opt in via a compiler flag, is that it pessimizes all the code that access TLS and that is not migrated between threads.
Still, the most important reason is being able to reason about the code. In C, C++, and D, migrating a coroutine/fiber that access thread local storage between threads is undefined behavior, and they cannot catch it at compile time. Is there any low level language that allows migrating coroutines that access thread local storage between threads? What semantics do they chose?
As far as I know, non of them (system programming languages) allows migrating coroutines that access thread local storage between threads. Take Go as an example, the only way you can use TLS is by cgo, and they run FFI calls in a separate thread (can't be sure).
Your case about suspending inside of a loop is very convincing! In that case, making TLS volatile is not a good choice, or say, it may make the result of program wrong completely!
Back to this issue, is there a way to get rid of those TLS usages in libstd? If we want to make TLS !Send
, all of them has to be gone.
The issue
thread_local!
is used in the stdlib, which does not work well with Coroutines. Repeated access to a TLS variable inside a method might be cached and optimized by LLVM. If a coroutine is transferred from one thread to another this can lead to problems, due to a coroutine still using the cached reference to the TLS storage from the previous thread.What is not the issue
TLS being incompatible with coroutines for the most part (e.g. here) is well known and not an issue per se. You want to use
rand::thread_rng()
with coroutines? Just userand::StdRng::new()
instead! Most of the time it's just quite easy to circumvent the TLS by simply using something different. This is not true for the stdlib though. One way or the other you're using it somewhere probably.Possible solutions
thread_local!
. I think that this could be hard to achieve in a performant way though.PANIC_COUNT
and it's wonky implementation and still make entirely sure that a stack is unwound twice. Other uses of TLS inside the stdlib could be wrapped insideinline(never)
without causing large overheads.I hope we can find a solution for this as this is really a huge problem for using stackful coroutines with Rust and who doesn't want "Go" but with Rust's syntax, eh? :wink: