Scoped thread implicit join doesn't wait for thread locals to be dropped

pvillela commented 11 months ago

When a scoped thread is implicitly joined, the destructors of thread-local variables are not guaranteed to have completed when the scope is exited. When a scoped thread is explicitly joined, however, the destructors of thread-local variables do complete before the scope is exited.

[..]

Scoped thread implicit join fails 'happens before' guarantee

As documented in the Rust Atomics and Locks book by Mara Bos, "joining a thread creates a happens-before relationship between the joined thread and what happens after the join() call". This is not true for implicit joins on scoped threads.

I tried this code:

use std::{hint::black_box, mem::replace, thread};

static mut CONTROL: Option<String> = None;

fn main() {
    // Explicit join of scoped thread.
    {
        for _ in 0..5000 {
            thread::scope(|s| {
                let h = s.spawn(|| {
                    FOO.with(|v| {
                        black_box(v);
                    });
                });
                h.join().unwrap();
            });

            // SAFETY: this happens after the thread join, which provides a `happens before` guarantee
            let v = unsafe { &CONTROL };
            black_box(format!("{:?}", v));
            print!(".");
        }
        println!("Completed EXPLICIT join loop.");
    }

    println!();

    // Implicit join of scoped thread.
    {
        for _ in 0..5000 {
            thread::scope(|s| {
                let _h = s.spawn(|| {
                    FOO.with(|v| {
                        black_box(v);
                    });
                });
            });

            // SAFETY: this happens after the implicit thread join, which should provide a `happens before` guarantee
            let v = unsafe { &CONTROL };
            black_box(format!("{:?}", v));
            print!(".");
        }
        println!("Completed IMPLICIT join loop.");
    }
}

struct Foo(());

impl Drop for Foo {
    fn drop(&mut self) {
        // SAFETY: this happens before the thread join, which provides a `happens before` guarantee
        let _: Option<String> = unsafe { replace(&mut CONTROL, Some("abcd".to_owned())) };
    }
}

thread_local! {
    static FOO: Foo = Foo(());
}

I expected to see this happen: Both the explicit join and implicit join portions of the code should terminate normally.

Instead, this happened: The explicit join portion terminates normally as expected, but the implicit join portion panics (the number of iterations on the loop before it panics varies) .

This issue is related to issue #116179.

Meta

The same behaviour is observed on the nightly version nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.74.0-nightly (0288f2e19 2023-09-25).

rustc --version --verbose:

rustc 1.72.1 (d5c2e9c34 2023-09-13)
binary: rustc
commit-hash: d5c2e9c342b358556da91d61ed4133f6f50fc0c3
commit-date: 2023-09-13
host: x86_64-unknown-linux-gnu
release: 1.72.1
LLVM version: 16.0.5

Backtrace

``` thread 'main' panicked at 'byte index 5 is out of bounds of `hP�^`', library/core/src/fmt/mod.rs:2324:30 stack backtrace: 0: rust_begin_unwind at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5 1: core::panicking::panic_fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14 2: core::str::slice_error_fail_rt 3: core::str::slice_error_fail at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/str/mod.rs:87:9 4: core::str::traits:: for core::ops::range::Range>::index at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/str/traits.rs:235:21 5: core::str::traits:: for str>::index at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/str/traits.rs:61:15 6: ::fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:2324:30 7: ::fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/string.rs:2271:9 8: <&T as core::fmt::Debug>::fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:2268:62 9: core::fmt::builders::DebugTuple::field::{{closure}} at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/builders.rs:322:17 10: core::result::Result::and_then at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/result.rs:1319:22 11: core::fmt::builders::DebugTuple::field at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/builders.rs:309:35 12: core::fmt::Formatter::debug_tuple_field1_finish at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:2035:9 13: as core::fmt::Debug>::fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/option.rs:559:37 14: <&T as core::fmt::Debug>::fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:2268:62 15: core::fmt::rt::Argument::fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9 16: core::fmt::write at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21 17: core::fmt::Write::write_fmt at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:192:9 18: alloc::fmt::format::format_inner at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/fmt.rs:610:16 19: alloc::fmt::format::{{closure}} at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/fmt.rs:614:34 20: core::option::Option::map_or_else at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/option.rs:1180:21 21: alloc::fmt::format at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/fmt.rs:614:5 22: thread_local_destruction_in_scoped_thread2::main at ./general/src/bin/thread_local_destruction_in_scoped_thread2.rs:41:23 23: core::ops::function::FnOnce::call_once at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/ops/function.rs:250:5 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. ```

workingjubilee commented 11 months ago

You know, technically, it only says it has the "happens before" on the API docs of ScopedJoinHandle::join. I'm guessing there being a difference is not intended behavior, however. cc @m-ou-se

workingjubilee commented 11 months ago

This error is spotty, sometimes it happens, sometimes it doesn't. It happens since at least 1.63, when this function was stabilized.

workingjubilee commented 11 months ago

nope, seems this has been happening since day -128, checked with 7737e0b5c 2022-04-04 using

slightly modified code for version changes

```rust #![feature(bench_black_box)] #![feature(scoped_threads)] use std::{hint::black_box, mem::replace, thread}; static mut CONTROL: Option = None; fn main() { // Explicit join of scoped thread. { for _ in 0..5000 { thread::scope(|s| { let h = s.spawn(|_| { FOO.with(|v| { black_box(v); }); }); h.join().unwrap(); }); // SAFETY: this happens after the thread join, which provides a `happens before` guarantee let v = unsafe { &CONTROL }; black_box(format!("{:?}", v)); print!("."); } println!("Completed EXPLICIT join loop."); } println!(); // Implicit join of scoped thread. { for _ in 0..5000 { thread::scope(|s| { let _h = s.spawn(|_| { FOO.with(|v| { black_box(v); }); }); }); // SAFETY: this happens after the implicit thread join, which should provide a `happens before` guarantee let v = unsafe { &CONTROL }; black_box(format!("{:?}", v)); print!("."); } println!("Completed IMPLICIT join loop."); } } struct Foo(()); impl Drop for Foo { fn drop(&mut self) { // SAFETY: this happens before the thread join, which provides a `happens before` guarantee let _: Option = unsafe { replace(&mut CONTROL, Some("abcd".to_owned())) }; } } thread_local! { static FOO: Foo = Foo(()); } ```

joboet commented 11 months ago

Explicit joins go through the platforms join operation, which waits until the spawned thread has actually, fully terminated. Therefore, operations performed in TLS destructors are seen by the joining thread.

Implicit joins on the other hand are implemented via an atomic counter that tracks the number of extant threads. The spawning thread parks at the end of the scope until the counter has been set to zero. Scoped threads decrement the counter when exiting their main function and unpark the spawning thread if necessary. The bug described here occurs because the TLS destructors are run after the main function completes, therefore there is no relationship formed between their operations and the counter load at the end of the scope.

As I see it, there are two ways to resolve this:

Explicitly state that there are no ordering guarantees for TLS destructors. The bug described here can only occur in unsafe code, as safe code can neither store referenced data in TLS nor write to shared data without proper synchronization.
Store all created threads in a linked list and use the platform join operation to join the remaining threads, providing the synchronization required for the code in this issue to be sound.

m-ou-se commented 10 months ago

Thanks for the bug report!

It's a good question whether this behaviour is acceptable or not. In many ways, TLS destructors behave a bit weirdly.

As mentioned above, the implicit joining behaviour does properly wait for the thread's main function to finish, so this is never a problem in safe code.

So, I don't think I'd categorize this as a high priority bug, or perhaps not even as a bug at all. But I do agree it would be good to fix this, if we can find a good way to do that.

Keeping track of all the join/thread handles in some sort of (linked) list would be problematic, as that'd result in an ever growing list (~a memory leak) when using a long-lived scope. (E.g. a webserver that all runs in a single thread::scope, which spawns a thread for every request. It'd never exit the scope, so the list would just grow forever.)

I'll investigate to see if we can find a better solution. Perhaps the counter needs to be decremented by a TLS destructor itself, or we can trigger TLS destruction ourselves, or perhaps we can think of another solution.

Amanieu commented 10 months ago

I'll investigate to see if we can find a better solution. Perhaps the counter needs to be decremented by a TLS destructor itself, or we can trigger TLS destruction ourselves, or perhaps we can think of another solution.

This might work. TLS destructors are executed in reverse order of registration, so if we register one right at the start of the scoped thread, it should execute last after any other TLS destructors in the thread.

joboet commented 10 months ago

That's not guaranteed and very unreliable. If a destructor registers another destructor, that one will run after all others. On platforms that use keyed TLS, there is no predefined order at all.

workingjubilee commented 10 months ago

It could be acceptable to define the implicit join as inherently slightly racy if we're confident that it has no unsoundness per se in doing so (i.e. it is very obvious this is bad if combined with the aforementioned unsafe code), as we can shift the blame to the unsafe code.

However, that leaves the question:

            // SAFETY: this happens after the implicit thread join
            // which should provide a `happens before` guarantee
            let v = unsafe { &CONTROL };
            black_box(format!("{:?}", v));
            print!(".");
        }
        println!("Completed IMPLICIT join loop.");
    }
}

impl Drop for Foo {
    fn drop(&mut self) {
        // SAFETY: this happens before the thread join
        // which should provide a `happens before` guarantee
        let _: Option<String> = unsafe { replace(&mut CONTROL, Some("abcd".to_owned())) };
    }
}

Would either of these unsafe blocks, themselves, be okay, if the other one didn't exist? It seems the answer is that the unsafe read of the static mut is never okay, at least.

Also, in the previous example, it should be noted that executing with optimizations appears to make the first mem::drop and the implicit scoped thread's TLS Drop::drop significantly more likely to race. The thread local's Drop sometimes doesn't even get to its print before the mem::drop is over with.

Amanieu commented 9 months ago

We discussed this in the library team meeting and the decision was the keep the existing behavior. However we are open to any suggestions on how the documentation can be improved to be clearer. Feel free to open a doc PR.

workingjubilee commented 9 months ago

That may be more easily done if we know what the motivation was for deciding to preserve the existing behavior, contra the alternatives? And which unsafe block is the one that "actually" contains the UB?

m-ou-se commented 5 months ago

That meeting happened on Zulip, here.

What we discussed is that the way to solve this involves keeping track of all the join handles in the main thread, which can pile up (behave like a memory leak) for a long living scope, so we see no other resolution than just accepting the current behaviour of scope().

m-ou-se commented 5 months ago

Note that the thread_local documentation says:

Note that a “best effort” is made to ensure that destructors for types stored in thread local storage are run

Although it's easy to argue that this "best effort" is only about whether they are run or not, not whether they are run at the expected time.

m-ou-se commented 5 months ago

I believe we don't guarantee anywhere that TLS destructors will run before .join(). (I wonder how much would break if that behaviour changed.)

But if we do want to guarantee that, we could document that scope() does not implicitly call .join(), but instead that it simply waits for the main function of all spawned threads to have returned.

pvillela commented 5 months ago

It would be VERY helpful to have a guarantee that TLS destructors run before a call to join() returns.

On Thu, Mar 14, 2024 at 11:03 Mara Bos @.***> wrote:

I believe we don't guarantee anywhere that TLS destructors will run before .join(). (I wonder how much would break if that behaviour changed.)

But if we do want to guarantee that, we could document that scope() does not implicitly call .join(), but instead that it simply waits for the main function of all spawned threads to have returned.

— Reply to this email directly, view it on GitHub https://github.com/rust-lang/rust/issues/116237#issuecomment-1997666561, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAD4LPLTBGTJQUIUBGDACN3YYG32NAVCNFSM6AAAAAA5LI32UKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSOJXGY3DMNJWGE . You are receiving this because you authored the thread.Message ID: @.***>

talchas2 commented 4 months ago

Keeping track of all the join/thread handles in some sort of (linked) list would be problematic, as that'd result in an ever growing list (~a memory leak) when using a long-lived scope

Note that this is already a memory leak for exactly the same sort of reason inside of the OS's thread library. Any joinable thread that has not been joined will require some memory to handle that state and pass through any return value supported by the API.

It would increase the size of the memory leak of course, likely by around 2x at most if the OS has optimized the joinable-but-exited thread state down to minimal amounts.

RalfJung commented 3 months ago

Ouch, yeah that's quite the surprising footgun. Personally I'd rather have a bit of memory accumulation in thread::scope than risk UB due to missing synchronization. As @talchas2 said, this doesn't make any O(1) memory profile into O(n), at least not when taking into account pthread/kernel resource usage -- it's "just" a different constant in an O(n).

bstrie commented 2 months ago

To clarify, is there a consensus on whether this is a soundness bug, or just a documentation bug?

RalfJung commented 2 months ago

Nominating that question for libs-api discussion.

There was a poll among the team and it looks very much like the conclusion is "just a documentation bug", but it'd be good to get an answer that explicitly accounts for @talchas2' comment above.

m-ou-se commented 2 months ago

Note that this is already a memory leak for exactly the same sort of reason inside of the OS's thread library. Any joinable thread that has not been joined will require some memory to handle that state and pass through any return value supported by the API.

We explicitly detach those threads when the join handle is dropped, so those threads are no longer joinable, so are not leaking memory.

talchas2 commented 2 months ago

Yeah, that's true, I missed that.

m-ou-se commented 2 months ago

It would be VERY helpful to have a guarantee that TLS destructors run before a call to join() returns.

That is already the behavior today. Explicitly calling .join() calls pthread_join or similar, which will wait until after TLS destructors have run. (Although we don't explicitly guarantee that. Perhaps we should!)

This issue is about what happens to threads that you have not explicitly .join()ed when exiting the thread::scope in which they were spawned. That currently only waits for the main function of those threads to return. They are not 'joined', because they have already been 'detached' (when the join handle was dropped).

m-ou-se commented 2 months ago

Proposing to close this issue with resolution "not a bug, current behavior at scope exit needs to be properly documented".

@rfcbot close

rfcbot commented 2 months ago

Team member @m-ou-se has proposed to close this. The next step is review by the rest of the tagged team members:

[x] @Amanieu
[ ] @BurntSushi
[x] @dtolnay
[ ] @joshtriplett
[x] @m-ou-se

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

pvillela commented 2 months ago

@m-ou-se reply:on https://github.com/rust-lang/rust/issues/116237#issuecomment-2217941740:

It would be VERY helpful to have a guarantee that TLS destructors run before a call to join() returns.

That is already the behavior today. Explicitly calling .join() calls pthread_join or similar, which will wait until after TLS destructors have run. (Although we don't explicitly guarantee that. Perhaps we should!)

I have created a separate issue #127571 about this.

rfcbot commented 1 month ago

:bell: This is now entering its final comment period, as per the review above. :bell:

matthieu-m commented 1 month ago

The usecase of a webserver running in scope and creating one thread per connection does pose an interesting challenge, though it may not be insurmountable.

The idea would be to have the threads that are stopping participate to the joining effort. In short:

A single Arc<Mutex<Option<JoinHandle>>> is created per scope (or an array of such mutexes).
When a thread exits (ie, as the last operation of main), injected after the user-run closure:
- It creates a handle to itself.
- It locks the mutex, and swaps its handle for whatever is there.
- If a handle was present, it joins it.
When scope ends, whatever handle is left there is joined.

And that's it. All threads are joined.

I am not sure if the idea is completely feasible, though.

Specifically, there's the issue of auto-join vs user-join. Today one cannot have two JoinHandle to the same thread, and while std could probably work around that (internally use an Arc to an atomic thread-handle, swap it for 0 on first join?) or just use lower-level APIs, it's not clear if it's possible, or what the cost would be.

Advantages:

O(1) memory, possibly O(cores), but still independent of workload.
Works even if the closure running in scope never yields and spawns millions of short-lived threads.
Does not require any active participation from the user.

Disadvantages:

Extra complexity/cost to JoinHandle, beyond scope.
Extra complexity/cost to scope.
Slower termination of threads.
- Could be reduced by having an array of N (2 x cores) mutexes, and using a fast-hash of the thread-id to pick which mutex to go to.

For bonus points, ditch JoinHandle for a single u64/pointer and using an atomic rather than a mutex of option.

rfcbot commented 1 month ago

The final comment period, with a disposition to close, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

rust-lang / rust

Scoped thread implicit join doesn't wait for thread locals to be dropped #116237

Scoped thread implicit join fails 'happens before' guarantee

Meta