Open RalfJung opened 11 months ago
If I understand correctly, the core of the problem is described well by @adamreichold:
I think what the PyO3 discussion shows is that when libraries venture into language territory - as PyO3's handling of Python's heap, GIL and associated invariants seems to do - a way to exert more control over the boundaries which make up a thread would be really helpful.
For example, something like an "as-if" version of thread::spawn which gives me all Rust-level guarantees and type system interactions currently reserved for operating system threads, but without the overhead of actually creating a new operating system thread. Though I am not sure whether this is possible at all, since code written e.g. in C might then break these "as-if" barriers, similar to the set_env mess?
Admittedly, this is also probably not the right issue to discuss this, but I just wanted to highlight that this is not just an unsafe library X versus Y problem, but a more fundamental question of how libraries can put the type system notions currently bound to operating system threads to their own uses. So maybe a rather long-winded +1 to the parent comment...
Indeed Rust threads are currently tied to OS threads, and TLS makes that visible. What I don't understand is which benefit such an "as-if" thread would have.
Regarding the concrete suggestion: threads don't really have any type system interactions. The Rust compiler doesn't even know what a thread is. Send and Sync are concepts introduced by the standard library, with some hooks for the compiler to enforce them on static variables, but otherwise the Rust type system really doesn't understand what a thread is. The only thing that distinguishes thread::spawn from other functions is its signature. So one can have an "as-if" spawn function by implementing a function with the same signature.
The issue isn't the type system effects, the issue is the runtime effects, specifically, thread-local state. I think what you are asking is, can we "swap out" the thread ID and thread-local state of a thread temporarily, and then later swap the original state back in. From a purely theoretical perspective this is an entirely reasonable operation and I don't think it would be too hard to implement in Miri. What I don't know is if the way thread-local state works on real OSes makes it possible to have a mechanism like that. I think Rust might be fundamentally restricted here by the APIs that platforms provide.
I will note that even without thread_local!, it would be considered sound to send non-Send data via a global static as long as the receiver ensures (e.g. by comparing thread IDs) that it is the same thread as the sender. This is implemented by the send_wrapper crate. So it's not just specifically thread_local!/scoped_thread_local! that's a problem here, it's fundamentally trying to use Send for something that's unrelated to OS threads, which is just in direct contradiction with how Send was designed.
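Roughly, the send_wrapper idea looks like this (a minimal sketch with made-up names like ThreadBound, not the crate's actual code; the real crate also has to guard cloning and dropping on the wrong thread):

```rust
use std::thread::{self, ThreadId};

// A wrapper that is Send for any T, on the claim that the inner value is
// only ever touched on the thread that created it (checked via ThreadId).
struct ThreadBound<T> {
    value: T,
    owner: ThreadId,
}

// Claimed sound because every access re-checks the thread identity below.
unsafe impl<T> Send for ThreadBound<T> {}

impl<T> ThreadBound<T> {
    fn new(value: T) -> Self {
        ThreadBound { value, owner: thread::current().id() }
    }

    fn get(&self) -> &T {
        assert_eq!(thread::current().id(), self.owner, "accessed from a different thread");
        &self.value
    }
}
```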
Reposting here in a suitably embarrassed state to atone for messing up the list issue:
Hm, TLS is not really a library feature, it's a language feature, and it's not like there is an open question of whether or not we have TLS... but sure I can add it to the list.
I think what the PyO3 discussion shows is that when libraries venture into language territory - as PyO3's handling of Python's heap, GIL and associated invariants seems to do - a way to exert more control over the boundaries which make up a thread would be really helpful.
For example, something like an "as-if" version of thread::spawn which gives me all Rust-level guarantees and type system interactions currently reserved for operating system threads, but without the overhead of actually creating a new operating system thread. Though I am not sure whether this is possible at all, since code written e.g. in C might then break these "as-if" barriers, similar to the set_env mess?
Admittedly, this is also probably not the right issue to discuss this, but I just wanted to highlight that this is not just an unsafe library X versus Y problem, but a more fundamental question of how libraries can put the type system notions currently bound to operating system threads to their own uses. So maybe a rather long-winded +1 to the parent comment...
pyo3 is abusing Send because they don't have an auto trait that reflects what they really want
Note that scoped_tls is a problem even if pyo3 replaces Send with a custom auto trait: https://github.com/PyO3/pyo3/issues/3640
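Concretely, the side channel looks roughly like this (a sketch assuming the scoped-tls crate; run_send_closure is a made-up stand-in for an API like Python::allow_threads that bounds its closure by Send but, unlike thread::spawn, keeps running it on the same OS thread):

```rust
use std::cell::Cell;
use scoped_tls::scoped_thread_local;

scoped_thread_local!(static SLOT: Cell<i32>);

// Stand-in for an API that uses a Send bound to model a boundary,
// but does not actually switch OS threads.
fn run_send_closure(f: impl FnOnce() + Send) {
    f();
}

fn main() {
    let local = Cell::new(0);
    SLOT.set(&local, || {
        // `&Cell<i32>` is not Send, so it cannot be captured here -- but the
        // closure can still reach `local` through the scoped thread-local.
        run_send_closure(|| SLOT.with(|c| c.set(42)));
    });
    assert_eq!(local.get(), 42);
}
```

On a real second thread the `with` call would simply panic because the slot is not set there, which is exactly why actually spawning a thread closes the loophole.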
Note that we are currently leaning towards a solution to both thread-identity-based soundness loopholes by trying to avoid the abuse, i.e. by actually starting a new thread, which is admittedly a bit hilarious, c.f. https://github.com/PyO3/pyo3/pull/3646
Reposting here in a suitably embarrassed state to atone for messing up the list issue:
No worries, I think this was indeed inevitable, the comment box at the bottom is too tempting. I should probably have made this a wiki page or a markdown file but those are more annoying to maintain...
Indeed Rust threads are currently tied to OS threads, and TLS makes that visible. What I don't understand is which benefit such an "as-if" thread would have.
Our current solution to both soundness loopholes works by actually running the closure passed to Python::allow_threads on a separate thread, so that TLS cannot be used as a side channel and wrappers for non-Send data based on thread identity fail because it is indeed a different thread and not just an API that looks like thread::spawn, c.f. https://github.com/PyO3/pyo3/pull/3646
So such an "as-if" spawn would allow me to implement that API without the cost of using actual threads because I do not need parallelism, I just want any state tied to the current thread to be unavailable for the duration of the closure.
I have not figured out whether this would also help other situations like async runtimes which want to use non-Send tasks, i.e. by isolating those tasks from any thread identity which might make it visible to them that they are not actually polled on a freshly started OS thread. But I think the general theme is the same, i.e. give libraries explicit control of the lifecycle of thread-bound data.
Regarding the concrete suggestion: threads don't really have any type system interactions. The Rust compiler doesn't even know what a thread is. Send and Sync are concepts introduced by the standard library, with some hooks for the compiler to enforce them on static variables, but otherwise the Rust type system really doesn't understand what a thread is. The only thing that distinguishes thread::spawn from other functions is its signature. So one can have an "as-if" spawn function by implementing a function with the same signature.
The issue isn't the type system effects, the issue is the runtime effects, specifically, thread-local state. I think what you are asking is, can we "swap out" the thread ID and thread-local state of a thread temporarily, and then later swap the original state back in. From a purely theoretical perspective this is an entirely reasonable operation and I don't think it would be too hard to implement in Miri. What I don't know is if the way thread-local state works on real OSes makes it possible to have a mechanism like that. I think Rust might be fundamentally restricted here by the APIs that platforms provide.
First, yes, I want to run some closure as if it was executed on a freshly started OS thread without having to pay the cost of actually having to start one. This is what I mean by an "as-if" version of thread::spawn, so indeed I am concerned with the runtime effects.
What I meant by the somewhat grandiose "type system interactions" is also basically what you said: Send and Sync have a certain meaning currently tied to OS-level threads. I would like it if libraries could extend (but not modify) this meaning for their purposes. I am not sure what the right mechanism is, but basically I want PyO3 to be able to use Send and extend the isolation properties it provides to other boundaries, i.e. between code holding the GIL and code that does not. Similarly, async runtimes may want to establish similar boundaries between tasks instead of OS threads.
TLS and thread ID are the two most obvious ways in which libraries are currently "sabotaged" when they try to create such boundaries, because these can be used to "smuggle" data through these boundaries safely. It would be great if there were ways to plug these loopholes, at least in such a manner that it would require unsafe code, so that libraries can at least fall back to the unsafe library X versus Y situation. Not sure if we can get better than that due to interactions with lower-level OS facilities, see our old friend set_var.
I will note that even without thread_local!, it would be considered sound to send non-Send data via a global static as long as the receiver ensures (e.g. by comparing thread IDs) that it is the same thread as the sender. This is implemented by the send_wrapper crate. So it's not just specifically thread_local!/scoped_thread_local! that's a problem here, it's fundamentally trying to use Send for something that's unrelated to OS threads, which is just in direct contradiction with how Send was designed.
Indeed, we ran into the send_wrapper issue earlier than into the even worse scoped-tls one. When I remember to be precise, I call these issues "thread identity" based instead of "thread-local storage" based. (There is probably a lot of code out there which manually implements TLS with a specific life cycle by using HashMap<ThreadId, T>, with the thread_local crate being the most centralized instance.)
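I mean code along these lines (a toy sketch with a made-up PerThread type; it keeps the global lock held during the callback and never cleans up entries, while the thread_local crate is a much more careful version of the same idea):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread::{self, ThreadId};

// "Manual TLS" keyed by thread identity: a shared map from ThreadId to
// per-thread state, with a life cycle controlled by whoever owns the map.
struct PerThread<T> {
    slots: Mutex<HashMap<ThreadId, T>>,
}

impl<T> PerThread<T> {
    fn new() -> Self {
        PerThread { slots: Mutex::new(HashMap::new()) }
    }

    fn with<R>(&self, init: impl FnOnce() -> T, f: impl FnOnce(&mut T) -> R) -> R {
        let mut slots = self.slots.lock().unwrap();
        let slot = slots.entry(thread::current().id()).or_insert_with(init);
        f(slot)
    }
}
```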
Maybe to add: I don't think just using a different auto trait for each of these boundaries will work (for the current PyO3 design it would definitely not work due to scoped-tls), but we do need to piggyback onto the existing traits which already pervade the ecosystem and are close enough insofar as they model boundaries between different strands of code execution. I just want libraries to be able to add further more fine-grained boundaries than OS threads.
I just want libraries to be able to add further more fine-grained boundaries than OS threads.
Yeah, I get that.
Thinking about what Send/Sync mean logically in my formal model of the Rust type system, I think the point is you want some form of "execution context" where Rust ensures that types cannot, in general, be moved between "execution contexts". These contexts could be entirely fictional in nature, not corresponding to anything that happens on the real machine. (In my formal model, we do have exactly one such execution context per physical thread. But there's actually nothing forcing us to do that. We then defined Send/Sync wrt these execution contexts.)
With things like thread_local, it is fairly clear that in Rust, Send and Sync are tied to threads, not this broader idea of an "execution context". That doesn't fundamentally stop us from introducing a concept of abstract execution contexts, but it means we'd need a new set of auto traits, e.g. CtxSend/CtxSync -- and adding new auto traits to the language has significant cost. But leaving that aside, we could let types tie their invariant to a particular "execution context", we could declare it unsound for CtxSend data to move across such an execution context boundary, and we could have a function (that doesn't actually physically do anything) that invokes a closure in a new "execution context".

T: Send would imply T: CtxSend but not vice versa.

This is all theoretically coherent, but the problem is it's a massive breaking change. send_wrapper and thread_local! would have to add CtxSend bounds to ensure that they don't accidentally move things across execution contexts. (There's no way to check at runtime in which execution context you are, so there's no way to have a sound ctx_send_wrapper.)
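Spelled out as a sketch, the hypothetical pieces would look roughly like this (nothing here exists today; the names and the no-op boundary function are made up, and auto trait syntax is nightly-only):

```rust
#![feature(auto_traits)]

// Hypothetical auto traits for abstract execution contexts. The rule
// "T: Send implies T: CtxSend" would hold, but not the converse.
pub unsafe auto trait CtxSend {}
pub unsafe auto trait CtxSync {}

// Purely fictional boundary: does nothing at runtime, but CtxSend-based
// reasoning would treat the closure as running in a fresh execution context.
pub fn with_new_execution_context<R>(f: impl FnOnce() -> R + CtxSend) -> R {
    f()
}
```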
I think all this gives me a better understanding of the underlying issue, but I think this is only solvable in two ways (even if we had a time machine):
- With a global cost to the entire ecosystem, adding two new auto traits and requiring trait bounds for things like thread_local! and send_wrapper.
- With a way to make the "things that expose thread identity" respect "execution contexts" (basically removing the distinction of Send and CtxSend). This requires ensuring that the thread ID and thread-local state inside a new execution context is independent from the outside. I don't know if there's any way to implement this that is cheaper than spawning a thread. I have no idea how thread-local storage is actually implemented.^^ But on Linux at least, the thread ID is AFAIK an ID managed by the kernel, so it can't easily be "faked". We could potentially fake what Rust's Thread::id returns but that would probably have bad consequences for code that needs this to be the kernel ID, and FFI code would ofc still see the actual thread ID. Even if we ignore thread IDs (declaring send_wrapper unsound for neglecting to take into account the existence of "execution contexts"), what would it take to generate a fresh set of thread-local state? Is that even possible?

So... it's not looking good. :/ You are asking for a rather big language change, and even with the benefit of hindsight it's unclear if Rust 1.0 could have reasonably done any better.
With a way to make the "things that expose thread identity" respect "execution contexts" (basically removing the distinction of Send and CtxSend). This requires ensuring that the thread ID and thread-local state inside a new execution context is independent from the outside. I don't know if there's any way to implement this that is cheaper than spawning a thread. I have no idea how thread-local storage is actually implemented.^^ But on Linux at least, the thread ID is AFAIK an ID managed by the kernel, so it can't easily be "faked". We could potentially fake what Rust's Thread::id returns but that would probably have bad consequences for code that needs this to be the kernel ID, and FFI code would ofc still see the actual thread ID. Even if we ignore thread IDs (declaring send_wrapper unsound for neglecting to take into account the existence of "execution contexts"), what would it take to generate a fresh set of thread-local state? Is that even possible?
I think the FFI angle is actually the biggest problem. Because everything else is already done as part of creating threads, whether in the kernel or in the language runtimes. Whether it is actually so much cheaper as to be worth the hassle is of course an interesting follow-up question.
This also somewhat reminds me of the mmap discussion and a missing MAP_SNAPSHOT flag/behaviour: If we had co-evolution of kernel and language, I think this would not even need to be a language change in the strict sense, i.e. the kernel could have a variant of clone which sets up the thread context without an independent flow of execution, so that even the lowest-level C API would think it lives on this "as-if" thread. Of course, this alone would not help implementing this on other systems, but there is always the fallback of actually starting a thread.
(Also, there were user mode threading libraries on Linux before the whole NPTL thing. Not sure how well that worked without integration with the C library though...)
(Concerning the async use case, while always using a new thread might avoid smuggling data, I guess the even better thing would be to "reify" the thread state so that the runtimes could pass it around with the tasks and install it before polling them. This would make thread identity fully programmable. But indeed, a rather hard problem without kernel co-evolution to make it happen at the lowest level.)
EDIT: Just to add: Yes, personally, I very much do not want to add CtxSend/CtxSync and split the ecosystem. I rather want this to be a kernel and library change than a language change.
I am really torn now whether I should feel bad because I am advocating for setjmp/longjmp on the thread level, or feel nice because I want the Linux kernel to grow a call/cc.
Can you link to a description of why you even care about this kind of data smuggling in the first place? This does strongly feel like an XY problem, ultimately what I think you really want is to reason about the "capability" of having acquired the GIL, or something like that. You are just currently implementing that "capability" in a particular way and then there's problems with that implementation. If Rust had more native support for such "capabilities" then maybe you wouldn't need to worry about data smuggling?
See also my post here.
- With a global cost to the entire ecosystem, adding two new auto traits and requiring trait bounds for things like thread_local! and send_wrapper.
I don't think that would necessarily be all that bad, if we had a time machine.
Can you link to a description of why you even care about this kind of data smuggling in the first place?
I think the best write-ups of the status quo for PyO3 are https://pyo3.rs/v0.20.0/types and the documentation of mod marker and its contents.
And yes, underlying this is most likely a "lack of capabilities" kind of issue, i.e. we have reified holding the GIL into a zero-sized token type Python<'py>, but having to present that token for each and every call which interacts with the Python interpreter is highly unergonomic. This is why we have GIL-bound references &'py T, which are created from reference-counted pointers into the Python heap Py<T> using that token and can then be used freely to interact with the interpreter (or even to produce a new token if you only have such a reference, which is highly useful when implementing traits).
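For readers not familiar with PyO3, the pattern looks roughly like this (a sketch against the 0.20-era GIL-Refs API; sum_lengths is a made-up example function):

```rust
use pyo3::prelude::*;
use pyo3::types::PyList;

fn sum_lengths(obj: Py<PyList>) -> PyResult<usize> {
    // `Python::with_gil` produces the zero-sized GIL token `py: Python<'py>`.
    Python::with_gil(|py| {
        // Presenting the token turns the reference-counted `Py<PyList>` into a
        // GIL-bound `&'py PyList`, which can then be used without re-passing `py`.
        let list: &PyList = obj.as_ref(py);
        Ok(list.len())
    })
}
```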
So as @davidhewitt mentioned over at PyO3, if we had language support for "holding the GIL" as a context/capability so that user code would not have to pass the GIL token manually, this problem would likely go away or at least look very different. So in that sense, I agree with the description as an XY problem.
But then again, in the long-term, the GIL is apparently on its way out and maybe we should not cater to it too much. And the async runtime use case probably does need more than just context/capability support because it is not just a syntactical problem there. Finally, I think reifying implicit context and exposing it to user code is often useful beyond its immediate applications, so that in summary I think reifying thread identity in Rust is something worth discussing even if PyO3 didn't exist.
See also my post here.
I think this fits well with how @RalfJung is discussing it and how this issue is named.
I personally am probably looking for "thread context as a value" or "first class threads", i.e. making the underlying OS service more programmable rather than adding more language to create new kinds of abstractions. (Rust getting even more complex, oh my...)
Of course, wanting something does not make it sensible or feasible or actually implement it, so I guess I'll have to take my O_PONIES for a walk... (Linux is still lacking an API to create a process/thread, get a handle and modify it through that after all, having to resort to fork/exec or clone and its army of flags to implement posix_spawn.)
See also my post here.
So the idea I posted above has indeed been posted twice before in the last few days; in your post and here. Interesting. :D
I don't think that would necessarily be all that bad, if we had a time machine.
I am not sure the lang team would have accepted the cost of two more auto traits. But I guess that's kind of pointless speculation anyway.
I personally am probably looking for "thread context as a value" or "first class threads", i.e. making the underlying OS service more programmable rather than adding more language to create new kinds of abstractions. (Rust getting even more complex, oh my...)
Well you are asking for it to get even more complex. ;) You are just starting from the operational semantics side and Steffahn is starting from the type system side. You are talking about the same thing though. All that effect system business comes up because they are trying to retrofit this into an existing language.
That only takes care of TLS though, not thread IDs. However, I had a look at the docs, and actually the Rust thread ID is already not the OS thread ID. It's a unique ID generated with a global Rust-private counter. So we could totally say that each "execution context"/"capsule" gets its own thread ID. If we tie that together with the ability to say "this code does not use TLS", then this entire thing could be done without new auto traits -- using the existing Send/Sync traits to bound the closure running in the "as-if" thread, giving that "thread" a new thread ID, and not letting it access TLS at all. (Though not being able to access TLS might be a serious limitation.) We'd have to declare that one may not use OS-level thread identifiers to make conclusions about Send/Sync, which is a bit odd, but also, the existence of the Rust thread ID makes it unlikely that people would rely on that I think?
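For example, on Linux the two identifiers visibly come from different places (a quick sketch, assuming the libc crate is available):

```rust
fn main() {
    // Rust's ThreadId comes from a process-global counter in std...
    let rust_id = std::thread::current().id();
    // ...while the kernel's thread ID is a separate, kernel-managed number.
    let os_tid = unsafe { libc::gettid() };
    println!("Rust ThreadId: {rust_id:?}, kernel tid: {os_tid}");
}
```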
So the idea I posted above has indeed been posted twice before in the last few days; in your post and here. Interesting. :D
Not a coincidence that it was "twice" - the IRLO thread was a comment thread to matklad's post. Even though I've described mostly ideas I've had for myself a few months ago, I've expressed them there now because it seemed like a highly similar / related idea.
I had also opened the new pyo3+scoped-tls issue which I'd thought up in the context of looking back at the topic due to my reply on IRLO, so that issue coming up at this same time is also no coincidence.
Well you are asking for it to get even more complex. ;) You are just starting from the operational semantics side and Steffahn is starting from the type system side. You are talking about the same thing though. All that effect system business comes up because they are trying to retrofit this into an existing language.
I am not sure. In my little head, the language does not change at all and I just get a function to replace the current thread ID/TLS block, putting it behind an opaque handle and atomically installing a new thread ID/TLS block on the current OS thread, something like fn replace_thread_identity(new: ThreadIdentity) -> ThreadIdentity;. This is surely hard to implement and to work out any unintended consequences of these runtime changes, but ideally it would not require e.g. any new auto traits or changes to the existing semantics of e.g. thread_local!.
If we tie that together with the ability to say "this code does not use tls", then this entire thing could be done without new auto traits -- using the existing Send/Sync traits to bound the closure running in the "as-if" thread, giving that "thread" a new thread ID, and not letting it access TLS at all.
In that thread-identity-can-be-replaced world, I think one would not need to prohibit TLS as its scope could be controlled, e.g. the TLS handle could be installed by the async runtime to follow a task as it migrates between OS threads. Or PyO3 could always install a fresh one to close any unwanted side channels.
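As a purely hypothetical sketch (none of this exists; the ThreadIdentity type, replace_thread_identity, and poll_with_identity are made up), an executor could use such a function like this:

```rust
// Entirely hypothetical API surface; nothing like this exists in std today.
pub struct ThreadIdentity { /* opaque: thread ID + TLS block */ }

pub fn replace_thread_identity(_new: ThreadIdentity) -> ThreadIdentity {
    unimplemented!("hypothetical runtime facility")
}

// An executor could make a task's TLS and thread ID travel with the task:
// install the task's identity before polling, restore the worker thread's own
// identity afterwards, regardless of which OS thread runs the poll.
pub fn poll_with_identity<R>(task_identity: ThreadIdentity, poll: impl FnOnce() -> R) -> (ThreadIdentity, R) {
    let previous = replace_thread_identity(task_identity);
    let result = poll();
    let task_identity = replace_thread_identity(previous);
    (task_identity, result)
}
```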
By the way, the origin crate has a pure Rust implementation of the thread lifecycle which includes TLS setup: https://github.com/sunfishcode/origin/blob/main/src/thread/linux_raw.rs It does involve a bit of linker support for statically initialized TLS variables, but otherwise it seems to be something the standard library could implement in user mode. The TLS block just has to be placed at the right address relative to the TCB, so that replacing it would involve memcpying the TLS block elsewhere.
This looks somewhat related as well: https://github.com/rust-lang/rust/pull/117873
(For the async runtime use case, I think this could even become an optional effort/optimization: Send tasks don't care about thread identity and can always run with the original OS thread's identity (as they do now), and only non-Send tasks would need to bring their own TLS block which needs to be installed before they are polled. Theoretically this overhead could also be lowered by grouping non-Send tasks to share a thread identity, which would be fine as long as the relationship is stable.)
I am not sure. In my little head, the language does not change at all and I just get a function to replace the current thread ID/TLS block, putting it behind an opaque handle and atomically installing a new thread ID/TLS block on the current OS thread, something like fn replace_thread_identity(new: ThreadIdentity) -> ThreadIdentity;. This is surely hard to implement and to work out any unintended consequences of these runtime changes, but ideally it would not require e.g. any new auto traits or changes to the existing semantics of e.g. thread_local!.
Ah, I see. I think Steffahn was working under the assumption that that's not possible with reasonable performance.
This is waay outside my area of expertise so I can't really say whether it is realistic or not. Remember this needs to be supported on all platforms Rust compiles to.
Having had time to read that now, I think what I am trying to propose is also hinted at in question 3 of @matklad's https://matklad.github.io/2023/12/10/nsfw.html#Four-Questions
I think this is only solvable in two ways (even if we had a time machine):
in the Time Machine category, I think there's a third self-consistent way: only expose thread locals as a bare-bones #[thread_local] static, unsafe a-la static mut.

That's drastic, but, given that thread locals are a niche feature (there are like entire languages without thread locals), and that "fixing" Send removes the need to track sendness of futures and thus the need to have return type notation and such, I would be ready to vehemently argue for this setup in 2014! Will tackle that right after the Hawking party, if I ever come by a Time Machine!
For Rust to be a viable alternative to C or C++, it absolutely must support TLS. Even rustc itself uses it. So that is not an option I think.
A bare-bones #[thread_local] static would be supporting TLS. We avoided the C++ issues with ordinary statics having runtime initializers and destructors (by pushing that onto users), so we could have done the same with thread locals too, no?
That would not have helped, you can implement "lazy TLS" on top of barebones TLS. You need something slightly more tricky anyway to avoid references outliving the current thread.
Ah you need destructors yes. But you do need those I think, after all TLS gets deallocated. Regular statics avoid that issue by virtue of never being deallocated.
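To make the "lazy TLS on top of bare-bones TLS" point concrete, a nightly-only sketch using the unstable #[thread_local] attribute (SLOT, lazy_get and expensive_init are made-up names; destructors are exactly what this sketch does not handle):

```rust
#![feature(thread_local)]

use std::cell::Cell;

// A bare-bones thread-local: no lazy initialization, no destructor support.
#[thread_local]
static SLOT: Cell<Option<u32>> = Cell::new(None);

fn expensive_init() -> u32 {
    42 // stand-in for real initialization work
}

// "Lazy TLS" layered on top of the bare-bones static.
fn lazy_get() -> u32 {
    match SLOT.get() {
        Some(v) => v,
        None => {
            let v = expensive_init();
            SLOT.set(Some(v));
            v
        }
    }
}
```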
For Rust to be a viable alternative to C or C++, it absolutely must support TLS.
I'd say that's an argument in favor of bare-bones, unsafe, but zero-cost thread locals ^^ The Rust thread_local macro is safe, but it comes with runtime overhead. Up until very recently (when we added const-initialization to thread locals) it was impossible to implement a performant global allocator in Rust, because, for an allocator, you need fast TLS:
https://matklad.github.io/2020/10/03/fast-thread-locals-in-rust.html
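For reference, the const-initialized form looks like this on current stable (a minimal sketch of the kind of per-thread counter an allocator might keep; BYTES_ALLOCATED and note_allocation are made-up names):

```rust
use std::cell::Cell;

thread_local! {
    // `const` initialization: no lazy-initialization check on access,
    // unlike the lazily initialized form of thread_local!.
    static BYTES_ALLOCATED: Cell<usize> = const { Cell::new(0) };
}

fn note_allocation(size: usize) {
    BYTES_ALLOCATED.with(|counter| counter.set(counter.get() + size));
}
```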
Even if they are unsafe, they need a reasonable safety contract. So I don't think that would have helped; we would not have included Sync in that contract since, well, it's thread-local.
A slight alternative to the CtxSend/CtxSync auto trait idea would be to not enforce a relation with regular Send/Sync; this way SendWrapper wouldn't need to change anything (it implements Send but not always CtxSend).

Adding these traits would only break crates that rely on thread-local dynamic types, since all existing concrete (non-dynamic) types would already implement CtxSync. I'm not sure how common this is.
Another hacky idea that doesn't rely on language changes would be to lean into the idea that non-'static data can't be smuggled without external crates, have a third-party crate define CtxSend/CtxSync, and have scoped-tls and scoped-tls-hkt depend on this crate to add bounds to their types. To make things even hackier, the crate defining CtxSend/CtxSync could have a feature flag and make CtxSend/CtxSync normal traits with blanket implementations for all types when the feature is disabled. This would allow scoped-tls and scoped-tls-hkt to not require the feature (and therefore not need nightly) by default, but allow crates that rely on CtxSend/CtxSync for soundness to require the feature. This would be abusing the feature resolver, since the feature isn't additive, and having a dependency require the feature could cause an unrelated crate to stop compiling if it used scoped-tls with a dynamic type.
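A sketch of how such a crate might be laid out (the crate layout and the "strict-ctx" feature name are made up; the auto trait side is nightly-only):

```rust
#![cfg_attr(feature = "strict-ctx", feature(auto_traits))]

// With the feature disabled, CtxSend is a no-op trait that every type
// implements, so downstream bounds compile on stable and cost nothing.
#[cfg(not(feature = "strict-ctx"))]
pub trait CtxSend {}
#[cfg(not(feature = "strict-ctx"))]
impl<T: ?Sized> CtxSend for T {}

// With the feature enabled (nightly only), it becomes a real auto trait that
// types like SendWrapper or scoped-tls keys could opt out of.
#[cfg(feature = "strict-ctx")]
pub unsafe auto trait CtxSend {}
```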
I also thought of more use cases for CtxSend/CtxSync:

1. A `ScopeCell` that hands out `&mut T` from a shared reference:
```rust
use std::cell::UnsafeCell;

// `CtxSend` and `PhantomNotCtxSync` are the hypothetical trait and
// !CtxSync marker field discussed above.
#[repr(transparent)]
pub struct ScopeCell<T: CtxSend>(UnsafeCell<T>, PhantomNotCtxSync);

impl<T: CtxSend> ScopeCell<T> {
    pub fn with_mut<U>(&self, f: impl FnOnce(&mut T) -> U + CtxSend) -> U {
        // Safety:
        // - the closure cannot capture a copy of `self`, since it implements `CtxSend` but `&Self` doesn't
        // - `self` cannot contain a self-reference, since `T` implements `CtxSend` but `&Self` doesn't
        // - a copy of `self` cannot be taken from thread-local memory, since `&Self` doesn't implement `CtxSend`
        //
        // Since it is impossible for the call to `f` to access a copy of `self`,
        // it can't call `with_mut` again to get an aliasing mutable reference.
        unsafe { f(&mut *self.0.get()) }
    }
}
```
Are there any other reasons this would be unsound?
2. A thread-local stack allocator I wrote about at https://users.rust-lang.org/t/unsafe-code-review-thread-local-stack-allocator-second-stack2-stacked-borrows-violation/115348, but which I now realize would be unsound without using `CtxSend`.
It seems like there are several cases where APIs would be sound if Rust didn't have thread-local state. scoped-tls expands on that by allowing non-'static types to be stored in TLS.