rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.73k stars 12.64k forks source link

`thread_local!` dtor registration can pass wrong `__dso_handle` if dynamically linked #88737

Open thomcc opened 3 years ago

thomcc commented 3 years ago

On linux and some other platforms, libstd uses __cxa_thread_atexit_impl to register destructors for thread locals.

https://github.com/rust-lang/rust/blob/fdf65053e99e8966f9bd83b5a8491326cb33d638/library/std/src/sys/unix/thread_local_dtor.rs#L39

The last argument to this is &__dso_handle, but this is only correct for threadlocals which are inside libstd, or are in code that is statically linked with libstd.

__dso_handle is a magic symbol which has a value that is unique to whatever DSO is references it. That is, if libfoo.so and libbar.so both look at __dso_handle (or &__dso_handle, which is really the value you use), it will have a different value in each.

Yes, I know this isn't how things usually work, this is the entire point of __dso_handle (technically, it behaves as if a symbol with "hidden" visibility named __dso_handle were declared inside each DSO automatically, see https://itanium-cxx-abi.github.io/cxx-abi/abi.html#dso-dtor-runtime-api, although this is clearly a hack).

It's behaves very slightly differently depending on a number of moving pieces (libdl, libc, libcxxabi, the linker, the runtime loader, ... — collectively I'm going to call these "the runtime"), and shows up in a couple different APIs, but here it's being used to remember that that DSO has a pending thread-local dtor, which prevents the DSO from being unloaded until after the said dtors are all run (when all the threads in question are closed).

So, to the point: libstd always registers this using a __dso_handle which is linked from inside itself. This defeats the point of the symbol, as now "the runtime" believes that the DSO containing libstd is the one responsible for the dtor. This can cause problems in scanarios where libstd is dynamically linked, and dlopen/dlclose is used to dynamically load rust code. (See "Memory unsafe scenario" for why)

I believe the ideal fix here is to have thread_local! expand to contain the extern for __dso_handle on these systems. Then, &__dso_handle would be passed in as an argument to the call to unix::thread_local_dtor::register_dtor. I don't know how this interacts with weak symbols, but I'm sure this can be made to work.

(This is... inconvenient, but it's not that surprising — if library code could be the source for this value, there'd be no need for it to get passed in)

Memory unsafe scenario

Concretely, I think this can lead to a concerning memory unsafety problem in the following scenario:

  1. libstd is dynamically linked into a program.

  2. Some rust library (which also dynamically links libstd) is loaded via dlopen. Let's call this libmycrate.so for concreteness.

  3. libmycrate.so contains a thread_local! (mycrate::THE_THREAD_LOCAL) that needs its dtor to be registered.

  4. A thread T0 is spawned, and T0 calls some function in libcrate.so.

  5. This function references mycrate::THE_THREAD_LOCAL, which causes the destructor is registered via __cxa_thread_atexit_impl (inside std::sys::unix::register_dtor)

    • Note: In this hypothetical, no other threads have registered dtors for mycrate::THE_THREAD_LOCAL.
  6. The library libcrate.so is unloaded via dlclose. This is prior to T0 ending, and it is not the last Rust crate to be unloaded.

  7. Later T0 is joined, which runs the thread-specific destructors. This includes mycrate::THE_THREAD_LOCAL's dtor, despite the fact that it has been unloaded.

    • That is, the memory for the dtor function (and internal functions it calls) may be unmapped/freed/in use as something else.
    • Any static data accessed and the like may similarly be no longer alive

Note: between 6 and 7, some time may have to pass; dlclose is often performed in the background. Also, I'm assuming in this situation that libmycrate.so hasn't done anything else to prevent being unloaded. Finally, often the memory from the library is pushed onto a free list for later use, rather than actually being unmapped.

Anyway, this is concerning because:


On the other hand, this doesn't exist from purely safe stdlib APIs — someone had to unsafely call dlclose (perhaps by Droping a libloading::Library), so it's on them.

While I don't find this style of argument compelling, it unfortunately has to be the answer to some extent. We can't fix this everywhere, as only some platforms allow defending against this by accepting an equivalent to &__dso_handle.

That said, this is clearly an example of us passing the wrong value, and I suspect there aren't really great arguments against fixing it. I think this is actually quite a bit of a footgun on platforms where it can't be addressed, but probably the solution is to somehow let people know that dlclose (and equivalent) are extremely spooky.

thomcc commented 2 years ago

@rustbot modify labels: +A-thread-locals +T-libs

zifeo commented 2 years ago

@thomcc I believe I am facing this behaviour with a cdylib loaded from Deno. Only happens on Deno tests with dev compilation (release is fine). Issue does not arise on macOS (arm) but can be found on Debian and WSL2 Debian.

However I have really a hard-time making sure this is indeed affected by your scenario. Is there some tests I can operate to dig further (I am not familiar with Rust internals)? Is there a workaround known?

thomcc commented 2 years ago

The best workaround is for you to not unload dynamic libraries, if at all possible. Honestly, we don't really guarantee that it's supported for libstd (it should be for libcore/liballoc), and it's extremely dangerous to unload a dynamic library that doesn't support being unloaded.

nagisa commented 2 years ago

Woah. I can’t believe it is only now that I’m seeing this. This sounds like something we should definitely look into fixing and seeing just how many of the issues this would end up resolving on the relevant platforms.

We can't fix this everywhere, as only some platforms allow defending against this by accepting an equivalent to &__dso_handle.

I believe that this sort of scenario is something that will come up as platform implementations approach some level of maturity. macOS, for example does not use the __dso_handle equivalent. Instead they recently started implicitly leaking any dlclosed DSOs that contain code for any thread-local destructor. Ultimately if the platform cannot support that sort of use-case well, there’s nothing we can do about it, but it is definitely worth it to make sure that the platforms which support this sort of stuff work well.

thomcc commented 2 years ago

So this is actually much more broken than this issue lets on. I've had doing a writeup on it on my TODO for a bit now, so I guess it's time. The TLDR is:

  1. The case in this issue (where only part is unloaded) is busted but we probably don't want to support it anyway. If we fix the bug in this issue, it will become very rare, as dlclose impls will often just skip a lib which has registered a TLS dtor.

  2. The more general case where Rust is fully unloaded is often broken too (https://github.com/rust-lang/rust/issues/91979), as on many targets we pass function pointers to things like pthread_key_create. If we are unloaded before they run, then boom.

  3. I don't see a complete fix for 1, but we do some things that make it worse than it needs to be. On some targets we can fix 2, but I don't think on all.

    Perhaps the solution is to better document the fact that unloading libraries is very dangerous, especially multithreaded ones (like all rust libraries), and double-especially if they don't promise to be unloadable, and you've called into them. This doesn't really help cases like Node.js, though...

The longer version follows:

Partial unloading is broken and nonsensical.

So, the partial-unload case seems really too busted to be supported at all. That is, the case in this issue, where "librustcode.so gets unloaded but the libstd.so stays loaded, and the former passed functions to the latter, such as TLS dtors".

Like, there's not much we can do about the fact 'static makes promises that can't be upheld by code that may be unloaded. The downside here, it's kind of the worst case of UB for exploits. Thread locals for types from the which didn't get unloaded (things from the stdlib) will have their Drop run on stale memory, and everything else are stale function pointers which will be invoked. To make things worse, it's non-deterministic too.

The saving grace here is that if we pass the __dso_handle from the right library, that library will in practice just not be unloaded even if dlclose is called. So we should really be sure to do that, since it completely fixes that issue and makes it so that &'static does what it says. (That is, correct use of __dso_handle basically prevents partial unloading like this from happening, although it's rare in any case)

Full unloading is broken too, but trickier

Sadly, for the case where were std and user code are statically linked, we do what we can in some cases, but it still leaves many platforms where we actually hit UB when libstd is unloaded. This means that on many platforms (probably most), Rust cdylib crates are essentially always unsound to unload (unless they're no_std)

For example, this is true on any platform where we're using pthread keys for TLS (it's probably the same story on any non-static TLS platform), as we pass a function pointer to libc::pthread_key_create, and it gets called at the end of the thread, even if we've been unloaded. It's basically the same as the classic "passed dtor to atexit then got unloaded" case -- just it was pthread_key_create not atexit. Libraries that do this must not ever be unloaded, which includes anything that statically links to libstd on these platforms. That said, it's common for concurrent C code to have this issue too -- as it also uses thread locals more heavily.

That said, in some cases we can do a better thing here, but don't[^old], and you're mistaken that macOS doesn't have a solution for this.

On macOS, the _tlv_atexit function (which we're already using) takes the the __dso_handle (which is a pointer to the mach header) as the second arg, and would do the right thing if we passed it in. Some care has to be taken as our current scheme (to avoid calling _tlv_atexit from a thread that's currently being destroyed) might need tweaking, especially if we take __dso_handle from the site of the thread_local! use. (Note that while _tlv_atexit can't be used on iOS, apps that use dlopen tend to get told not to do so -- it's allowed, just highly discouraged, so doubtful that this kind of UB is likely)

[^old]: Much of this code seems to date back to being a green thread runtime, so its probably justified that it wasn't thinking about things like this.

FWIW, I believe Windows has a relevant story here too. I think it has similar issues for dynamic TLS, and different issues ones for static TLS? But I may be mis-remembering, @ChrisDenton would know more, as a while ago, I reached out to ask about the situation on Windows, and we discussed this in some detail.

Overall

So, this basically sucks, as many platforms don't offer a way to solve this. In practice most of what we get is the ability to stop the unloading from happening (totally fine -- nobody should call dlclose anyway), but only sometimes. We can improve things slightly for both cases described, and should, even if this still is a "please never do this" thing, but it still leaves many targets in a bad way.

Maybe some of them have a __cxa_thread_atexit_impl equivalent in a future OS version? This generally comes with #[thread_local]/C++ thread_local (see also https://github.com/rust-lang/rust/issues/91659), as C++ wants a way to run dtors for TLS and still be unloadable. Note that on several of these platforms, C++ code that tries to use that language feature just can't be compiled, though.

Alternatively, the main reason we want to replace what we have now with __cxa_thread_atexit_impl-likes is that it prevents the module from being unloaded at all. Perhaps there's some other way to force that (somehow bumping the refcount?) Didn't find anything that seemed , when I looked, though.

Anyway, while did a bunch of research on this at one point, dlclose has always been completely broken in the face of thread local storage, so a lot of things just say not to do it, which is to say, I could totally be missing something! It's totally possible that some other mitigation exists that would allow avoiding the problem in a way that is more safe.

[^2]: Well, on many targets it will actually not unload anything. But for the most part, in the cases where it does unload, it's probably broken, likely very.


P.S. One last bug in the __dso_handle code thats worth noting/fixing: Because of the foot-guny way extern_weak works, it actually already provides the symbol address or null, so the &__dso_handle is actually wrong! __dso_handle is already either null or a pointer to the __dso_handle symbol. That is, it already had its address taken (because in principal it could be a pointer to something that is not zeroable), and the way we're using it is one indirection too many. I... kind of hate this, because it looks wrong and is super subtle, but I suppose it makes the extern_weak-on-function-pointer case slightly better.

thomcc commented 2 years ago

Instead they recently started implicitly leaking any dlclosed DSOs that contain code for any thread-local destructor

To be clear, I think this is good behavior, and why we should fix code here to pass in the right __dso_handle (so that it gets leaked too). Notably, if the dlclosed DSO is leaked in that case, then 'static actually functions as advertised, and that code doesn't have a huge memory safety problem just because someone mistakenly assumed they could unload it.

Anyway, dlclose is fundamentally broken in the face of thread local destructors. I mean, how is it supposed to work? None of the options of seem good: wait for the thread to join (it might never join), destroy the object early (difficult to do on the right thread, may cause races or deadlocks, and some libraries[^1] may not expect a thread local dtor to run way before thread shutdown ...), etc.

[^1]: I would guess that some of libstd's thread_local_dtor code does not handle that correctly in at least one of the (several) configuration, nor does it need to.

ChrisDenton commented 2 years ago

The Windows side of things is something I have been meaning to go into more when I have some time to really grapple with it. However, I feel like Windows is different enough that I would need to provide quite a bit of background information to avoid misunderstandings. For now I've quickly dashed off an overview, though it doesn't really go into all the details. It may also contain errors, but I hope not.

Windows TLS Primer.

thomcc commented 1 year ago

Another issue that I thought I mentioned here but dont seem to have is that #[linkage = "extern_weak"] means the __dso_handle already has a level of indirection, so &__dso_handle is extra wrong.