microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.
MIT License

Mismatched use of fiber and thread locals #869

Closed Zoxc closed 1 month ago

Zoxc commented 3 months ago

On Windows, mimalloc uses fiber locals to detect the end of a fiber, then uses that to free the thread local heap, instead of freeing the thread local heap when the thread ends.

res2k commented 3 months ago

If you're not explicitly messing around with fibers on a thread, I think threads and fibers are practically the same (the thread could be seen as having exactly one fiber, I guess; or maybe thread and fiber are the same). Also, the FLS functions seem to be consistently used for "per-thread" data, so it's plausible that even when dealing with fibers, things may work out OK. Whether anyone did that, I don't know. If you happen to work with multiple fibers and run into problems, maybe describe what's happening?

daanx commented 3 months ago

Thanks @Zoxc, @res2k. When mimalloc is statically linked, the fiber local storage is used to detect when a thread is terminated.

Thanks for bringing this up, always good to rethink how this works. Maybe we should keep a thread local fiber count to fix the last point although I am not sure it would be worth fixing since fibers are hardly used nowadays.

Zoxc commented 3 months ago

There's also this trick / (in chromium). That's what I'm using in my Rust rewrite of mimalloc. It's also helpful in ensuring that the callback happens after the callback which destroys Rust's standard library's thread locals. There's apparently an issue with loaded DLLs though.

res2k commented 3 months ago

> There's also this trick / (in chromium).

The thing is, FlsAlloc() lets you specify a callback, which is essentially a "thread end callback", at least in the (typical) case that fibers aren't really used explicitly. Additionally, since it's part of the Windows API, it's much easier to use this functionality across different compilers than your linked trick, which, as given, would only work on MSVC, perhaps clang, but would certainly need something else for gcc.
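A minimal sketch of the FlsAlloc() approach described above, for illustration only. The names `thread_heap_t`, `on_fls_done`, `thread_hook_init`, and `thread_hook_arm` are hypothetical and are not mimalloc's own. Windows invokes the callback when a fiber is deleted, when a thread exits, or when the FLS index is freed, which is why it doubles as a "thread end callback" only as long as fibers are not used explicitly. Off Windows the block still compiles thanks to stand-in typedefs, but nothing is registered.

```c
#include <stddef.h>

#if defined(_WIN32)
#include <windows.h>
#else
/* Stand-ins (assumption) so the sketch compiles off Windows. */
typedef void* PVOID;
#define NTAPI
#endif

/* Hypothetical per-thread state; a real allocator would keep its heap here. */
typedef struct thread_heap_s { int released; } thread_heap_t;

/* FLS callback: receives the value stored in the slot for the fiber or
   thread that is going away. */
static void NTAPI on_fls_done(PVOID value) {
  thread_heap_t* heap = (thread_heap_t*)value;
  if (heap != NULL) {
    heap->released = 1;  /* a real allocator would free the heap here */
  }
}

#if defined(_WIN32)
static DWORD fls_key = FLS_OUT_OF_INDEXES;

/* Call once per process: allocate the FLS slot and register the callback. */
static int thread_hook_init(void) {
  fls_key = FlsAlloc(&on_fls_done);
  return fls_key != FLS_OUT_OF_INDEXES;
}

/* Call on each thread: store a non-NULL value so the slot is in use. */
static int thread_hook_arm(thread_heap_t* heap) {
  return fls_key != FLS_OUT_OF_INDEXES && FlsSetValue(fls_key, heap) != 0;
}
#endif
```

Because the slot is per fiber, a thread that switches between several fibers would need to arm the slot on each of them, and the callback would then fire once per fiber rather than once per thread. That is the mismatch this issue is about.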

daanx commented 3 months ago

Yes, as @res2k mentions, using the special data segment only works for MSVC; mimalloc actually uses that for process initialization detection for static libraries using MSVC (see the end of init.c, using the .CRT$XIU data section). As such, the fiber API seems the most robust solution (if we use static linking, otherwise using DLL thread detach seems best). Do you happen to have a link to the Rust trouble with thread locals?

Zoxc commented 3 months ago

Are you sure about mingw / GNU not supporting that section? I don't see a fallback path for GNU in Rust's standard library. It's kind of MSVC 6 or UCRT based anyway.

> Do you happen to have a link to the Rust trouble with thread locals?

Not sure what you mean here. Here's the implementation of thread local destructors in Rust's standard library if that's relevant.

res2k commented 3 months ago

> Are you sure about mingw / GNU not supporting that section? I don't see a fallback path for GNU in Rust's standard library. It's kind of MSVC 6 or UCRT based anyway.

Well, to be sure, I tried it with Compiler Explorer: https://godbolt.org/z/4Gv9qdGKK Unfortunately, MSVC binary execution is not supported on CE. However, when compiling and running locally, the output/return code is 1, implying the "thread callback" was executed. Binary execution is supported for the MinGW gcc and clang cases and can be seen directly in CE. It is, in both cases, 0 - i.e. the callback didn't run. Also unsurprisingly, enabling warnings for unknown pragmas reports the #pragma comments.

Zoxc commented 3 months ago

It seems to work if you properly specify the segment with __attribute__((section(".CRT$XLB"))).
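A hedged sketch of what registering a TLS callback through the `.CRT$XLB` section can look like for both MSVC and gcc/clang. The names `on_tls_event`, `tls_callback_entry`, and `thread_exit_count` are hypothetical. The MSVC branch follows the commonly used `#pragma` pattern and the GNU branch uses the section attribute mentioned above; neither is taken from mimalloc itself, and the exact linker behaviour should be verified on the target toolchain. Off Windows the block compiles with stand-in typedefs, but the loader never invokes the callback.

```c
#include <stddef.h>

#if defined(_WIN32)
#include <windows.h>
#define COUNT_INC(p) InterlockedIncrement(p)
#else
/* Stand-ins (assumption) so the sketch compiles off Windows. */
typedef void* PVOID;
typedef unsigned long DWORD;
#define NTAPI
#define DLL_THREAD_DETACH 3
#define COUNT_INC(p) (++*(p))
#endif

static volatile long thread_exit_count = 0;

/* TLS callback: the loader calls it with the same reason codes as DllMain. */
static void NTAPI on_tls_event(PVOID module, DWORD reason, PVOID reserved) {
  (void)module; (void)reserved;
  if (reason == DLL_THREAD_DETACH) {
    COUNT_INC(&thread_exit_count);  /* a real allocator would free its thread heap here */
  }
}

#if defined(_WIN32)
#if defined(_MSC_VER)
  #if defined(_M_X64) || defined(_M_ARM64)
    /* Force the linker to keep the TLS directory and our entry. */
    #pragma comment(linker, "/include:_tls_used")
    #pragma comment(linker, "/include:tls_callback_entry")
    #pragma const_seg(".CRT$XLB")
    extern const PIMAGE_TLS_CALLBACK tls_callback_entry;
    const PIMAGE_TLS_CALLBACK tls_callback_entry = on_tls_event;
    #pragma const_seg()
  #else
    /* 32-bit x86: C symbols carry a leading underscore. */
    #pragma comment(linker, "/include:__tls_used")
    #pragma comment(linker, "/include:_tls_callback_entry")
    #pragma data_seg(".CRT$XLB")
    PIMAGE_TLS_CALLBACK tls_callback_entry = on_tls_event;
    #pragma data_seg()
  #endif
#else
  /* gcc / clang (MinGW): place the pointer in the section explicitly. */
  __attribute__((section(".CRT$XLB"), used))
  static PIMAGE_TLS_CALLBACK tls_callback_entry = on_tls_event;
#endif
#endif
```

The `#pragma` lines are MSVC-specific, which would explain why a version relying only on them has no effect under MinGW gcc or clang.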

daanx commented 3 months ago

Nice -- yes, I think as long as you link with the UCRT (the Microsoft shared libc) and emit the right linker sections it should work, as I think it is eventually just the UCRT that will inspect the linker sections and call the functions in there. It won't work with other libcs though, but I guess none exist for Windows (?). As such, this technique should work robustly as well I think, especially if Rust uses it as well :-)

(I won't switch to this solution for now though -- maybe it won't work with older libcs before UCRT, and I would need to test more as it might change when the TLS exit functions are called relative to the FlsAlloc solution.)

(also, I wonder how the UCRT is able to call the TLS exit routines reliably... it must somehow be notified as well? Ah, I guess the UCRT is always dynamically linked and thus gets DLL_THREAD_DETACH messages... it would be good to check how this works though)

daanx commented 1 month ago

I am going to close this issue, but thanks for the in-depth analysis!

Number5ix commented 3 weeks ago

> (I won't switch to this solution for now though -- maybe it won't work with older libcs before UCRT, and I would need to test more as it might change when the TLS exit functions are called relative to the FlsAlloc solution.)

I know this issue is closed but wanted to chime in and confirm that this technique does work on older libc, all the way back to MSVCRT 7 at least.

I use it in a fork of mimalloc that's embedded into a project used in an MMORPG client that, for legacy reasons, still has a few users running Windows XP (!). Since we use mimalloc as our primary allocator, rather than maintain two builds I patched mimalloc to work on XP by replacing the FLS usage with the _tls_callback method, and verified that the callbacks do indeed execute on a real XP machine.

It's actually slightly cleaner because _tls_callback isn't called on the main thread, so mimalloc doesn't need to jump through hoops to avoid calling _mi_thread_done on the main thread (issue 208). But it may limit compiler support because of the need to emit a special section.

Changeset against 2.0.6: https://github.com/Number5ix/cx/commit/0090303599fed15d9310ddf3b6531420755f049c

prim.c only changes for 2.1.2: https://github.com/Number5ix/cx/commit/79e8605ac4844b186059cce2973761fb7bc206bc

Obviously I don't expect anything like that to be supported upstream since... yeah... it's XP 😉. But thought you might be interested to know.

> (also, I wonder how the UCRT is able to call the TLS exit routines reliably... it must somehow be notified as well? Ah, I guess the UCRT is always dynamically linked and thus gets DLL_THREAD_DETACH messages... it would be good to check how this works though)

To that point... I have no idea. We statically link both mimalloc and the C runtime on the 32-bit XP compat build (v141_xp toolchain) and it still works somehow.