Closed Zoxc closed 6 months ago
If you're not explicitly messing around with fibers on a thread, I think threads and fibers are practically the same (the thread could be seen as having exactly one fiber, I guess; or maybe thread and fiber are the same). Also, the FLS functions seem to be consistently used for "per-thread" data, so it's plausible that even when dealing with fibers, things may work out OK. Whether anyone did that, I don't know. If you happen to work with multiple fibers, and ran into problems, maybe describe what's happening?
Thanks @zocx, @res2k. When mimalloc is statically linked, the fiber local storage is used to detect when a thread is terminated.
Thread start: how do we detect a thread start? Well, we don't :-) -- except by checking the thread local variable _mi_default_heap
and initializing it if it is pointing to the default empty heap. So, on a first allocation in a new thread this is done (and all fibers associated with that thread share the thread local _mi_default_heap
). The initialization also calls the prim_thread_associate_default_heap
which sets the fiber local variable to a non NULL
value -- this is for the initial fiber associated with this thread that allocates. Any further fibers associated with this thread won't set the value since the heap itself is still thread local (and is initialized now).
When any fiber is terminated, the mi_fls_done
is called with the current fiber local variable but only when it is not NULL
the mi_thread_done
is called -- so only for the initial fiber the _mi_thread_done
is called.
This works fine -- except if someone would explicitly create a new fiber (or fibers) to run on and then terminate the initial fiber associated with the thread... not sure how we can fix that robustly, or if that is worth fixing. I would say just don't do that and always keep the initial fiber alive while the thread is alive, and only terminate fibers that are subsequently explicitly created. Or link mimalloc dynamically as a DLL as in that case the DLL_THREAD_DETACH
messages are used (and none of the fiber things apply)
Thanks for bringing this up, always good to rethink how this works. Maybe we should keep a thread local fiber count to fix the last point although I am not sure it would be worth fixing since fibers are hardly used nowadays.
There's also this trick / (in chromium). That's what I'm using in my Rust rewrite of mimalloc. It's also helpful in ensuring that the callback happens after the callback which destroys Rust's standard library's thread locals. There's apparently an issue with loaded DLLs though.
There's also this trick / (in chromium).
The thing is, FlsAlloc()
allows to specify a callback, which is essentially a "thread end callback", at least in the (typical) case that fibers aren't really used explicitly. Additionally, since it's part of the Windows API, it's much easier to use this functionality across different compilers than your linked trick, which, as given, would only work on MSVC, perhaps clang, but would certainly need something else for gcc.
Yes, as @res2k mentions, using the special data segment only works for MSVC; mimalloc actually uses that for process initialization detection for static libraries using msvc (see the end of init.c
, using the .CRT$XIU
data section). As such, the fiber API seems the most robust solution (if we use static linking, otherwise using DLL thread detach seems best). Do you happen to have a link to the rust trouble with thread locals?
Are you sure about mingw / GNU not supporting that section? I don't see a fallback path for GNU in Rust's standard library. It's kind of MSVC 6 or UCRT based anyway.
Do you happen to have a link to the rust trouble with thread locals?
Not sure what you mean here. Here's the implementation of thread local destructors in Rust's standard library if that's relevant.
Are you sure about mingw / GNU not supporting that section? I don't see a fallback path for GNU in Rust's standard library. It's kind of MSVC 6 or UCRT based anyway.
Well, to be sure, I tried it with Compiler Explorer: https://godbolt.org/z/4Gv9qdGKK
Unfortunately, MSVC binary execution is not supported on CE. However, compiling & running locally, the output/return code is 1, implying the "thread callback" was executed.
Binary execution is supported for MinGW gcc and clang cases and can be seen directly in CE. It is, in both cases, 0 - ie the callback didn't run.
Also unsurprisingly, enabling warnings for unknown pragmas reports the #pragma comment
s.
It seems to work if you properly specify the segment with __attribute__((section(".CRT$XLB")))
.
Nice -- yes, I think as long you link with the UCRT (the microsoft shared libc) and emit the right linker sections it should work as I think it is eventually just the UCRT that will inspect the linker sections and call the functions in there. It won't work with other libc's though but I guess none exist for Windows (?). As such, this technique should work robustly as well I think, especially if Rust uses it as well :-)
(I wont switch (yet) to this solution for now though -- maybe it won't work with older libc's before UCRT, and I would need to test more as it might change when the tls exit functions are called relative to the FlsAlloc solution.)
(also, I wonder how the UCRT is able to call the TLS exit routines reliably... it must somehow be notified as well ? Ah, I guess UCRT is always a dynamically linked and thus gets DLL_THREAD_DETACH messages.. it would be good to check how this works though)
I am going to close this issue, but thanks for the in-depth analysis!
(I wont switch (yet) to this solution for now though -- maybe it won't work with older libc's before UCRT, and I would need to test more as it might change when the tls exit functions are called relative to the FlsAlloc solution.)
I know this issue is closed but wanted to chime in and confirm that this technique does work on older libc, all the way back to MSVCRT 7 at least.
I use it in a fork of mimalloc that's embedded into a project that is used in an MMORPG client that for legacy reasons, still has a few users running Windows XP (!). Since we use mimalloc as our primary allocator, rather than maintain two builds I patched mimalloc to work on XP by replacing the FLS usage with the _tls_callback method, and verified that the callbacks do indeed execute on a real XP machine.
It's actually slightly cleaner because _tls_callback isn't called on the main thread, so mimalloc doesn't need to jump through hoops to avoid calling _mi_thread_done on the main thread (issue 208). But it may limit compiler support because of the need to emit a special section.
Changeset against 2.0.6: https://github.com/Number5ix/cx/commit/0090303599fed15d9310ddf3b6531420755f049c
prim.c only changes for 2.1.2: https://github.com/Number5ix/cx/commit/79e8605ac4844b186059cce2973761fb7bc206bc
Obviously I don't expect anything like that to be supported upstream since... yeah... it's XP 😉. But thought you might be interested to know.
(also, I wonder how the UCRT is able to call the TLS exit routines reliably... it must somehow be notified as well ? Ah, I guess UCRT is always a dynamically linked and thus gets DLL_THREAD_DETACH messages.. it would be good to check how this works though)
To that point... I have no idea. We statically link both mimalloc and the C runtime on the 32-bit XP compat build (v141_xp toolchain) and it still works somehow.
On Windows, mimalloc uses fiber locals to detect the end of a fiber, then uses that to free the thread local heap, instead of freeing the thread local heap when the thread ends.