microsoft / mimalloc

mimalloc is a compact general purpose allocator with excellent performance.
MIT License
10.37k stars 839 forks source link

control over non-api symbol declarations #923

Open q66 opened 1 month ago

q66 commented 1 month ago

I'm currently evaluating using mimalloc as libc allocator in Chimera Linux (https://chimera-linux.org). Currently, we have LLVM's Scudo integrated into our libc but testing shows that mimalloc may be a better fit.

However, the API of mimalloc is extensive, while I only require to expose a handful of symbols. I'd like to make the other symbols local in order to avoid polluting the static libc.a (in shared libc.so, symbol visibility takes care of it already). In order to integrate the allocator I am using the object build (i.e. everything is a single translation unit) so I can easily mark every function (except my specific allocator entrypoints) static and get what I want.

For public API symbols, handling this is easy; I just do #define mi_decl_export static and that takes care of it. For internal symbols which do not have a tag this is harder; currently I have to patch every declaration the headers. Would it be possible to have a (by default empty) macro tagging these internal symbols so I get to reduce the amount of integration patching I have to do downstream? This should apply to any functions as well as extern variables; e.g. my current mimalloc.o has

q66@chimera: /home/q66/mimalloc/build$ nm mimalloc.o|grep ' [TBR] '                             
0000000000000760 T __libc_calloc
0000000000000800 T __libc_free
0000000000000960 T __libc_malloc_impl
00000000000009b0 T __libc_realloc
0000000000000220 T __malloc_atfork
0000000000000750 T __malloc_donate
0000000000000010 T __malloc_init
0000000000000210 T __malloc_thread_init
0000000000000230 T __malloc_tls_teardown
0000000000000ba0 T aligned_alloc
0000000000000c40 T malloc_usable_size

(all of which are my symbols that hook into the libc internally, except the last two which become libc public API)

q66 commented 1 month ago

to show what kind of thing i currently have, e.g.

--- a/include/mimalloc/internal.h
+++ b/include/mimalloc/internal.h
@@ -14,6 +14,12 @@ terms of the MIT license. A copy of the license can be found in the file
 // functions and macros.
 // --------------------------------------------------------------------------

+#ifdef MI_LIBC_BUILD
+#define mi_decl_internal static
+#else
+#define mi_decl_internal extern
+#endif
+
 #include "types.h"
 #include "track.h"

@@ -55,80 +61,80 @@ terms of the MIT license. A copy of the license can be found in the file

 // "options.c"
-void       _mi_fputs(mi_output_fun* out, void* arg, const char* prefix, const char* message);
-void       _mi_fprintf(mi_output_fun* out, void* arg, const char* fmt, ...);
-void       _mi_warning_message(const char* fmt, ...);
-void       _mi_verbose_message(const char* fmt, ...);
-void       _mi_trace_message(const char* fmt, ...);
-void       _mi_options_init(void);
-void       _mi_error_message(int err, const char* fmt, ...);
+mi_decl_internal void       _mi_fputs(mi_output_fun* out, void* arg, const char* prefix, const char* message);
+mi_decl_internal void       _mi_fprintf(mi_output_fun* out, void* arg, const char* fmt, ...);
+mi_decl_internal void       _mi_warning_message(const char* fmt, ...);
+mi_decl_internal void       _mi_verbose_message(const char* fmt, ...);
+mi_decl_internal void       _mi_trace_message(const char* fmt, ...);
+mi_decl_internal void       _mi_options_init(void);
+mi_decl_internal void       _mi_error_message(int err, const char* fmt, ...);
 ...
q66 commented 1 month ago

I submitted my proof of concept integration: https://github.com/chimera-linux/cports/pull/2673

daanx commented 1 month ago

Very cool! I see what you mean -- I thought -fvisibility=hidden would take care of it but it seems that only works for shared libraries. A static object file still contains all symbols?

Already, this is a bit of trouble -- I have all these declarations in mimalloc/internal.h that need to be visible if we compile each source individually. If we would like to make all internal functions static that can only work with the single-source static build in static.c -- is that would you use now for the libc build?

I think you are asking to upstream all the internal declarations with a mi_decl_internal right? Maybe that is nice anyway; we might be able to also remove the -fvisibility=hidden flag in that case as well. Let me ponder a bit on this. Thanks!

q66 commented 1 month ago

that is what i am doing, the single static.c build and then mark every function decl static outside of my few entrypoints; everything then becomes local symbols, and they get stripped and do not conflict with anything

i'm just about ready to land the PR soon (i'm mostly tweaking option values now, but i'm already running it on my machines, desktop and all)

-fvisibility=hidden is still good to have, at least for actual library builds (primarily shared but also static)

q66 commented 1 month ago

or rather, instead of using static.c directly, i have my own custom wrapper source for it: https://github.com/chimera-linux/cports/blob/58230387a299567022c6d89a42e636542816fb47/main/musl/files/mimalloc.c

this lets me expose the glue cleanly as well as tweak some stuff (i found that fully hardened build is actually broken and significantly slower in some scenarios, but i still want things like freelist encoding in place etc.)

q66 commented 1 month ago

i have landed this by the way; i think the declarations patching makes the bigger bulk of the changes, everything else is some minimal plumbing changes (some in the libc, some in the allocator, pretty much just for process init + thread teardown + since we cannot do thread-local storage in libc code due to ldso shenanigans, i just store the thread-local heap pointer inside pthread non-abi section and that works good enough)

daanx commented 1 month ago

Thank you for the info; (and if you make a PR please base it on the dev branch).

q66 commented 1 month ago

the double free checking was pretty cheap in most things i tested (either benchmarks or real world things) so i left it at 4, however enabling padding made realloc-heavy things about 2.5x slower (notably our sort(1) went from 0.8s to 2s-ish for some things)

padding checks result in firefox refusing to play videos with sound; some debugging reveals this is due to padding value checks (only those, the canary check is ok) failing in a free() in libnuma global constructor function when it's dlopened as a dependency of ffmpeg - this happens with my integration as well as when using mimalloc as a preload library with otherwise untouched libc - forcing a preload for libnuma too makes it stop happening, however i have not been able to figure out why this happens yet (i suspect the dlopen constructor stuff is a red herring and it happens either due to a bug in mimalloc or in firefox, as firefox does some mozalloc override shenanigans) - i was going to look into it later, since it happens with preload as well it's a lot safer like this

q66 commented 4 weeks ago
* Maybe we should also expose a way to get `mi_heap_empty` .. on the other hand, maybe it is fine as it is since you really hook up at a very low level with the TLS setup. Hope it is going to be efficient in the end to get the thread local heap pointer :-)

i think what i have is pretty much as efficient as it can get; since it has access to the internal structure of the pthread implementation, all it has to do is 1) get the thread pointer (this is done via per-arch asm and is inlined) 2) from it retrieve the pthread struct (either it's that directly or it's at an offset depending on architecture) 3) get the right field (everything here is inlined)

i ran mimalloc-bench and some other things and with MI_SECURE=0 it performs pretty much exactly like a regular standalone build, with my particular hardening setup it's somewhere inbetween standard MI_SECURE=0 and MI_SECURE=4, still performs very well

i'll try to think of some clean abstraction in case we want to upstream this (my current thing is pretty specific to my particular musl integration, though otoh besides the declarations patching it's very slim so it's trivial to maintain downstream)

q66 commented 4 weeks ago

to provide more detail for the firefox thing, this is what triggers the failure:

https://github.com/numactl/numactl/blob/master/libnuma.c#L435

it's strange because the mallocs/reallocs of that happen in the same function

it's triggered from numa_init:

https://github.com/numactl/numactl/blob/master/libnuma.c#L95

the backtrace we get is something like so:

https://img.ayaya.dev/9hNSGioGScjh

what actually fails in free.c is this:

https://github.com/microsoft/mimalloc/blob/dev/src/free.c#L465

(the padding is decoded correctly)

this only happens when triggered from the constructor called by dlopen like http://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c#n2217

i initially thought this was related to the tls being grown which could be perhaps corrupting my heap pointer somehow, but further investigation ruled that out; everything seems to stay intact (and we don't use tls at all here technically, since we store the heap pointer inside pthread instead, and there is no other tls-allocated storage in mimalloc, and it happens identically with a preloaded mimalloc and otherwise untouched libc, which should be fine and supported)

daanx commented 3 weeks ago

I don't know anything about libnuma or get_mempolicy, but looking at the code that triggers the error, it seems wrong?

if (nodemask_sz == 0) {/* fall back on error */
        int pol;
        unsigned long *mask = NULL;
        nodemask_sz = 16;
        do {
            nodemask_sz <<= 1;
            mask = realloc(mask, nodemask_sz / 8);
            if (!mask)
                return;
        } while (get_mempolicy(&pol, mask, nodemask_sz + 1, 0, 0) < 0 && errno == EINVAL &&
                nodemask_sz < 4096*8);
        free(mask);
    }

https://github.com/numactl/numactl/blob/master/libnuma.c#L435

The docs say _If the mode argument is not NULL, then getmempolicy() will store the policy mode and any optional mode flags of the requested NUMA policy in the location pointed to by this argu‐ ment. If nodemask is not NULL, then the nodemask associated with the policy will be stored in the location pointed to by this argument. maxnode specifies the number of node IDs that can be stored into nodemask—that is, the maximum node ID plus one. The value specified by maxnode is always rounded to a multiple of sizeof(unsigned long)*8.

but that makes no sense (as sizeof(unsigned long)*8 is 64?). Maybe I am missing something here -- but it would be good to rule out that there is no actual buffer overflow. Also, since this is a "backup" code path in case the other method fails, maybe that is why it only fails on pre-load and not otherwise.

Anyway, just my 2c.

q66 commented 3 weeks ago

oh, you might actually be right - i was too preoccupied with "free() fails on stuff that was obviously reallocated in the same code block" that i did not consider that get_mempolicy itself may be corrupting the memory and therefore breaking the allocator

i will test this (that said, the padding checking is still too expensive in my testing to enable in prod, unfortunately)

q66 commented 3 weeks ago

fixed it: https://github.com/chimera-linux/cports/commit/9cc6ce34c9124660f691c9f37d391b20a6ae580f

the correct value is nodemask_sz, not nodemask_sz / 8, which can be derived from the prototype of the function in the documentation

q66 commented 3 weeks ago

fixed it properly... i think: https://github.com/chimera-linux/cports/commit/c9bd817091d8db290c195bee831f19379fef854b

(why is arithmetic so hard :P)

daanx commented 3 weeks ago

Ha, great that the mimalloc padding was working as intended and caught the bug :-) Bugs in "fail paths" are usually tricky to catch. Not sure how connected you are in the Linux community but perhaps the same issue exists in other distributions or libnuma and worth fixing upstream?