jemalloc initializes mutexes too early

krytarowski commented 4 years ago

libpthread(3) functions are redirected to libc stubs before libpthread(3) initialization on NetBSD.

#4  0x00007f7ff72d94a0 in je_malloc_mutex_init (mutex=0x7f7fffffe600,
mutex@entry=0x7f7ff70000d0, name=name@entry=0x7f7ff7387106 "base",
    rank=rank@entry=18,
lock_order=lock_order@entry=malloc_mutex_rank_exclusive) at
/usr/src/external/bsd/jemalloc/lib/../dist/src/mutex.c:167
#5  0x00007f7ff731834f in je_base_new (tsdn=tsdn@entry=0x0,
ind=ind@entry=0, extent_hooks=0x7f7ff75ca120 <je_extent_hooks_default>)
    at /usr/src/external/bsd/jemalloc/lib/../dist/src/base.c:366
#6  0x00007f7ff73185d2 in je_base_boot (tsdn=tsdn@entry=0x0) at
/usr/src/external/bsd/jemalloc/lib/../dist/src/base.c:512
#7  0x00007f7ff731011d in malloc_init_hard_a0_locked () at
/usr/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:1327
#8  0x00007f7ff73107f5 in malloc_init_hard () at
/usr/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:1549
#9  0x00007f7ff723f924 in ?? () from /usr/lib/libc.so.12
#10 0x00007f7ff7ef9800 in ?? ()
#11 0x00007f7ff723ac09 in _init () from /usr/lib/libc.so.12
#12 0x0000000000000000 in ?? ()

This results in creation of 'dummy' mutexes and crash later.

JEMALLOC_LAZY_LOCK does not help either as it still initializes the lock.

FreeBSD uses an internal mutex_init function: _pthread_mutex_init_calloc_cb as it also needs to allocate during mutex_init, this is not the case for NetBSD.

What is the way forward? Can we delay initialization of mutexes in jemalloc until the moment we really need them? As a last resort option we could introduce an internal function similar to _pthread_mutex_init_calloc_cb that is a hack initializing a mutex early unconditionally no matter whether libpthread is ready or not.

For the time being we introduced a temporary hack to stop redirecting certain functions from libpthread to libc stub functions.

krytarowski commented 4 years ago

Would it be possible to rely completely on futexes instead of libpthread's mutexes?

krytarowski commented 4 years ago

Here is an example implementation of mutex_lock, mutex_unlock and mutex_trylock:

https://locklessinc.com/articles/mutex_cv_futex/

krytarowski commented 4 years ago

We fixed it inside libc to stop using malloc() too early.

krytarowski commented 4 years ago

Actually the fix did not work fully and was reverted. It broke 'git grep' and some other programs.

$ gdb --args git grep test1
GNU gdb (GDB) 8.3
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64--netbsd".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from git...
(No debugging symbols found in git)
(gdb) r
Starting program: /usr/pkg/bin/git grep test1
warning: the debug information found in "/usr/libdata/debug//usr/lib/libc.so.12.214.debug" does not match "/usr/lib/libc.so.12" (CRC mismatch).

warning: the debug information found in "/usr/libdata/debug//lib/libc.so.12.214.debug" does not match "/usr/lib/libc.so.12" (CRC mismatch).

[New LWP 2 of process 520]
[New LWP 3 of process 520]
[New LWP 4 of process 520]
[New LWP 5 of process 520]
[New LWP 6 of process 520]
[New LWP 7 of process 520]
[New LWP 8 of process 520]
[New LWP 9 of process 520]
[New process 520]
[Detaching after fork from child process 352]

Thread 4 "" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 4 of process 520]
0x00007f7ff66e2401 in je_extent_heap_remove () from /usr/lib/libc.so.12
(gdb) bt
#0  0x00007f7ff66e2401 in je_extent_heap_remove () from /usr/lib/libc.so.12
#1  0x00007f7ff670a047 in ?? () from /usr/lib/libc.so.12
#2  0x00007f7ff66b4e74 in je_tcache_bin_flush_small () from /usr/lib/libc.so.12
#3  0x00007f7ff67133bd in free () from /usr/lib/libc.so.12
#4  0x000000000051333c in grep_source_clear ()
#5  0x000000000043f1f6 in run ()
#6  0x00007f7ff6c0cbc2 in pthread__create_tramp (cookie=0x7f7ff7e20000) at /usr/src/lib/libpthread/pthread.c:587
#7  0x00007f7ff668da30 in ?? () from /usr/lib/libc.so.12
#8  0x0000000000000000 in ?? ()

My changes: https://github.com/NetBSD/src/commit/cf38bb24267f99e069d7789bef1207537efdaa71 https://github.com/NetBSD/src/commit/8ad1b664f9f29b45dac42b8993b7a7090cd435bb

I tried to replace malloc+free in pthread_atfork() with mmap()+munmap() to avoid premature initialization. The initialization is triggered by libpthread bootstrap routine that calls pthread_atfork().

Some hints are welcome. The backtrace from jemalloc is cryptic to me.

davidtgoldblatt commented 4 years ago

Can I clarify the underlying goal? Is the deferring-initialization thing necessary for correctness (i.e. because pthread_mutex_init allocates) or is the goal just to avoid unnecessary locking costs in the case where the process is single-threaded. (I guess what I'm asking is, why do the stubbing strategy for initialization?).

krytarowski commented 4 years ago

The ultimate goal is to avoid interfering with libpthread initialization.

During my initial message in this thread I did not know what is the root cause for initializing early. Today I know that it is caused by pthread_atfork(3) called inside pthread__init(). I abandon in this thread the libpthread-less mutexes as it was a red herring.

I tried to switch pthread_atfork(3) from malloc(3) to mmap(2), but it caused cryptic crashes in certain applications that involve forking (and threads?), namely git grep and few others. I can see that jemalloc does some magic with pthread_atfork(3), does it make any assumptions that there was used malloc(3) with this call and it would cause fallout?

davidtgoldblatt commented 4 years ago

We do call pthread_atfork as part of malloc bootstrapping. If pthread initialization also calls pthread_atfork, and that allocates, it could lead to some unexpected reentrancy scenarios -- do you have an example failure stack trace?

krytarowski commented 4 years ago

There is no crash in that scenario (at least I am not aware).

krytarowski commented 4 years ago

I have reworked pthread(3) + libc initialization code again and it should be fixed as we no longer initialize jemalloc too early.

krytarowski commented 4 years ago

# mpv file
segfault
# gdb mpv mpv.core

Core was generated by `mpv'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007796410ebd31 in je_extent_heap_remove (ph=ph@entry=0x779630e05a58, phn=phn@entry=0x779631207340) at /usr/src/external/bsd/jemalloc/lib/../dist/src/extent.c:278
278     ph_gen(, extent_heap_, extent_heap_t, extent_t, ph_link, extent_snad_comp)
[Current thread is 1 (process 3)]
(gdb) bt
#0  0x00007796410ebd31 in je_extent_heap_remove (ph=ph@entry=0x779630e05a58, phn=phn@entry=0x779631207340) at /usr/src/external/bsd/jemalloc/lib/../dist/src/extent.c:278
#1  0x0000779641113977 in arena_bin_slabs_nonfull_remove (slab=0x779631207340, bin=0x779630e059e0) at /usr/src/external/bsd/jemalloc/lib/../dist/src/arena.c:1484
#2  arena_dissociate_bin_slab (bin=0x779630e059e0, slab=0x779631207340, arena=0x779630e008c0) at /usr/src/external/bsd/jemalloc/lib/../dist/src/arena.c:1484
#3  arena_dalloc_bin_locked_impl (tsdn=tsdn@entry=0x77964c1f90a0, arena=arena@entry=0x779630e008c0, slab=0x779631207340, ptr=<optimized out>, junked=junked@entry=true) at /usr/src/external/bsd/jemalloc/lib/../dist/src/arena.c:1546
#4  0x0000779641114d49 in je_arena_dalloc_bin_junked_locked (tsdn=tsdn@entry=0x77964c1f90a0, arena=arena@entry=0x779630e008c0, extent=<optimized out>, ptr=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../dist/src/arena.c:1562
#5  0x00007796410be7a5 in je_tcache_bin_flush_small (tsd=tsd@entry=0x77964c1f90a0, tcache=tcache@entry=0x77964c1f9260, tbin=tbin@entry=0x77964c1f93f0, binind=binind@entry=16, rem=5) at /usr/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/tsd.h:138
#6  0x00007796410bfc72 in je_tcache_event_hard (tsd=tsd@entry=0x77964c1f90a0, tcache=tcache@entry=0x77964c1f9260) at /usr/src/external/bsd/jemalloc/lib/../dist/src/tcache.c:54
#7  0x000077964111afb5 in tcache_event (tcache=<optimized out>, tsd=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/tcache_inlines.h:37
#8  tcache_alloc_large (slow_path=<optimized out>, zero=<optimized out>, binind=<optimized out>, size=<optimized out>, tcache=<optimized out>, arena=<optimized out>, tsd=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/tcache_inlines.h:159
#9  arena_malloc (slow_path=<optimized out>, tcache=<optimized out>, zero=<optimized out>, ind=<optimized out>, size=<optimized out>, arena=<optimized out>, tsdn=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/arena_inlines_b.h:98
#10 iallocztm (slow_path=<optimized out>, arena=<optimized out>, is_internal=<optimized out>, tcache=<optimized out>, zero=<optimized out>, ind=<optimized out>, size=<optimized out>, tsdn=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../include/jemalloc/internal/jemalloc_internal_inlines_c.h:53
#11 imalloc_no_sample (ind=<optimized out>, usize=<optimized out>, size=<optimized out>, tsd=<optimized out>, dopts=<optimized out>, sopts=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:1742
#12 imalloc_body (tsd=<optimized out>, dopts=<optimized out>, sopts=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:1938
#13 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at /usr/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2041
#14 malloc (size=<optimized out>) at /usr/src/external/bsd/jemalloc/lib/../dist/src/jemalloc.c:2071
#15 0x0000779642c1287b in luaM_realloc_ (L=L@entry=0x77964c07f000, block=block@entry=0x0, osize=osize@entry=0, nsize=nsize@entry=160) at lmem.c:84
#16 0x0000779642c181b4 in setnodevector (size=4, t=0x77964b573940, L=0x77964c07f000) at ltable.c:291
#17 luaH_resize (L=L@entry=0x77964c07f000, t=t@entry=0x77964b573940, nasize=0, nhsize=nhsize@entry=4) at ltable.c:312
#18 0x0000779642c1b01d in luaV_execute (L=L@entry=0x77964c07f000) at lvm.c:603
#19 0x0000779642c0e05b in luaD_call (L=L@entry=0x77964c07f000, func=<optimized out>, nResults=nResults@entry=0, allowyield=allowyield@entry=0) at ldo.c:402
#20 0x0000779642c095d6 in lua_callk (L=L@entry=0x77964c07f000, nargs=nargs@entry=0, nresults=nresults@entry=0, ctx=ctx@entry=0, k=k@entry=0x0) at lapi.c:905
#21 0x000000000049782a in load_scripts (L=0x77964c07f000) at ../player/lua.c:271
#22 0x0000779642c0dc47 in luaD_precall (L=L@entry=0x77964c07f000, func=<optimized out>, nresults=0) at ldo.c:319
#23 0x0000779642c0e03e in luaD_call (L=0x77964c07f000, func=<optimized out>, nResults=<optimized out>, allowyield=<optimized out>) at ldo.c:401
#24 0x0000779642c0d488 in luaD_rawrunprotected (L=L@entry=0x77964c07f000, f=f@entry=0x779642c0788b <f_call>, ud=ud@entry=0x7796323efc90) at ldo.c:131
#25 0x0000779642c0e2eb in luaD_pcall (L=L@entry=0x77964c07f000, func=func@entry=0x779642c0788b <f_call>, u=u@entry=0x7796323efc90, old_top=48, ef=<optimized out>) at ldo.c:603
#26 0x0000779642c096b3 in lua_pcallk (L=L@entry=0x77964c07f000, nargs=nargs@entry=0, nresults=nresults@entry=0, errfunc=errfunc@entry=-2, ctx=ctx@entry=0, k=k@entry=0x0) at lapi.c:949
#27 0x0000000000498e4d in run_lua (L=0x77964c07f000) at ../player/lua.c:332
#28 0x0000779642c0dc47 in luaD_precall (L=L@entry=0x77964c07f000, func=<optimized out>, nresults=0) at ldo.c:319
#29 0x0000779642c0e03e in luaD_call (L=0x77964c07f000, func=<optimized out>, nResults=<optimized out>, allowyield=<optimized out>) at ldo.c:401
#30 0x0000779642c0d488 in luaD_rawrunprotected (L=L@entry=0x77964c07f000, f=f@entry=0x779642c0788b <f_call>, ud=ud@entry=0x7796323efed0) at ldo.c:131
#31 0x0000779642c0e2eb in luaD_pcall (L=L@entry=0x77964c07f000, func=func@entry=0x779642c0788b <f_call>, u=u@entry=0x7796323efed0, old_top=16, ef=<optimized out>) at ldo.c:603
#32 0x0000779642c096b3 in lua_pcallk (L=L@entry=0x77964c07f000, nargs=nargs@entry=1, nresults=nresults@entry=0, errfunc=errfunc@entry=0, ctx=ctx@entry=0, k=k@entry=0x0) at lapi.c:949
#33 0x00000000004992e7 in mp_cpcall (func=0x498b47 <run_lua>, ud=0x77964c082040, L=0x77964c07f000) at ../player/lua.c:102
#34 load_lua (client=<optimized out>, fname=0x77964c0d08d0 "@osc.lua") at ../player/lua.c:365
#35 0x00000000004a0e28 in script_thread (p=0x77964c155fe0) at ../player/scripting.c:95
#36 0x000077964480cbd2 in ?? () from /usr/lib/libpthread.so.1
#37 0x000077964108fd10 in ?? () from /usr/lib/libc.so.12
Backtrace stopped: Cannot access memory at address 0x7796323f0000
(gdb)

It did not work... I don't understand the jemalloc crashes. Basically we have got a conflict that we call pthread_atfork() during libpthread init and this starts initializing jemalloc. If we defer it at the very end when libpthread interfaces are usable, something crashes with cryptic backtrace to me.

Any suggestion what is going on is welcome.

krytarowski commented 4 years ago

There is a isthreaded variable in jemalloc, how usabe it is? Could I use it to defer initialization (any usage of pthread APIs) inside jemalloc?

davidtgoldblatt commented 4 years ago

So in the stack trace above, there are lua frames on the stack calling malloc? I don't think that could be pthread initialization, then; lua presumably hasn't had a chance to set up any atfork handlers.

isthreaded is used in lazy locking on FreeBSD, but I don't think I remember the details -- I think it's for similar recursion issues, though.

Is there an easy repro for this? I.e. can I spin up a VM and poke at some crash?

krytarowski commented 4 years ago

Could we coordinate on IRC (FreeNode, channel #jemalloc)? I will help to prepare a machine in VM.

krytarowski commented 4 years ago

Steps to reproduce:

Please wait for images from 17-Feb-2020 without broken jemalloc again.
Fetch and install http://nycdn.netbsd.org/pub/NetBSD-daily/HEAD/latest/images/NetBSD-9.99.47-amd64.iso
Setup pkgsrc, build git and clone https://github.com/netbsd/src
Apply changes from http://netbsd.org/~kamil/patch-00229-pthread_malloc_init_model.txt
Build the distribution.
Setup chroot for new distribution and test programs inside it, see that some of them break.

In practice you need some knowledge how to perform each step if you are totally unfamiliar with BSD so I can assist on IRC.

krytarowski commented 4 years ago

@davidtgoldblatt ping?

davidtgoldblatt commented 4 years ago

I think I was able to mostly follow the steps you gave there. I'm sort of stuck on building:

it seems build.sh doesn't do incremental builds? Its first step if rm -r on a bunch of build directories. Is there a way to get second, third, etc. runs to take less time?
Trying just make in the build directory seems to hit build errors in some targets. Is this expected?

Happy to chat over IRC; where/when were you thinking?

krytarowski commented 4 years ago

I'm there on IRC almost continuously, just ping me.

For incremental builds, there is the ./build.sh -u switch.

Manual make invocations are for experienced users/developers.

krytarowski commented 4 years ago

It was me in the IRC channel.. but it was night for me.

davidtgoldblatt commented 4 years ago

BTW, I haven't forgotten about this, but things will be busy for a while on my end.

krytarowski commented 4 years ago

Thanks! Take care.

jemalloc / jemalloc

jemalloc initializes mutexes too early #1753