Open valenting opened 2 months ago
is it possible that firefox does some malloc() games? i.e. interposes malloc() but doesn't interpose malloc_usable_size()? that's simply broken.
> is it possible that firefox does some malloc() games? i.e. interposes malloc() but doesn't interpose malloc_usable_size()? that's simply broken.
I've seen similar crashes due to the exact same issue (override malloc() but not malloc_usable_size()). Ideally we wouldn't use malloc_usable_size() in NSS stuff since this seems to have happened at least twice now.
> is it possible that firefox does some malloc() games? i.e. interposes malloc() but doesn't interpose malloc_usable_size()? that's simply broken.
I believe Firefox also overrides malloc_usable_size with the jemalloc version - unless there's a bug somewhere. CC: @jesup @PaulBone
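For context, here is a minimal sketch, not Firefox's or jemalloc's actual code, of why overriding malloc() without also overriding malloc_usable_size() is broken: the replacement allocator hands out pointers that glibc's bookkeeping knows nothing about, so glibc's malloc_usable_size() ends up reading metadata that isn't there. The my_malloc name below is made up for illustration.

/* Hypothetical interposer that prefixes each allocation with its own size
 * header; a real interposer would export this under the name "malloc". */
#include <malloc.h>   /* glibc's malloc_usable_size() */
#include <stdio.h>
#include <stdlib.h>

static void *my_malloc(size_t n) {
    size_t *hdr = malloc(sizeof(size_t) + n);   /* header + payload */
    if (!hdr)
        return NULL;
    *hdr = n;
    return hdr + 1;   /* caller only sees the bytes after the header */
}

int main(void) {
    char *p = my_malloc(32);

    /* If my_malloc() were interposed as malloc() while malloc_usable_size()
     * stayed glibc's, callers (e.g. NSS code) would end up doing this:
     * glibc then interprets the bytes just before p as one of its own chunk
     * headers, which they are not, giving a garbage result or a crash. */
    size_t sz = malloc_usable_size(p);

    printf("reported usable size: %zu (requested 32)\n", sz);
    return 0;
}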
@valenting Aren't the overrides strictly internal? I don't see any realloc symbol definition in the Fedora 129 release binaries.
I was mistaken: firefox-bin interposes realloc, but also malloc_usable_size.
Well, for some reason mozalloc_abort() ended up within the stack frame of libnss_resolve, which suggests that our nss module calls back into mozilla code, and that indicates to me that some interposing is taking place, because I don't see how that could otherwise happen (unless the stack trace is somehow entirely corrupted).
> 0.39.0.0
that's not a systemd version btw. Which systemd version is this about specifically?
btw, is there any chance to get a full backtrace for this, with libnss-resolve also enriched with debug symbols? otherwise this is not really actionable to us.
> 0.39.0.0
> that's not a systemd version btw. Which systemd version is this about specifically?
There was one crash report that had this version I think, but most of the recent crashes have version 2:
libnss_resolve.so.2 2.0.0.0 9A5A9A2015DD1A70EF5761D5ED4496E80 libnss_resolve.so.2
> btw, is there any chance to get a full backtrace for this, with libnss-resolve also enriched with debug symbols? otherwise this is not really actionable to us.
I believe we don't have debug symbols for this shared object. @gabrielesvelto : is there a chance to process the symbols for these versions?
> Well, for some reason mozalloc_abort() ended up within the stack frame of libnss_resolve, which suggests that our nss module calls back into mozilla code, and that indicates to me that some interposing is taking place, because I don't see how that could otherwise happen (unless the stack trace is somehow entirely corrupted).
Something in libnss_resolve calls abort() which Firefox intercepts.
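As a rough illustration of the mechanism (not Firefox's actual build setup): when the main executable exports its own abort(), the dynamic linker resolves abort() calls from dynamically loaded modules, such as NSS plugins, to the executable's definition instead of glibc's, which is how mozalloc_abort can show up inside libnss_resolve frames. The module path below is hypothetical.

/* Sketch: build with something like "cc -rdynamic main.c -ldl" so the
 * executable's symbols are visible to dlopen()ed code. */
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

void abort(void) {
    /* Stand-in for Firefox's mozalloc_abort(): an abort() inside a loaded
     * module resolves here rather than to glibc's abort(). */
    fprintf(stderr, "abort() intercepted by the main executable\n");
    _exit(127);
}

int main(void) {
    /* A module that fails an internal assertion (like systemd's
     * log_assert_failed() path) would call abort() and land above. */
    void *handle = dlopen("./hypothetical_module.so", RTLD_NOW);
    if (!handle)
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 0;
}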
> There was one crash report that had this version I think, but most of the recent crashes have version 2:
> libnss_resolve.so.2 2.0.0.0 9A5A9A2015DD1A70EF5761D5ED4496E80 libnss_resolve.so.2
but what systemd release does that translate to? the so version doesn't help us much, it's not related to the systemd release
> There was one crash report that had this version I think, but most of the recent crashes have version 2:
> libnss_resolve.so.2 2.0.0.0 9A5A9A2015DD1A70EF5761D5ED4496E80 libnss_resolve.so.2
> but what systemd release does that translate to? the so version doesn't help us much, it's not related to the systemd release
I don't think the minidump information captures the systemd version - if you know of a way to extract that let me know. The first reported crash was on 2024-05-23 12:33:24, so that might help narrow down which systemd release was first affected.
These crashes have better stack traces:
The paths in the first crash suggest this is from the systemd-255.10-1 package for Fedora 40.
I'll try to extract the debug information for Arch too to clean up the stack trace in the first comment.
Looks from those traces that it's "systemd-255.10-1.fc40.x86_64" so Fedora Core 40's systemd 255.10-1.
_nss_resolve_gethostbyname4_r (nss-resolve.c:233)
...
varlink_unref (varlink.c:611)
...
varlink_clear (varlink.c:568)
...
safe_close (fd-util.c:75)
Which then aborts because the fd is invalid:
https://github.com/systemd/systemd/blob/main/src/basic/fd-util.c#L75C17-L75C54
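For readers following along, here is a simplified paraphrase of the guard at that link, not a verbatim copy of systemd's safe_close(): ordinary close() failures are tolerated, but EBADF is treated as proof that the process closed a descriptor it did not own, so the assertion aborts rather than continuing. The name safe_close_sketch is made up; the real code lives in src/basic/fd-util.c.

#include <assert.h>
#include <errno.h>
#include <unistd.h>

static int safe_close_sketch(int fd) {
    if (fd >= 0) {
        int r = close(fd);

        /* EBADF means the descriptor was not valid: it was never opened,
         * or, far more likely in a large process, somebody already closed
         * it (and possibly an unrelated recycled fd along with it). */
        assert(!(r < 0 && errno == EBADF));
    }
    return -EBADF;   /* convention: callers store this back as the "unset" marker */
}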
I have reprocessed Arch crashes with full debug information and I found different stack traces triggered by the same assertion, here's one:
https://crash-stats.mozilla.org/report/index/34616e83-9229-4efe-ac80-995840240829
The stack trace looks like this:
0 firefox MOZ_Crash(char const*, int, char const*) /usr/src/debug/firefox-developer-edition/firefox-130.0/obj/dist/include/mozilla/Assertions.h:317
0 firefox mozalloc_abort /usr/src/debug/firefox-developer-edition/firefox-130.0/memory/mozalloc/mozalloc_abort.cpp:35
1 firefox abort /usr/src/debug/firefox-developer-edition/firefox-130.0/memory/mozalloc/mozalloc_abort.cpp:88
2 libudev.so.1 log_assert_failed /usr/src/debug/systemd/systemd/src/basic/log.c:995
3 libudev.so.1 safe_close /usr/src/debug/systemd/systemd/src/basic/fd-util.c:75
4 libudev.so.1 closep /usr/src/debug/systemd/systemd/src/basic/fd-util.h:45
4 libudev.so.1 device_set_syspath /usr/src/debug/systemd/systemd/src/libsystemd/sd-device/sd-device.c:148
5 libudev.so.1 device_new_from_syspath /usr/src/debug/systemd/systemd/src/libsystemd/sd-device/sd-device.c:271
6 libudev.so.1 sd_device_new_from_syspath /usr/src/debug/systemd/systemd/src/libsystemd/sd-device/sd-device.c:280
6 libudev.so.1 udev_device_new_from_syspath /usr/src/debug/systemd/systemd/src/libudev/libudev-device.c:261
7 libxul.so (anonymous namespace)::LinuxGamepadService::ScanForDevices() /usr/src/debug/firefox-developer-edition/firefox-130.0/dom/gamepad/linux/LinuxGamepad.cpp:329
[...]
OK, so this very strongly indicates that something is closing some fds behind our back.
in systemd if we get EBADF from close() we'll hit an assert, since that always means there's some form of corruption taking place, i.e. some unrelated code closing our fds.
This strongly points to some other thread in ffox closing the wrong fd and thus tripping us up.
so yeah, all three backtraces that carry symbols posted here show an fd issue, and in different pieces of our code. Hence I strongly doubt this is a systemd problem, but rather a firefox fd double close issue or something similar.
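To make the suspected failure mode concrete, here is a small self-contained demonstration, unrelated to either codebase, of how a single stray double close() hands someone else a dead descriptor: the kernel reuses the lowest free descriptor number, so the second close() lands on a descriptor that meanwhile belongs to other code, whose own close() later fails with exactly the EBADF the assertion catches.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int a = open("/dev/null", O_RDONLY);   /* say this returns fd 3 */
    close(a);                              /* first close: fine */

    int b = open("/dev/null", O_RDONLY);   /* the kernel recycles fd 3 */

    close(a);                              /* stray second close of 'a' silently closes 'b' */

    if (close(b) < 0 && errno == EBADF)
        fprintf(stderr, "close(b) failed with EBADF: victim of someone else's double close\n");
    return 0;
}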
I agree, especially given that it's totally different stacks entering different systemd functions and we definitely do fiddle with closing fds (e.g. after fork).
That said, both crashes are in the parent (which removes errors after forking as a potential cause), and the second crash seems to be in libudev functions that open the erroneous fd itself.
Looking through the code, I see: https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-device/sd-device.c#L148C37-L148C39
Which seems to indicate any (unhandled?) error condition within this would cause exactly the crash seen. So I'm not so sure anymore it's a Firefox bug.
> I agree, especially given that it's totally different stacks entering different systemd functions and we definitely do fiddle with closing fds (e.g. after fork).
hmm, what does ffox do regarding forking? I hope you are not forking and expecting libc NSS to still work in the child after fork() before execve()?
Forking isn't relevant here, the crashes are all in the parent process.
> Looking through the code, I see: https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-device/sd-device.c#L148C37-L148C39
> Which seems to indicate any (unhandled?) error condition within this would cause exactly the crash seen. So I'm not so sure anymore it's a Firefox bug.
the cleanup handler attached to that fd var is a NOP if the fd is negative. hence we initialize the fd to -EBADF, so that until the fd is actually initialized the cleanup handler has no effect.
it's a general pattern in our codebase: we use gcc cleanup handlers, and these cleanup handlers all are graceful so that "unset" (i.e. null in case of pointers, or < 0 in case of fds) variables result in NOP cleanup handling.
Or in other words: you hit the assert here not because of the EBADF assignment, but because of an assignment >= 0 further down.
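A minimal sketch of the pattern being described, using illustrative names rather than systemd's real macros: the fd variable carries a GCC cleanup attribute, the handler is a NOP for negative values, and initialising to -EBADF keeps the handler inert until a real descriptor is assigned.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Cleanup handler: runs automatically when the variable leaves scope;
 * negative values mean "nothing to close", so it does nothing for them. */
static void closep_sketch(int *fd) {
    if (*fd >= 0)
        close(*fd);
}

#define _cleanup_close_sketch_ __attribute__((cleanup(closep_sketch)))

int read_something_sketch(void) {
    /* Initialised to -EBADF: any early return before open() succeeds leaves
     * the cleanup handler a NOP.  Only once a real fd (>= 0) is stored here
     * can close() actually be reached, which is the point above: the assert
     * fires because of a later >= 0 assignment, not the -EBADF one. */
    _cleanup_close_sketch_ int fd = -EBADF;

    fd = open("/etc/hostname", O_RDONLY | O_CLOEXEC);   /* example path */
    if (fd < 0)
        return -errno;

    /* ... use fd; it is closed automatically when this function returns ... */
    return 0;
}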
I haven't looked at the libnss_resolve crash yet but I have looked at the udev one. The assertion is being triggered within a single call-chain that is completely contained in libudev, the file descriptor is being opened and closed there. To trigger this issue with Firefox' code we'd need another thread closing a file descriptor it doesn't own while that function is running.
I find it rather unlikely especially given that most of our file descriptor usage is either done via RAII C++ classes or Rust code that tracks file descriptor ownership. There is some bare file descriptor manipulation too but it mostly happens in code that's external to Firefox (e.g. mesa), but even then I find it unlikely that some code would close a file descriptor it doesn't own, and do so in a time window so short to be able to race that particular udev function.
Reading libudev's code I found myself in the chase() function, which is extremely large and complex. Could it be returning a file descriptor that has already been closed in the ret_fd parameter? I tried following the function but it's so large I am unable to manually check all the possible paths.
[edit] Fixed a typo, I had mixed the library name with our NSS
varlink_clear() also ends up doing things like this:
close_many(v->input_fds, v->n_input_fds);
v->input_fds = mfree(v->input_fds);
v->n_input_fds = 0;
...
close_many(v->output_fds, v->n_output_fds);
v->output_fds = mfree(v->output_fds);
v->n_output_fds = 0;
close_many(v->pushed_fds, v->n_pushed_fds);
v->pushed_fds = mfree(v->pushed_fds);
v->n_pushed_fds = 0;
So it's not inconceivable that an fd ending up on two of those lists could trigger this error. I mean, I'll be the last to say Firefox is free of bugs, but after looking closer it may well be that the sheer amount of usage is simply exposing two similar bugs outside our own code (which crash in the same guard assertion).
Adding another datapoint in case it helps. We keep crash reports for six months, and we have had this type of crash for that entire period. So the crash is unlikely to be a recent regression either in Firefox or in systemd's libraries.
Focusing on the libnss_resolve crash, practically all the crashes that we have on file appear to be using systemd 255. All the crashes are also originating on Arch and Fedora, which AFAIK are the only distros shipping systemd 255, so that makes sense. There seems to be a strong correlation between these crashes and systemd version 255, otherwise we'd see them on Debian and Ubuntu too, which have far more users. I couldn't find crashes with earlier versions but I'll keep looking.
One final datapoint: the oldest libnss_resolve crash I could find is this one. This is for a Flatpak packaged version and contains this mapping:
/usr/lib/x86_64-linux-gnu/libsystemd.so.0.37.0
From what I can tell this should correspond to systemd 254.
Well, we haven't gotten any reports about the NSS module in a long time, even though it is loaded into so many programs. And the few times an issue was reported it turned out to be something about malloc interposition and not a bug on our side. Hence this time again I'd guess it's not our fault here.
> RAII C++ classes
well, in systemd we exclusively process fds via GNU C cleanup handlers and move them around via TAKE_FD(), which means we are as close to C++ RAII as you can possibly get in C. Hence, in our tree such fd issues are not unheard of, but speaking from experience they are very unlikely in practice.
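For readers unfamiliar with the idiom, a rough sketch of a TAKE_FD()-style ownership transfer combined with the cleanup handler shown earlier; the names are made up and this is simplified, not systemd's exact macros.

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static void closep_sketch(int *fd) {
    if (*fd >= 0)
        close(*fd);
}
#define _cleanup_close_sketch_ __attribute__((cleanup(closep_sketch)))

/* TAKE_FD()-like macro: hand the descriptor to the caller and leave the
 * "unset" marker behind, so the local cleanup handler will not close it
 * as well.  Exactly one owner at any time, the C approximation of RAII
 * move semantics mentioned above. */
#define TAKE_FD_SKETCH(fd) ({ int _tmp_ = (fd); (fd) = -EBADF; _tmp_; })

int open_and_hand_over_sketch(void) {
    _cleanup_close_sketch_ int fd = -EBADF;

    fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    /* Ownership moves to the caller; the cleanup handler now sees -EBADF
     * and stays a NOP, so there is no double close. */
    return TAKE_FD_SKETCH(fd);
}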
can you reproduce the issue? maybe strace the thing?
I can try to reproduce it, let's see if I can get an STR.
I can't reproduce the crash but I have found something a lot more interesting: all of these crashes are happening on tainted kernels. 100% of them are using Nvidia closed-source drivers. We've been assuming that this is a double close(), but what if the error returned by the failing close() call isn't EBADF? What if it's returning some other error while closing a pseudo-file that the Nvidia kernel modules are placing somewhere under a pseudo-filesystem? The reason why I suspect this is that several of these crashes involve scanning files under /sys (see the crashes in libudev). I don't know how this could trigger the getaddrinfo() crashes but it can't be an accident that 100% of them are using Nvidia closed drivers and they're hitting the same issue.
From my reading of the systemd code, it only asserts on EBADF, not on other errors.
dropping from the milestone, because there are indications this might be a problem with the nvidia binary driver that just surfaces here, i.e. they close our fd and really shouldn't.
systemd version the issue has been seen with
0.39.0.0, 255.10
Used distribution
No response
Linux kernel version used
No response
CPU architectures issue was seen on
x86_64
Component
systemd-resolved
Expected behaviour you didn't see
No response
Unexpected behaviour you saw
Crash reports from Firefox having get_nss_addresses / libnss_resolve.so.2 on the stack.
Steps to reproduce the problem
Unknown / This is the regular Firefox DNS resolution codepath.
Additional program output to the terminal or log subsystem illustrating the issue
Original glibc report at https://sourceware.org/bugzilla/show_bug.cgi?id=32139
We've been seeing some Firefox crashes in libnss_resolve.so recently. https://bugzilla.mozilla.org/show_bug.cgi?id=1909130
First reported crash was on 2024-05-23 12:33:24
Crash distribution:
OS | crash reports | percentage of total
Arch Linux | 90 | 83.3%
EndeavourOS Linux | 9 | 8.3%
CachyOS | 6 | 5.6%
Debian GNU/Linux trixie/sid | 2 | 1.9%
Garuda Linux Bird of Prey | 1 | 0.9%
Recent bug report: https://crash-stats.mozilla.org/report/index/8e9d33e0-974d-4828-8299-90d560240901
Sample stack trace:
I think the referenced line is: https://github.com/bminor/glibc/blob/1927f718fcc48bdaea03086bdc2adf11279d655b/nss/getaddrinfo.c#L652