Crash in get_nss_addresses / libnss_resolve.so.2 (nvidia closed source related?)

valenting commented 2 months ago

systemd version the issue has been seen with

~~0.39.0.0~~ 255.10

Used distribution

No response

Linux kernel version used

No response

CPU architectures issue was seen on

x86_64

Component

systemd-resolved

Expected behaviour you didn't see

No response

Unexpected behaviour you saw

Crash reports from Firefox having get_nss_addresses / libnss_resolve.so.2 on the stack.

Steps to reproduce the problem

Unknown / This is the regular Firefox DNS resolution codepath.

Additional program output to the terminal or log subsystem illustrating the issue

Original glibc report at https://sourceware.org/bugzilla/show_bug.cgi?id=32139

We've been seeing some Firefox crashes in libnss_resolve.so recently. https://bugzilla.mozilla.org/show_bug.cgi?id=1909130

First reported crash was on 2024-05-23 12:33:24

Crash distribution:

| OS | crash reports | percentage of total |
| --- | --- | --- |
| Arch Linux | 90 | 83.3% |
| EndeavourOS Linux | 9 | 8.3% |
| CachyOS | 6 | 5.6% |
| Debian GNU/Linux trixie/sid | 2 | 1.9% |
| Garuda Linux Bird of Prey | 1 | 0.9% |

Recent bug report: https://crash-stats.mozilla.org/report/index/8e9d33e0-974d-4828-8299-90d560240901

Sample stack trace:

0   firefox     MOZ_Crash(char const*, int, char const*)    /usr/src/debug/firefox-developer-edition/firefox-130.0/obj/dist/include/mozilla/Assertions.h:317    inlined
0   firefox     mozalloc_abort  /usr/src/debug/firefox-developer-edition/firefox-130.0/memory/mozalloc/mozalloc_abort.cpp:35    context
1   firefox     abort   /usr/src/debug/firefox-developer-edition/firefox-130.0/memory/mozalloc/mozalloc_abort.cpp:88    cfi
Ø 2     libnss_resolve.so.2     libnss_resolve.so.2@0x307a      cfi
Ø 3     libnss_resolve.so.2     libnss_resolve.so.2@0x53bc      frame_pointer
Ø 4     libnss_resolve.so.2     libnss_resolve.so.2@0x1d4e2         frame_pointer
Ø 5     libnss_resolve.so.2     libnss_resolve.so.2@0xc931      frame_pointer
6   libc.so.6   get_nss_addresses   /usr/src/debug/glibc/glibc/nss/getaddrinfo.c:652    inlined
6   libc.so.6   gaih_inet   /usr/src/debug/glibc/glibc/nss/getaddrinfo.c:1185   inlined
6   libc.so.6   getaddrinfo     /usr/src/debug/glibc/glibc/nss/getaddrinfo.c:2391   frame_pointer
Ø 7     libnspr4.so     libnspr4.so@0x22492         cfi
8   libxul.so   mozilla::net::_GetAddrInfo_Portable(nsTSubstring<char> const&, unsigned short, unsigned short, mozilla::net::AddrInfo**)    /usr/src/debug/firefox-developer-edition/firefox-130.0/netwerk/dns/GetAddrInfo.cpp:244  inlined
8   libxul.so   mozilla::net::GetAddrInfo(nsTSubstring<char> const&, unsigned short, unsigned short, mozilla::net::AddrInfo**, bool)    /usr/src/debug/firefox-developer-edition/firefox-130.0/netwerk/dns/GetAddrInfo.cpp:377  inlined
8   libxul.so   nsHostResolver::ThreadFunc()    /usr/src/debug/firefox-developer-edition/firefox-130.0/netwerk/dns/nsHostResolver.cpp:1946  scan

I think the referenced line is: https://github.com/bminor/glibc/blob/1927f718fcc48bdaea03086bdc2adf11279d655b/nss/getaddrinfo.c#L652

poettering commented 2 months ago

is it possible that firefox does some malloc() games? i.e. interposes malloc() but doesn't interpose malloc_usable_size()? that's simply broken.

DaanDeMeyer commented 2 months ago

> is it possible that firefox does some malloc() games? i.e. interposes malloc() but doesn't interpose malloc_usable_size()? that's simply broken.

I've seen similar crashes due to the exact same issue (override malloc() but not malloc_usable_size()). Ideally we wouldn't use malloc_usable_size() in NSS stuff since this seems to have happened at least twice now.
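To make the suspected failure mode concrete: glibc's `malloc_usable_size()` reads the allocator's own bookkeeping for a pointer, so its answer is only meaningful for pointers returned by the matching `malloc()`. The sketch below is not systemd code; `usable_capacity` and `usable_size_is_consistent` are hypothetical helpers that only illustrate the kind of call that can go wrong when half of the allocator API is interposed.

```c
#include <malloc.h>   /* malloc_usable_size() (glibc extension) */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: report how many bytes the allocator actually
 * reserved for p. Only valid if p came from the same allocator that
 * implements malloc_usable_size(). If malloc() is interposed but
 * malloc_usable_size() is not, the libc version would interpret foreign
 * metadata and may return garbage. */
size_t usable_capacity(void *p) {
        return p ? malloc_usable_size(p) : 0;
}

/* Allocates 'requested' bytes and returns 1 if the reported usable size
 * is at least as large as the request, 0 otherwise. */
int usable_size_is_consistent(size_t requested) {
        void *p = malloc(requested);
        if (!p)
                return 0;
        int ok = usable_capacity(p) >= requested;
        free(p);
        return ok;
}
```

With a matched allocator pair the check holds; the concern raised here is the mismatched case, which this sketch does not reproduce.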

valenting commented 2 months ago

> is it possible that firefox does some malloc() games? i.e. interposes malloc() but doesn't interpose malloc_usable_size()? that's simply broken.

I believe Firefox also overrides malloc_usable_size with the jemalloc version - unless there's a bug somewhere. CC: @jesup @PaulBone

fweimer-rh commented 2 months ago

~~@valenting Aren't the overrides strictly internal? I don't see any realloc symbol definition in the Fedora 129 release binaries.~~ I was mistaken, firefox-bin interposes realloc, but also malloc_usable_size.

poettering commented 2 months ago

Well, for some reason mozalloc_abort() ended up within the stack frame of libnss_resolve, which suggests that for some reason our nss module calls back into mozilla code, and that indicates to me that some interposing is taking place, because i don't see how that could otherwise happen (unless the stack trace is somehow entirely corrupted)

poettering commented 2 months ago

> 0.39.0.0

that's not a systemd version btw. Which systemd version is this about specifically?

poettering commented 2 months ago

btw, is there any chance to get a full backtrace for this, with libnss-resolve also enriched with debug symbols? otherwise this is not really actionable to us.

valenting commented 2 months ago

> 0.39.0.0

> that's not a systemd version btw. Which systemd version is this about specifically?

There was one crash report that had this version I think, but most of the recent crashes have version 2: libnss_resolve.so.2 2.0.0.0 9A5A9A2015DD1A70EF5761D5ED4496E80 libnss_resolve.so.2

> btw, is there any chance to get a full backtrace for this, with libnss-resolve also enriched with debug symbols? otherwise this is not really actionable to us.

I believe we don't have debug symbols for this shared object. @gabrielesvelto : is there a chance to process the symbols for these versions?

> Well, for some reason mozalloc_abort() ended up within the stack frame of libnss_resolve, which suggests that for some reason our nss module calls back into mozilla code, and that indicates to me that some interposing is taking place, because i don't see how that could otherwise happen (unless the stack trace is somehow entirely corrupted)

Something in libnss_resolve calls abort() which Firefox intercepts.

poettering commented 2 months ago

> There was one crash report that had this version I think, but most of the recent crashes have version 2: libnss_resolve.so.2 2.0.0.0 9A5A9A2015DD1A70EF5761D5ED4496E80 libnss_resolve.so.2

but what systemd release does that translate to? the so version doesn't help us much, it's not related to the systemd release

valenting commented 2 months ago

> There was one crash report that had this version I think, but most of the recent crashes have version 2: libnss_resolve.so.2 2.0.0.0 9A5A9A2015DD1A70EF5761D5ED4496E80 libnss_resolve.so.2

> but what systemd release does that translate to? the so version doesn't help us much, it's not related to the systemd release

I don't think the minidump information captures systemd version - if you know of a way to extract that let me know. First reported crash was on 2024-05-23 12:33:24, so that might indicate the systemd release this first affected.

gabrielesvelto commented 2 months ago

These crashes have better stack traces:

The paths in the first crash suggest this is from the systemd-255.10-1 package for Fedora 40.

I'll try to extract the debug information for Arch too to clean up the stack trace in the first comment.

gcp commented 2 months ago

Looks from those traces that it's "systemd-255.10-1.fc40.x86_64" so Fedora Core 40's systemd 255.10-1.

_nss_resolve_gethostbyname4_r (nss-resolve.c:233)
...
varlink_unref (varlink.c:611)
...
varlink_clear (varlink.c:568)
...
safe_close (fd-util.c:75)

Which then aborts because the fd is invalid:

https://github.com/systemd/systemd/blob/main/src/basic/fd-util.c#L75C17-L75C54

gabrielesvelto commented 2 months ago

I have reprocessed Arch crashes with full debug information and I found different stack traces triggered by the same assertion, here's one:

https://crash-stats.mozilla.org/report/index/34616e83-9229-4efe-ac80-995840240829

The stack trace looks like this:

0 firefox      MOZ_Crash(char const*, int, char const*) /usr/src/debug/firefox-developer-edition/firefox-130.0/obj/dist/include/mozilla/Assertions.h:317
0 firefox      mozalloc_abort /usr/src/debug/firefox-developer-edition/firefox-130.0/memory/mozalloc/mozalloc_abort.cpp:35
1 firefox      abort /usr/src/debug/firefox-developer-edition/firefox-130.0/memory/mozalloc/mozalloc_abort.cpp:88
2 libudev.so.1 log_assert_failed /usr/src/debug/systemd/systemd/src/basic/log.c:995
3 libudev.so.1 safe_close /usr/src/debug/systemd/systemd/src/basic/fd-util.c:75
4 libudev.so.1 closep /usr/src/debug/systemd/systemd/src/basic/fd-util.h:45
4 libudev.so.1 device_set_syspath /usr/src/debug/systemd/systemd/src/libsystemd/sd-device/sd-device.c:148
5 libudev.so.1 device_new_from_syspath /usr/src/debug/systemd/systemd/src/libsystemd/sd-device/sd-device.c:271
6 libudev.so.1 sd_device_new_from_syspath /usr/src/debug/systemd/systemd/src/libsystemd/sd-device/sd-device.c:280
6 libudev.so.1 udev_device_new_from_syspath /usr/src/debug/systemd/systemd/src/libudev/libudev-device.c:261
7 libxul.so    (anonymous namespace)::LinuxGamepadService::ScanForDevices() /usr/src/debug/firefox-developer-edition/firefox-130.0/dom/gamepad/linux/LinuxGamepad.cpp:329
[...]

poettering commented 2 months ago

OK, so this very strongly indicates that something is closing some fds behind our back.

in systemd if we get EBADF from close() we'll hit an assert, since that always means there's some form of corruption taking place, i.e. some unrelated code closing our fds.

This strongly points to some other thread in ffox closing the wrong fd and thus tripping us up.
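The policy described above can be modeled in a few lines. This is a simplified sketch, not systemd's actual `safe_close()`: the names `checked_close` and `demo_double_close` are hypothetical, and the sketch returns `-EBADF` where systemd would hit its assert, so the behaviour can be observed without aborting.

```c
#include <errno.h>
#include <unistd.h>

/* Simplified model of the policy: closing a negative fd is a no-op, and
 * EBADF from close() is singled out because it means the fd was already
 * closed (or never valid), i.e. someone else touched it. Other errors
 * are ignored here, since on Linux the fd is released either way. */
int checked_close(int fd) {
        if (fd < 0)
                return 0;
        if (close(fd) < 0 && errno == EBADF)
                return -EBADF;   /* systemd would hit an assert here */
        return 0;
}

/* Demonstration: closing the same fd number twice yields EBADF on the
 * second attempt. Returns 1 if that is what happened. */
int demo_double_close(void) {
        int p[2];
        if (pipe(p) < 0)
                return -errno;
        int first = checked_close(p[0]);
        int second = checked_close(p[0]);  /* already closed */
        checked_close(p[1]);
        return first == 0 && second == -EBADF;
}
```

In a multi-threaded process the second close could instead hit an unrelated fd that reused the number, which is why a stray close in another thread is so hard to attribute.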

poettering commented 2 months ago

so yeah, all three backtraces that carry symbols posted here show an fd issue, and in different pieces of our code. Hence I strongly doubt this is a systemd problem, but rather a firefox fd double close issue or something similar.

gcp commented 2 months ago

I agree, especially given that it's totally different stacks entering different systemd functions and ~~we definitely do fiddle with closing fds (e.g. after fork)~~.

gcp commented 2 months ago

That said, both crashes are in the parent (which removes errors after forking as a potential cause), and the second crash seems to be in libudev functions that open the erroneous fd itself.

Looking through the code, I see: https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-device/sd-device.c#L148C37-L148C39

Which seems to indicate any (unhandled?) error condition within this would cause exactly the crash seen. So I'm not so sure anymore it's a Firefox bug.

poettering commented 2 months ago

> I agree, especially given that it's totally different stacks entering different systemd functions and we definitely do fiddle with closing fds (e.g. after fork).

hmm, what does ffox do regarding forking? i hope you are not forking and expect libc NSS to still work in the child after fork() before execve()?

gcp commented 2 months ago

Forking isn't relevant here, the crashes are all in the parent process.

poettering commented 2 months ago

> Looking through the code, I see: https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-device/sd-device.c#L148C37-L148C39

> Which seems to indicate any (unhandled?) error condition within this would cause exactly the crash seen. So I'm not so sure anymore it's a Firefox bug.

the cleanup handler attached to that fd var is a NOP if the fd is negative. hence we initialize the fd to -EBADF, so that until the fd is actually initialized the cleanup handler has no effect.

it's a general pattern in our codebase: we use gcc cleanup handlers, and these cleanup handlers all are graceful so that "unset" (i.e. null in case of pointers, or < 0 in case of fds) variables result in NOP cleanup handling.

Or in other words: you hit the assert here not because of the EBADF assignment, but because of an assignment >= 0 further down.
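A minimal sketch of the cleanup-handler pattern described above, using the GCC/Clang `cleanup` attribute. The names `closep_sketch`, `CLEANUP_CLOSE`, `open_devnull` and the `close_calls` counter are hypothetical and exist only for illustration; this is not the systemd implementation.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int close_calls = 0;  /* counts real close() calls, for illustration only */

/* Cleanup handler: receives a pointer to the variable going out of scope.
 * Negative values mean "unset", so nothing happens. */
void closep_sketch(int *fd) {
        if (*fd >= 0) {
                close(*fd);
                close_calls++;
        }
}

#define CLEANUP_CLOSE __attribute__((cleanup(closep_sketch)))

/* If fail_early is nonzero we return before fd is ever assigned, so the
 * handler sees -EBADF and does nothing. Otherwise the handler closes the
 * fd when the function returns. */
int open_devnull(int fail_early) {
        CLEANUP_CLOSE int fd = -EBADF;

        if (fail_early)
                return -EINVAL;

        fd = open("/dev/null", O_RDONLY);
        if (fd < 0)
                return -errno;
        return 0;
}
```

Calling `open_devnull(1)` leaves `close_calls` unchanged, while `open_devnull(0)` increments it once, which mirrors the point that only an assignment >= 0 can lead to a real `close()`.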

gabrielesvelto commented 2 months ago

I haven't looked at the libnss_resolve crash yet but I have looked at the udev one. The assertion is being triggered within a single call-chain that is completely contained in libudev, the file descriptor is being opened and closed there. To trigger this issue with Firefox' code we'd need another thread closing a file descriptor it doesn't own while that function is running.

I find it rather unlikely especially given that most of our file descriptor usage is either done via RAII C++ classes or Rust code that tracks file descriptor ownership. There is some bare file descriptor manipulation too but it mostly happens in code that's external to Firefox (e.g. mesa), but even then I find it unlikely that some code would close a file descriptor it doesn't own, and do so in a time window so short to be able to race that particular udev function.

Reading libudev's code I found myself in the chase() function which is extremely large and complex. Could it be returning a file descriptor that has already been closed in the ret_fd parameter? I tried following the function but it's so large I am unable to manually check all the possible paths.

[edit] Fixed a typo, I had mixed the library name with our NSS

gcp commented 2 months ago

varlink_clear() also ends up doing things like this:


        close_many(v->input_fds, v->n_input_fds);
        v->input_fds = mfree(v->input_fds);
        v->n_input_fds = 0;
        ...
        close_many(v->output_fds, v->n_output_fds);
        v->output_fds = mfree(v->output_fds);
        v->n_output_fds = 0;

        close_many(v->pushed_fds, v->n_pushed_fds);
        v->pushed_fds = mfree(v->pushed_fds);
        v->n_pushed_fds = 0;

So it's not inconceivable an fd ending up on two of those lists somehow would trigger this error. I mean, I'll be the last to say Firefox is free of bugs, but after looking closer it may well be that it's simply the amount of usage that exposes two similar bugs outside our own code (that crash in the same guard assertion).
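The hypothesis can be illustrated with a small sketch: if one fd number sits on two lists, the second pass over it fails with EBADF. `close_all_count_ebadf` is a hypothetical stand-in for a `close_many()`-style loop, not the systemd function, and the list names are borrowed from the excerpt purely for readability.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for a close_many()-style loop: closes every fd in
 * the array and returns how many close() calls failed with EBADF. */
size_t close_all_count_ebadf(const int *fds, size_t n) {
        size_t bad = 0;
        for (size_t i = 0; i < n; i++)
                if (fds[i] >= 0 && close(fds[i]) < 0 && errno == EBADF)
                        bad++;
        return bad;
}

/* Puts the same fd on two lists and closes both lists in turn. */
size_t demo_fd_on_two_lists(void) {
        int p[2];
        if (pipe(p) < 0)
                return (size_t) -1;

        int input_fds[]  = { p[0], p[1] };
        int output_fds[] = { p[1] };       /* p[1] is listed twice overall */

        size_t bad = close_all_count_ebadf(input_fds, 2);
        bad += close_all_count_ebadf(output_fds, 1);
        return bad;
}
```

Whether an fd can actually end up on two of the varlink lists is not established in this thread; the sketch only shows what would happen if it did.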

gabrielesvelto commented 2 months ago

Adding another datapoint in case it helps. We keep crash reports for six months, and we have had this type of crash for the entire period. So the crash is unlikely to be a recent regression either in Firefox or in systemd's libraries.

Focusing on the libnss_resolve crash, practically all the crashes that we have on file appear to be using systemd 255. All the crashes also originate on Arch and Fedora, which AFAIK are the only distros shipping systemd 255, so that makes sense. There seems to be a strong correlation between these crashes and systemd version 255, otherwise we'd see them on Debian and Ubuntu too, which have far more users. I couldn't find crashes with earlier versions but I'll keep looking.

gabrielesvelto commented 2 months ago

One final datapoint: the oldest libnss_resolve crash I could find is this one. This is for a Flatpak packaged version and contains this mapping:

/usr/lib/x86_64-linux-gnu/libsystemd.so.0.37.0

From what I can tell this should correspond to systemd 254.

poettering commented 2 months ago

Well, we haven't gotten any reports about the NSS module in a long time, even though it is loaded into so many programs. And the few times an issue was reported it turned out to be something about malloc interposition and not a bug on our side. Hence this time again I'd guess it's not our fault here.

> RAII C++ classes

well, in systemd we exclusively process fds via gnu c cleanup handlers and move them around via TAKE_FD() which means we are as close to C++ RAII as you possibly could get in C. Hence, in our tree such fd issues are not unheard of but very unlikely IRL talking from experience.
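For readers unfamiliar with the idiom, a `TAKE_FD()`-style move can be sketched as below. It is written as a plain function with a hypothetical name, `take_fd_sketch`, rather than as systemd's macro: it hands the fd to the caller and leaves `-EBADF` behind, so a cleanup handler attached to the source variable no longer closes it.

```c
#include <errno.h>

/* Sketch of a TAKE_FD()-style ownership transfer: return the current
 * value and mark the source variable as unset. */
int take_fd_sketch(int *fd) {
        int taken = *fd;
        *fd = -EBADF;
        return taken;
}
```

After the call exactly one variable holds the fd, which is what keeps a double close from arising within a single call chain.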

poettering commented 2 months ago

can you reproduce the issue? maybe strace the thing?

gabrielesvelto commented 2 months ago

I can try to reproduce it, let's see if I can get an STR.

gabrielesvelto commented 2 months ago

I can't reproduce the crash but I have found something a lot more interesting: all of these crashes are happening on tainted kernels. 100% of them are using Nvidia closed-source drivers. We've been assuming that this is a double close(), but what if the error returned by the failing close() call isn't EBADF? What if it's returning some other error while closing a pseudo-file that the Nvidia kernel modules are placing somewhere under a pseudo-filesystem? The reason why I suspect this is that several of these crashes involve scanning files under /sys (see the crashes in libudev). I don't know how this could trigger the getaddrinfo() crashes but it can't be an accident that 100% of them are using Nvidia closed drivers and they're hitting the same issue.

gcp commented 2 months ago

From my reading of the systemd code, it only asserts on EBADF, not on other errors.

poettering commented 3 weeks ago

dropping from the milestone, because there are indications this might be a problem with the nvidia binary driver, that just surfaces here, i.e. they close our fd and really shouldn't.
