google / tcmalloc

Apache License 2.0
4.41k stars 480 forks source link

tcmalloc crash when /sys/devices/system/cpu/possible is sandboxed away #245

Closed jonmeow closed 3 months ago

jonmeow commented 3 months ago

We're trying to use tcmalloc in a sandboxed environment that hides /sys. This causes a crash in NumPossibleCPUsNoCache when trying to read "/sys/devices/system/cpu/possible".

It looks like tcmalloc used to use absl's implementation, which I think was calling std::thread::hardware_concurrency (here). What do you think of adding that as a fallback when the /sys info is unavailable?

For context, the specific crash is:

1 external/tcmalloc~/tcmalloc/internal/sysinfo.cc:124] CHECK in NumPossibleCPUsNoCache: cpus.has_value() (false) 
Program terminated with signal: SIGSEGV
Compiler returned: 139

The specific environment is compiler-explorer, so you can look at what's available pretty easily, e.g. here's some relevant Python queries: https://python.compiler-explorer.com/z/jT6baYxba

std::thread::hardware_concurrency returns 2 in this environment (https://cpp.compiler-explorer.com/z/r4dfbxxMs)

The issue reported to us is https://github.com/carbon-language/carbon-lang/issues/4176. It's temporarily reproducible at https://carbon.compiler-explorer.com/z/Ecfjh7bhs, but we'll be trying to fix that promptly and then I think the Python check above is the better reference.

ckennelly commented 3 months ago

TCMalloc needs NumPossibleCPUsNocache to be non-allocating, since we use it to size some data structures during our very first allocation.

In the past, we've run into issues where std::thread::hardware_concurrency might allocate depending on C++ stdlib/libc implementation details. For example, libc++'s implementation seems to call sysconf in many cases, which is async-unsafe for heap allocations.

If you know where sysconf is sourcing the information from, it'd certainly be possible to directly fetch it in a way careful to not allocate.

jonmeow commented 3 months ago

In glibc, AFAICT sysconf calls get_nprocs (here), which calls get_nproc_fallback(here), which reads /proc/stat (here).

That said, get_nprocs is marked as: Preliminary: | MT-Safe | AS-Safe | AC-Safe fd. Would that be an okay choice, avoiding a direct read of /proc/stat?

ckennelly commented 3 months ago

I think we need the _SC_NPROCESSORS_CONF case, since offlined CPUs can cause problems (see issue #188 ), which reads from /sys as well: https://github.com/bminor/glibc/blob/32328a5a1461ff88c0b1e04954e9c68b3fa7f56d/sysdeps/unix/sysv/linux/getsysstats.c#L231

jonmeow commented 3 months ago

To confirm, are you okay with get_nprocs_conf even though it's documented as unsafe? (note, this may reflect out-of-date documentation, it looks like the implementation changed a couple years ago)

Note, I'm happy to send you a change if you're okay with one of:

  1. Use get_nprocs_conf in place of current code (since it reads "possible")
  2. Keep the current code, but use get_nprocs as a fallback (since it's marked as safe)
  3. Add a read of /proc/stat similar to the get_nprocs_fallback code (since get_nprocs_fallback isn't an API)
ckennelly commented 3 months ago

Let me clarify:

I'm not sure if there is an alternative to reading /sys that provides the number of possible CPU IDs that might be available.

jonmeow commented 3 months ago

We're going to see if we can get an exception in the compiler-explorer side, since it's not readily apparent whether the /proc fallback done by glibc provides a compatible alternative for /sys/devices/system/cpu/possible.