google / sanitizers

AddressSanitizer, ThreadSanitizer, MemorySanitizer

TSAN crash while creating a vulkan device with no stack trace on linux #1678

Open cenomla opened 1 year ago

cenomla commented 1 year ago

My program crashes (SIGSEGV) on a memory read inside libnvidia-glcore.so when calling vkCreateDevice, even with a suppressions file that tries to ignore calls from that library.

Stack trace:

#0  0x00007ffff6e90456 in __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__tsan::AP64> >::Allocate (class_id=4, allocator=0x7ffff6f55c00 <__tsan::allocator_placeholder>, this=0x8)
    at /usr/src/debug/gcc/gcc/libsanitizer/sanitizer_common/sanitizer_allocator_local_cache.h:38
#1  __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__tsan::AP64>, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate (this=this@entry=0x7ffff6f55c00 <__tsan::allocator_placeholder>, cache=0x8, 
    size=<optimized out>, size@entry=56, alignment=alignment@entry=16) at /usr/src/debug/gcc/gcc/libsanitizer/sanitizer_common/sanitizer_allocator_combined.h:69
#2  0x00007ffff6e8dad2 in __tsan::user_alloc_internal (thr=thr@entry=0x7fef60179f00, pc=140737335538724, sz=sz@entry=56, align=align@entry=16, signal=signal@entry=true) at /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_rtl.h:216
#3  0x00007ffff6e8dd4e in __tsan::user_calloc (thr=thr@entry=0x7fef60179f00, pc=<optimized out>, size=size@entry=1, n=n@entry=56) at /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_mman.cpp:230
#4  0x00007ffff6e4342a in __interceptor_calloc (size=1, n=56) at /usr/src/debug/gcc/gcc/libsanitizer/tsan/tsan_interceptors_posix.cpp:702
#5  0x00007fefaa0f7aae in ?? () from /usr/lib/libnvidia-glcore.so.535.54.03
#6  0x00007fefaa11335a in ?? () from /usr/lib/libnvidia-glcore.so.535.54.03
#7  0x00007fefaa11218e in ?? () from /usr/lib/libnvidia-glcore.so.535.54.03
#8  0x00007ffff6c9d44b in ?? () from /usr/lib/libc.so.6
#9  0x00007ffff6d20e40 in ?? () from /usr/lib/libc.so.6

Suppressions file:

called_from_lib:libnvidia-tls.so
called_from_lib:libGLX_nvidia.so
called_from_lib:libnvidia-glcore.so.535.54.03
called_from_lib:libnvidia-glsi.so
called_from_lib:libnvidia-glvkspriv.so
called_from_lib:libvulkan.so
called_from_lib:libVkLayer_khronos_validation.so
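For anyone reproducing this, the same list can also be compiled into the binary instead of being passed via TSAN_OPTIONS=suppressions=<file>. The sketch below assumes the runtime in use honours the weak __tsan_default_suppressions() hook; the helper suppression_listed is hypothetical and only there to make the list checkable. It is another way of supplying the same suppressions and is not expected to change the crash.

```cpp
#include <cassert>
#include <cstring>

// Hook consulted by the TSan runtime at startup for built-in suppressions,
// one entry per line. In a build without -fsanitize=thread this is just an
// ordinary function that nothing calls.
extern "C" const char* __tsan_default_suppressions() {
  return
      "called_from_lib:libnvidia-tls.so\n"
      "called_from_lib:libGLX_nvidia.so\n"
      "called_from_lib:libnvidia-glcore.so.535.54.03\n"
      "called_from_lib:libnvidia-glsi.so\n"
      "called_from_lib:libnvidia-glvkspriv.so\n"
      "called_from_lib:libvulkan.so\n"
      "called_from_lib:libVkLayer_khronos_validation.so\n";
}

// Hypothetical helper (not part of TSan): reports whether an entry is present
// in the embedded list.
bool suppression_listed(const char* entry) {
  assert(entry != nullptr);
  return std::strstr(__tsan_default_suppressions(), entry) != nullptr;
}
```
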

My program runs fine under ASAN. TSAN ends up mmapping 124T of memory before crashing, but my ulimit is set to unlimited for the current shell, so I don't think that's the issue. The program was compiled with gcc 13.1.1.

andmoos commented 1 year ago

+1

andmoos commented 9 months ago

I had to let TSan lie dormant for a while, but I have now come back to it:

The problem at __interceptor_calloc (and potentially other interceptors) originates from ThreadState *thr = cur_thread_init() in the SCOPED_INTERCEPTOR_RAW macro.

The ThreadState for the thread in this stack, which was created by the NVIDIA library (?), is not "initialized": its thread-local block is used with all-null values. It therefore never had a Processor set, and the lookup returns nullptr. The implementation in tsan_interceptors_posix.cpp does not check or assert for this case and crashes on the first access to the Processor.
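To make the failure mode concrete, here is a minimal toy model. It is not TSan's actual code: ToyThreadState, ToyProcessor, toy_thread_start and toy_calloc are hypothetical names, and the fallback path is only one conceivable way of guarding against an unregistered thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Toy stand-ins for illustration only; these are NOT the real TSan types.
struct ToyProcessor {
  std::size_t allocations = 0;  // stands in for the per-processor allocator cache
};

struct ToyThreadState {
  bool is_inited = false;
  ToyProcessor* proc = nullptr;  // stays null unless thread start was observed
};

// Each thread gets its own copy, initially "all null" as described above.
thread_local ToyThreadState toy_thread_state;

// What an intercepted thread start would arrange for the new thread.
void toy_thread_start(ToyProcessor* proc) {
  assert(proc != nullptr);
  toy_thread_state.proc = proc;
  toy_thread_state.is_inited = true;
}

// A calloc wrapper with the missing check added: a thread whose start was
// never observed takes a plain calloc path instead of touching a null proc.
void* toy_calloc(std::size_t n, std::size_t size, bool* used_fallback) {
  ToyThreadState& thr = toy_thread_state;
  if (!thr.is_inited || thr.proc == nullptr) {
    *used_fallback = true;
    return std::calloc(n, size);
  }
  *used_fallback = false;
  ++thr.proc->allocations;  // without the guard, this is the null access
  return std::calloc(n, size);
}
```

Calling toy_calloc on a thread that never went through toy_thread_start takes the fallback path; removing the guard turns that call into a null-pointer access, which mirrors the reported SIGSEGV.
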

I tried to lazily initialize the ThreadState in the SCOPED_INTERCEPTOR_RAW macro but did not understand enough of the codebase to succeed right away. I tried using a ScopedProcessor in this case, which got me past the crash in calloc, but it later failed in free because by then the Processor was a different scoped one (?).

As far as I can see the ScopedInterceptor calls if (!thr_->is_inited) return; in its constructor so early that it is effectively bypassing possible ignore logic. But ignoring may or may not be of any help here anyway...

dvyukov commented 9 months ago

The intention is that sanitizer runtimes are initialized before any other code. IIRC on Linux we should try to initialize from .preinit_array. Not sure why glcore.so would run earlier; I would assume it's not doing anything this low-level.
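For readers unfamiliar with .preinit_array, the following is a small sketch of the mechanism on Linux/ELF with GCC or Clang. It is not the sanitizer runtime's own code; all names are hypothetical, and it assumes the code is linked into the main executable, since .preinit_array entries are not honoured in shared objects.

```cpp
#include <cassert>

static int g_step = 0;          // incremented by each hook as it runs
static int g_preinit_step = 0;  // step at which the preinit hook ran
static int g_ctor_step = 0;     // step at which the ordinary constructor ran

// Runs from .preinit_array, before the executable's own constructors.
static void early_init(int, char**, char**) { g_preinit_step = ++g_step; }

// Place a pointer to early_init into the .preinit_array section.
__attribute__((section(".preinit_array"), used))
static void (*early_init_entry)(int, char**, char**) = early_init;

// An ordinary constructor, which runs later from .init_array.
__attribute__((constructor))
static void ordinary_ctor() { g_ctor_step = ++g_step; }

// True if the preinit hook ran, and ran before the ordinary constructor.
bool preinit_ran_first() {
  return g_preinit_step != 0 && g_preinit_step < g_ctor_step;
}

// Convenience check for use at the top of main().
void check_preinit_order() { assert(preinit_ran_first()); }
```

This only orders start-up code in the main executable. It says nothing about threads that a library creates later at runtime.
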

andmoos commented 9 months ago

Part of the whole vulkan infrastructure here is a dynamic loader for layers and the actual driver. It should not be running as early but maybe the (potentially nested) dlopen()/dlsym() may cause these issues?

andmoos commented 9 months ago

I put together a small gist that reproduces the issue. Tested on Arch Linux 2024-01-19 with NVIDIA driver 545.29.06 and clang 16.0.6.

andmoos commented 9 months ago

I now also reported this to NVIDIA.

andmoos commented 8 months ago

Update: NVIDIA said they can reproduce this and the vulkan engineering team will be investigating this.

michaelkorenchan commented 7 months ago

Hello, I'm an NVIDIA engineer investigating this bug.

The issue arises because malloc/calloc/free calls in the driver are correctly intercepted by TSAN but pthread_* calls are not. The NVIDIA driver is built with older glibc headers, which we need in order to maintain support for older linux distributions. Because of this, the driver picks up old versions of libpthread that TSAN does not interpose. Conversely malloc/calloc/free only have one version, so interposition is unhindered. When a pthread created without TSAN's pthread_create interceptor enters TSAN's calloc interceptor, that ThreadState struct mentioned above is uninitialized, resulting in the segfault.
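One way to check which symbol versions a given process can resolve is glibc's dlvsym. The sketch below is only a probing aid under stated assumptions: has_symbol and has_symbol_version are hypothetical helpers, the version strings worth querying depend on architecture and glibc release, and older glibc versions need -ldl at link time.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // dlvsym and RTLD_DEFAULT are GNU extensions
#endif
#include <dlfcn.h>

#include <cassert>

// True if `name` resolves at all in the global symbol scope.
bool has_symbol(const char* name) {
  assert(name != nullptr);
  return dlsym(RTLD_DEFAULT, name) != nullptr;
}

// True if `name` resolves with exactly the given symbol version string.
bool has_symbol_version(const char* name, const char* version) {
  assert(name != nullptr && version != nullptr);
  return dlvsym(RTLD_DEFAULT, name, version) != nullptr;
}
```

For example, on x86-64 one could compare has_symbol_version("pthread_create", "GLIBC_2.2.5") with the result for a newer version string. Which versions exist on a particular system is not something this sketch assumes.
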

I'm wondering what can be done on the TSAN side to account for this.

vitalybuka commented 7 months ago

CC @thurstond

rgriebl commented 3 months ago

Any update on this? TSAN is pretty much unusable for any application using OpenGL or Vulkan on NVIDIA right now.

thurstond commented 3 months ago

@michaelkorenchan

> The issue arises because malloc/calloc/free calls in the driver are correctly intercepted by TSAN but pthread_* calls are not. The NVIDIA driver is built with older glibc headers, which we need in order to maintain support for older linux distributions. Because of this, the driver picks up old versions of libpthread that TSAN does not interpose. Conversely malloc/calloc/free only have one version, so interposition is unhindered. When a pthread created without TSAN's pthread_create interceptor enters TSAN's calloc interceptor, that ThreadState struct mentioned above is uninitialized, resulting in the segfault.
>
> I'm wondering what can be done on the TSAN side to account for this.

Thanks for the investigation! That's a clear and concise explanation.

Even if TSan were updated to correctly intercept all the legacy pthread_* calls, it could still fail to observe some synchronization primitives such as inlined atomics. TSan in general isn't guaranteed to play nicely with uninstrumented libraries (i.e., libraries that haven't been recompiled with TSan). [1]

Rather than fixing TSan to intercept the legacy pthread_* calls - which is only a partial fix - would it be feasible for NVIDIA to ship Vulkan libraries that are compiled with TSan (and the latest version of glibc)? (edit: not as a default, but as an optional download)


[1] https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual 'ThreadSanitizer generally requires all code to be compiled with -fsanitize=thread. If some code (e.g. dynamic libraries) is not compiled with the flag, it can lead to false positive race reports, false negative race reports and/or missed stack frames in reports depending on the nature of non-instrumented code. ... There are some precedents of making ThreadSanitizer work with non-instrumented libraries. Success of this highly depends on what exactly these libraries are doing.'