Open cenomla opened 1 year ago
+1
Had to let tsan lay dormant for the time being but now I have to come back to it eventually:
The problem at __interceptor_calloc (and potentially others) originates from the ThreadState *thr = cur_thread_init() in macro SCOPED_INTERCEPTOR_RAW.
The ThreadState in this stack created by the Nvidia library (?) is not "initialized" and the thread local block is used with all null values. Thus it has never had a Processor set and returns nullptr. The implementations in tsan_interceptors_posix.cpp do not check/assert for this case and crash on the first access to the Processor.
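To make the failure mode concrete, here is a minimal sketch of a zeroed thread-local state whose processor pointer is null, plus an interceptor-like function that checks for that case instead of dereferencing it. All names (ThreadStateModel, ProcessorModel, calloc_model) are hypothetical stand-ins, not TSan's real types, and the fallback shown is only an illustration, not a proposed fix for the runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical, simplified stand-ins for the runtime's types.
struct ProcessorModel {
  std::size_t alloc_count = 0;
};

struct ThreadStateModel {
  bool is_inited = false;
  ProcessorModel *proc = nullptr;  // stays null on threads the runtime never set up
};

// Zero-initialized thread-local block, like the one described above.
thread_local ThreadStateModel g_thread_state;

enum class AllocPath { Tracked, Untracked };

// Model of an allocation interceptor that checks for a null processor
// before touching it, and reports which path it took.
void *calloc_model(std::size_t n, std::size_t size, AllocPath *path) {
  ThreadStateModel *thr = &g_thread_state;
  if (thr->proc == nullptr) {
    // Uninitialized thread: skip the bookkeeping instead of crashing.
    *path = AllocPath::Untracked;
    return std::calloc(n, size);
  }
  thr->proc->alloc_count++;
  *path = AllocPath::Tracked;
  return std::calloc(n, size);
}
```

Calling calloc_model on a thread whose state was never set up takes the Untracked path. Once proc points at a ProcessorModel, the same call takes the Tracked path and bumps alloc_count.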
I tried to "lazily" initialize the ThreadState in the SCOPED_INTERCEPTOR_RAW macro but did not understand enough of the codebase to succeed right away. I tried using a ScopedProcessor in this case, which got me past the crash on calloc, but it later failed on free, as the Processor was then a different scoped one (?).
As far as I can see the ScopedInterceptor calls if (!thr_->is_inited) return; in its constructor so early that it is effectively bypassing possible ignore logic. But ignoring may or may not be of any help here anyway...
The intention is that sanitizer runtimes are initialized before any other code. IIRC on Linux we should try to initialize from .preinit_array. Not sure why glcore.so would run earlier; I would assume it's not doing anything this low-level.
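As a rough illustration of what initializing from .preinit_array means, the sketch below places a function pointer in that ELF section so the loader calls it before ordinary constructors and before main. The names (runtime_model_init, g_runtime_model_inited) are hypothetical. It assumes a GCC- or Clang-compatible compiler and an ELF executable on Linux; other platforms fall back to a constructor attribute.

```cpp
#include <cassert>

// Flag set by the early-init hook; statically initialized to false.
static bool g_runtime_model_inited = false;

#if defined(__linux__) && defined(__ELF__)
// Hypothetical stand-in for a runtime's init entry point.
// glibc passes argc, argv and envp to .preinit_array entries.
static void runtime_model_init(int, char **, char **) {
  g_runtime_model_inited = true;
}

// Put a pointer to the init function into .preinit_array. Entries in this
// section are only honored in the main executable, not in shared objects.
__attribute__((section(".preinit_array"), used))
static void (*runtime_model_preinit)(int, char **, char **) = runtime_model_init;
#else
// Fallback for non-ELF targets: run as an ordinary constructor instead.
__attribute__((constructor)) static void runtime_model_init() {
  g_runtime_model_inited = true;
}
#endif

// Accessor so callers can check whether the early hook already ran.
bool runtime_model_is_inited() { return g_runtime_model_inited; }
```

By the time main starts, runtime_model_is_inited() should already return true. This only shows the ordering the comment relies on, not how the driver ends up running code on a thread the runtime has not set up.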
Part of the whole Vulkan infrastructure here is a dynamic loader for layers and the actual driver. It should not be running that early, but maybe the (potentially nested) dlopen()/dlsym() calls cause these issues?
I put together a small gist allowing to reproduce the issue. Tested on Arch Linux 2024-01-19 with NVIDIA driver 545.29.06 and clang 16.0.6.
Update: NVIDIA said they can reproduce this and the vulkan engineering team will be investigating this.
Hello, I'm an NVIDIA engineer investigating this bug.
The issue arises because malloc/calloc/free calls in the driver are correctly intercepted by TSAN but pthread_* calls are not. The NVIDIA driver is built with older glibc headers, which we need in order to maintain support for older Linux distributions. Because of this, the driver picks up old versions of libpthread that TSAN does not interpose. Conversely, malloc/calloc/free only have one version, so interposition is unhindered. When a pthread created without TSAN's pthread_create interceptor enters TSAN's calloc interceptor, the ThreadState struct mentioned above is uninitialized, resulting in the segfault.
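The mismatch described here can be modeled without the driver. In this sketch one thread is started through a wrapper that sets up per-thread state first (standing in for an intercepted pthread_create), and one is started without it (standing in for a creation call that bypasses the interceptor). Both then enter the same interceptor-like function. All names are hypothetical, and std::thread is used only to keep the sketch portable.

```cpp
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical per-thread runtime state; zero-initialized for every new thread.
struct TlsStateModel {
  bool is_inited = false;
};
thread_local TlsStateModel g_tls_state;

// Stand-in for what an allocation interceptor would observe on entry.
bool interceptor_sees_initialized_thread() { return g_tls_state.is_inited; }

// Models an intercepted thread creation: the runtime's setup runs on the
// new thread before the user-supplied body does.
bool run_via_intercepted_create(const std::function<bool()> &body) {
  bool result = false;
  std::thread t([&] {
    g_tls_state.is_inited = true;  // per-thread setup done by the interceptor
    result = body();
  });
  t.join();
  return result;
}

// Models a creation path the runtime never sees: no per-thread setup at all.
bool run_via_bypassed_create(const std::function<bool()> &body) {
  bool result = false;
  std::thread t([&] { result = body(); });
  t.join();
  return result;
}
```

run_via_intercepted_create(interceptor_sees_initialized_thread) returns true, while the bypassed variant returns false. That is the state in which the real calloc interceptor then dereferences uninitialized data.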
I'm wondering what can be done on the TSAN side to account for this.
CC @thurstond
Any update on this? TSAN is pretty much unusable for any application using OpenGL or Vulkan on NVIDIA right now.
@michaelkorenchan
Thanks for the investigation! That's a clear and concise explanation.
Even if TSan were updated to correctly intercept all the legacy pthread_* calls, it could still fail to observe some synchronization primitives such as inlined atomics. TSan in general isn't guaranteed to play nicely with uninstrumented libraries (i.e., libraries that haven't been recompiled with TSan). [1]
Rather than fixing TSan to intercept the legacy pthread_* calls - which is only a partial fix - would it be feasible for NVIDIA to ship Vulkan libraries that are compiled with TSan (and the latest version of glibc)? (edit: not as a default, but as an optional download)
[1] https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual 'ThreadSanitizer generally requires all code to be compiled with -fsanitize=thread. If some code (e.g. dynamic libraries) is not compiled with the flag, it can lead to false positive race reports, false negative race reports and/or missed stack frames in reports depending on the nature of non-instrumented code. ... There are some precedents of making ThreadSanitizer work with non-instrumented libraries. Success of this highly depends on what exactly these libraries are doing.'
My program crashes (SIGSEGV) on a memory read inside of libnvidia-glcore.so when calling vkCreateDevice, even when using a suppressions file that tries to ignore calls from that lib.
Stack trace:
Suppressions file:
My program runs fine under ASAN. TSAN ends up mmapping 124T of memory before crashing, but my ulimit is set to unlimited for the current shell, so I don't think that's the issue. The program was compiled using gcc 13.1.1.