Open jason-infra opened 1 year ago
Standalone command to repro this:
#!/bin/bash
set -e
# Things you must change:
LIB_ASAN=/usr/lib/x86_64-linux-gnu/libasan.so.5
LIB_CUDART=cuda-11.2/targets/x86_64-linux/lib/libcudart.so.11.0
NVCC_PATH=/home/michael.johnson/cuda-11.2/bin/nvcc
CLANG10_PATH=/home/michael.johnson/clang/bin/clang
# End of things you must change.
# Create .h
cat <<EOT >> hello.h
void call_hello();
EOT
# Create .cu
cat <<EOT >> hello.cu
#include <cstdio>
__global__ void cuda_hello(){
printf("Hello World from GPU!\n");
}
void call_hello() {
cuda_hello<<<1,1>>>();
}
EOT
# Create .cpp
cat <<EOT >> hello_bin.cpp
#include "hello.h"
int main() {
call_hello();
return 1;
}
EOT
mkdir -p needed_libs/
cp $LIB_ASAN needed_libs/
cp $LIB_CUDART needed_libs/
$NVCC_PATH --objdir-as-tempdir --compiler-options "-fPIC" --compiler-bindir=$CLANG10_PATH -x cu -O2 -c hello.cu -o hello.pic.o
$CLANG10_PATH -shared -o libhello.so hello.pic.o -fsanitize=address -stdlib=libstdc++ -Lneeded_libs -lstdc++
cp libhello.so needed_libs/
$CLANG10_PATH -fPIC -nostdinc '-std=c++17' -nostdinc++ -c hello_bin.cpp -o hello_bin.pic.o
$CLANG10_PATH -o hello_bin -Lneeded_libs hello_bin.pic.o -lhello -l:libcudart.so.11.0 -l:libstdc++.so.6 -pie -fsanitize=address -fuse-ld=gold -stdlib=libstdc++ -lstdc++
LD_LIBRARY_PATH=needed_libs/ ./hello_bin 2>&1 | head
It seems the problem only happens when you use both clang and gold ld. I also encountered this issue on gcc, but it's weird I couldn't reproduce it now . As a temporary solution, you could try gcc or other ld.
Reproduce:
clang demo.cc -fsanitize=address -I/usr/local/cuda/include -lcuda -o demo -fuse-ld=gold
Remove -fuse-ld=gold
the program would work well.
// demo.cc
#include <cuda.h>
int main() {
cuInit(0);
return 0;
}
It's caused by duplicated invoking of InitializeShadowMemory
.
First invoking is before main()
, second is before cuInit(0)
.
The memory address of variable asan_inited
is different between the twice invoking, so the program consider asan uninitialized and call InitializeShadowMemory
in the second time. Then it try to allocate shadow memory on the same address i.e. 0x00007fff7000-0x10007fff7fff
and cause error.
But I'm still confused why there are two asan_inited
object, and why cuda driver & ld could effect it.
I found the root cause, there is below code in libcuda.so
Dl_info attr[2];
dladdr((void*)&pthread_join, attr);
dlopen(attr[0].dli_fname, 1);
So the program try to dlopen itself, and dlopen pie file is undefined behavior.
I have an ASAN test suite that gives me the following error:
I am running
IMPORTANTLY, All Asan test pass with no errors if I run the ASAN tests using nvidia 470 drivers:
Other Info
I have kept all variables the same, including Asan testsuite, OS, kernel, nvidia gpu t4. The only difference being the Nvidia drivers.
I do not believe -fsanitize=address is related in #856, because the test suite runs flawlessly with Nvidia 470 drivers.
All my other tests pass, including tsan.
Why am I still getting the "Shadow memory range interleaves" error?