llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Incorrect data sharing analysis leads to kernel crash #51451

Open 8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 opened 3 years ago

8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 commented 3 years ago
Bugzilla Link 52109
Version unspecified
OS All
CC @jhuber6,@jdoerfert,@shiltian

Extended Description

https://godbolt.org/z/4o7fbPbYW

First, the compiler incorrectly reports data sharing. These stack variables are not shared between threads.

example.cpp:820:62: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  ColorSpinor<typename Arg::realIn, Arg::nColor, Arg::nSpin> in = arg.in(x_cb, (parity+arg.inParity)&1);
  ^
example.cpp:821:63: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  ColorSpinor<typename Arg::realOut, Arg::nColor, Arg::nSpin> out;
  ^
example.cpp:665:8: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  real v[length];
  ^
example.cpp:682:8: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  real v[length];
  ^

Second, the generated code crashes in kernels, with

CUDA error: an illegal memory access was encountered

Closer inspection (debugging by commenting out code) shows that some of the accesses to the above variables cause the illegal memory access.
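For readers without access to the Godbolt link, the kind of code that typically draws this remark can be sketched as follows. This is not taken from the reproducer: `sum3` and `scale_rows` are hypothetical names, and the assumption is that the remark appears when the address of a thread-local stack variable is passed to a function the device compiler cannot fully analyze or inline. Whether a given compiler version actually emits OMP112 for this exact code is not verified here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: receives a pointer to the caller's local array.
// If the device compiler cannot see through this call, it may be unable
// to prove that the pointer never leaves the calling thread.
double sum3(const double *v) { return v[0] + v[1] + v[2]; }

// Hypothetical kernel-like function: every iteration owns a private
// stack array `v`, but its address escapes into sum3().
void scale_rows(const double *in, double *out, std::size_t n) {
  assert(in != nullptr && out != nullptr);
#pragma omp target teams distribute parallel for \
    map(to : in[0 : 3 * n]) map(from : out[0 : n])
  for (std::size_t i = 0; i < n; ++i) {
    double v[3]; // thread-local by intent
    for (int k = 0; k < 3; ++k)
      v[k] = in[3 * i + k];
    out[i] = sum3(v); // the address of v escapes here
  }
}
```

Built without OpenMP support, the pragma is ignored and the loop runs sequentially on the host, which is enough to check the arithmetic. The globalization question only arises when the loop is offloaded.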

8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 commented 3 years ago

I found a way to convince the compiler that these stack variables are thread private.

diff --git a/tperf-copy.cc b/tperf-copy.cc
index 2df18a0..d70d1b2 100644
--- a/tperf-copy.cc
+++ b/tperf-copy.cc
@@ -662,6 +662,7 @@ struct FloatNOrder {
   inline void load(complex out[length / 2], int x, int parity = 0) const {
     real v[length];

This gets the code to compile and run with acceptable performance.

8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 commented 3 years ago

> the data sharing stack falls back to malloc which will just return a nullptr if the device runs out of global memory and crash.

This seems to be the case.

> You should be able to increase the heap size with LIBOMPTARGET_HEAP_SIZE=<N>

The code runs once the heap size is set large enough. As expected, it also runs extremely slowly.

jhuber6 commented 3 years ago

Is this a real illegal access, or is the kernel just running out of memory? For NVPTX, the data sharing stack falls back to malloc, which just returns a nullptr if the device runs out of global memory, and the kernel then crashes. You should be able to increase the heap size with LIBOMPTARGET_HEAP_SIZE=<N>, but I haven't used that option in a while. Globalization can easily blow up global memory usage if it's not removed, since the memory is allocated per thread and there are a lot of threads. You can also use -fopenmp-cuda-mode to bypass globalization.
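To get a feel for why per-thread globalization can exhaust the device heap, a back-of-the-envelope estimate helps. The sketch below assumes the worst case in which every thread of every team holds its globalized variables at the same time. The helper `globalized_bytes` and the launch numbers are illustrative only and do not come from the reproducer.

```cpp
#include <cstddef>

// Rough upper bound on the global memory needed when each thread's
// globalized variables are allocated separately and all live at once.
constexpr std::size_t globalized_bytes(std::size_t teams,
                                       std::size_t threads_per_team,
                                       std::size_t bytes_per_thread) {
  return teams * threads_per_team * bytes_per_thread;
}

// Illustrative numbers only: 1024 teams of 256 threads, each thread
// globalizing 4 KiB of local arrays.
constexpr std::size_t kExampleBytes = globalized_bytes(1024, 256, 4096);
static_assert(kExampleBytes == (std::size_t{1} << 30),
              "this example needs 1 GiB of device heap");
```

An estimate like this gives a starting point for choosing N in LIBOMPTARGET_HEAP_SIZE=<N>. The libomptarget documentation describes that value as a byte count, which is worth confirming for the version in use.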

jdoerfert commented 3 years ago

We need to investigate our shared stack; it does not seem to work. We also need to increase the heap-2-stack threshold for GPUs by a lot, probably by setting it to -1 through the driver.

The rest should be taken care of with Attributor enhancements very soon.