8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 commented 3 years ago


Bugzilla Link	52109
Version	unspecified
OS	All
CC	@jhuber6,@jdoerfert,@shiltian

Extended Description

https://godbolt.org/z/4o7fbPbYW

First, the compiler finds data sharing, which is incorrect. There's no sharing of these stack variables.

example.cpp:820:62: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  ColorSpinor<typename Arg::realIn, Arg::nColor, Arg::nSpin> in = arg.in(x_cb, (parity+arg.inParity)&1);
                                                             ^
example.cpp:821:63: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  ColorSpinor<typename Arg::realOut, Arg::nColor, Arg::nSpin> out;
                                                              ^
example.cpp:665:8: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  real v[length];
       ^
example.cpp:682:8: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  real v[length];
       ^

Second, the generated code crashes in kernels, with

CUDA error: an illegal memory access was encountered

Closer inspection (debugging by commenting out code) shows that some of the accesses to the above variables cause the illegal memory access.

8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 commented 3 years ago

I found a way to convince the compiler that these stack variables are thread private.

```
diff --git a/tperf-copy.cc b/tperf-copy.cc
index 2df18a0..d70d1b2 100644
--- a/tperf-copy.cc
+++ b/tperf-copy.cc
@@ -662,6 +662,7 @@ struct FloatNOrder {
      inline void load(complex out[length / 2], int x, int parity = 0) const {
        real v[length];
+#pragma omp allocate(v) allocator(omp_thread_mem_alloc)
        norm_type nrm = 0.0;
        // #pragma unroll
        for (int i=0; i<M; i++) {
@@ -679,6 +680,7 @@ struct FloatNOrder {
      inline void save(const complex in[length / 2], int x, int parity = 0) const {
        real v[length];
+#pragma omp allocate(v) allocator(omp_thread_mem_alloc)
        // #pragma unroll
        for (int i = 0; i < length / 2; i++) {
@@ -817,7 +819,9 @@ template struct CopyColorSpinor_ {
      inline void operator()(int x_cb, int parity) {
        ColorSpinor<typename Arg::realIn, Arg::nColor, Arg::nSpin> in = arg.in(x_cb, (parity+arg.inParity)&1);
+#pragma omp allocate(in) allocator(omp_thread_mem_alloc)
        ColorSpinor<typename Arg::realOut, Arg::nColor, Arg::nSpin> out;
+#pragma omp allocate(out) allocator(omp_thread_mem_alloc)
        typename Arg::Basis basis;
        basis(out.data, in.data);
        arg.out(x_cb, (parity+arg.outParity)&1) = out;
```

This gets the code to compile and run with acceptable performance.
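For readers who want to try the same workaround outside the full reproducer, here is a minimal standalone sketch. The names `kLength`, `scratch_sum`, and `fill_on_device` are hypothetical and not part of tperf-copy.cc; only the `#pragma omp allocate(...) allocator(omp_thread_mem_alloc)` line mirrors the diff above. It assumes a compiler that accepts the `allocate` directive in C++ (Clang with `-fopenmp`); without OpenMP enabled the pragmas are ignored and the code runs serially on the host.

```cpp
#ifdef _OPENMP
#include <omp.h>  // declares omp_thread_mem_alloc
#endif

constexpr int kLength = 8;  // hypothetical scratch size, stands in for `length`

// Hypothetical helper: fills a thread-local scratch array and reduces it.
// The allocate directive asks for thread-private storage for `v`, so the
// compiler does not need to globalize it.
inline double scratch_sum(int x) {
  double v[kLength];
#pragma omp allocate(v) allocator(omp_thread_mem_alloc)
  for (int i = 0; i < kLength; i++) v[i] = x + i;
  double s = 0.0;
  for (int i = 0; i < kLength; i++) s += v[i];
  return s;
}

// Hypothetical driver: calls the helper from an offloaded loop.
// Falls back to the host when no device (or no OpenMP) is available.
inline void fill_on_device(double *out, int n) {
#pragma omp target teams distribute parallel for map(from : out[0:n])
  for (int x = 0; x < n; x++) out[x] = scratch_sum(x);
}
```

Calling `fill_on_device(out, n)` on a buffer of `n` doubles fills each entry with `scratch_sum(x)`. Whether the OMP112 remark disappears for a given variable still has to be checked with `-Rpass-missed=openmp-opt`.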

8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 commented 3 years ago

> the data sharing stack falls back to malloc which will just return a nullptr if the device runs out of global memory and crash.

This seems to be the case.

> You should be able to increase the heap size with LIBOMPTARGET_HEAP_SIZE=<N>

The code runs once the heap size is set large enough. As expected, it also runs extremely slowly.

jhuber6 commented 3 years ago

Is this a real illegal access, or is the kernel just running out of memory? For NVPTX, the data sharing stack falls back to malloc, which just returns a nullptr if the device runs out of global memory, and the kernel then crashes. You should be able to increase the heap size with LIBOMPTARGET_HEAP_SIZE=<N>, but I haven't used that option in a while. Globalization can easily blow up the global memory if it's not removed, since it's allocated per thread and there are a lot of threads. You can use -fopenmp-cuda-mode to bypass it as well.
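To put rough numbers on the per-thread scaling, here is a back-of-the-envelope sketch. The function name `globalized_bytes` and the launch figures are hypothetical and not measured from this reproducer. It gives an upper bound that assumes every launched thread holds its globalized data at the same time.

```cpp
#include <cstdint>

// Hypothetical upper-bound estimate: device heap bytes needed if every
// launched thread globalizes `bytes_per_thread` of former stack data.
constexpr std::uint64_t globalized_bytes(std::uint64_t bytes_per_thread,
                                         std::uint64_t threads_per_team,
                                         std::uint64_t num_teams) {
  return bytes_per_thread * threads_per_team * num_teams;
}

// Illustrative figures only: 1 KiB per thread, 256 threads per team,
// 4096 teams add up to 1 GiB of globalized data.
static_assert(globalized_bytes(1024, 256, 4096) == (std::uint64_t{1} << 30),
              "1 KiB x 256 x 4096 == 1 GiB");
```

Even a modest per-thread array can therefore exceed the device heap, which would fit the malloc-returns-nullptr explanation.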

jdoerfert commented 3 years ago

We need to investigate our shared stack; it does not seem to work. We also need to increase the heap-2-stack threshold for GPUs by a lot, probably by setting it to -1 through the driver.

The rest should be taken care of with Attributor enhancements very soon.

llvm / llvm-project

Incorrect data sharing analysis leads to kernel crash #51451
