Open 8a9ceb1a-b5cd-460e-ba77-4c79b2782c90 opened 3 years ago
I found a way to convince the compiler that these stack variables are thread private.
diff --git a/tperf-copy.cc b/tperf-copy.cc
index 2df18a0..d70d1b2 100644
--- a/tperf-copy.cc
+++ b/tperf-copy.cc
@@ -662,6 +662,7 @@ struct FloatNOrder {
inline void load(complex out[length / 2], int x, int parity = 0) const {
real v[length];
norm_type nrm = 0.0;
// #pragma unroll
for (int i=0; i<M; i++) {
@@ -679,6 +680,7 @@ struct FloatNOrder {
inline void save(const complex in[length / 2], int x, int parity = 0) const {
real v[length];
// #pragma unroll
for (int i = 0; i < length / 2; i++) {
@@ -817,7 +819,9 @@ template
ColorSpinor<typename Arg::realOut, Arg::nColor, Arg::nSpin> out;
typename Arg::Basis basis;
basis(out.data, in.data);
arg.out(x_cb, (parity+arg.outParity)&1) = out;
This gets the code to compile and run with acceptable performance.
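For readers unfamiliar with the technique, one standard OpenMP way to mark a local variable as thread private is the `allocate` directive with the predefined `omp_thread_mem_alloc` allocator. The sketch below illustrates that general approach only; it is an assumption and not a reproduction of the patch above. The names `kLength` and `load_sum` are hypothetical, and the directive is guarded because compiler support for it in C++ may vary.

```cpp
#if defined(_OPENMP) && defined(__clang__)
#include <omp.h>  // declares omp_thread_mem_alloc
#endif

constexpr int kLength = 6;  // hypothetical stand-in for `length`

// Fills a per-call scratch buffer and returns the sum of its elements.
double load_sum(int x) {
  double v[kLength];
#if defined(_OPENMP) && defined(__clang__)
  // Ask for thread-private storage for v. Assumption: the compiler accepts
  // the allocate directive on a local array declared in this scope.
#pragma omp allocate(v) allocator(omp_thread_mem_alloc)
#endif
  for (int i = 0; i < kLength; ++i) v[i] = x + i;

  double sum = 0.0;
  for (int i = 0; i < kLength; ++i) sum += v[i];
  return sum;
}
```

Without OpenMP enabled the guard removes the directive and the function behaves like ordinary C++, so the same source can be checked on the host first.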
the data sharing stack falls back to malloc which will just return a nullptr if the device runs out of global memory and crash.
This seems to be the case.
You should be able to increase the heap size with
LIBOMPTARGET_HEAP_SIZE=<N>
The code runs once the heap size is set large enough. As expected, it also runs extremely slowly.
Is this a real illegal access, or is the kernel just running out of memory? For NVPTX, the data sharing stack falls back to malloc, which just returns a nullptr if the device runs out of global memory, and the kernel then crashes. You should be able to increase the heap size with LIBOMPTARGET_HEAP_SIZE=<N>, but I haven't used that option in a while. Globalization can easily blow up the global memory if it's not removed, since it's allocated per thread and there are a lot of those. You can use -fopenmp-cuda-mode to bypass it as well.
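To get a feel for how quickly per-thread globalization adds up, here is a rough back-of-the-envelope sketch. All numbers and the function name `globalized_bytes` are hypothetical. It assumes every thread holds its globalized buffer at the same time and that the heap size is expressed in bytes.

```cpp
#include <cstdint>

// Upper-bound estimate of device heap demand when each thread globalizes
// `elems` values of `elem_size` bytes (hypothetical model, not a runtime API).
constexpr std::uint64_t globalized_bytes(std::uint64_t teams,
                                         std::uint64_t threads_per_team,
                                         std::uint64_t elems,
                                         std::uint64_t elem_size) {
  return teams * threads_per_team * elems * elem_size;
}

// Example: 4096 teams of 128 threads, each with 24 doubles (8 bytes each).
constexpr std::uint64_t kExampleBytes = globalized_bytes(4096, 128, 24, 8);
static_assert(kExampleBytes == 96ull * 1024 * 1024, "96 MiB in this example");
```

A single small array per thread already costs 96 MiB in this example, so a kernel with several such locals can plausibly exhaust a default-sized heap.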
We need to investigate our shared stack; it does not seem to work. We also need to increase the heap-2-stack threshold for GPUs by a lot, probably by setting it to -1 through the driver.
The rest should be taken care of with Attributor enhancements very soon.
Extended Description
https://godbolt.org/z/4o7fbPbYW
First, the compiler finds data sharing, which is incorrect. There's no sharing of these stack variables.
example.cpp:820:62: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  ColorSpinor<typename Arg::realIn, Arg::nColor, Arg::nSpin> in = arg.in(x_cb, (parity+arg.inParity)&1);
                                                             ^
example.cpp:821:63: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  ColorSpinor<typename Arg::realOut, Arg::nColor, Arg::nSpin> out;
                                                              ^
example.cpp:665:8: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  real v[length];
       ^
example.cpp:682:8: remark: Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
  real v[length];
       ^
Second, the generated code crashes in kernels, with
CUDA error: an illegal memory access was encountered
Close inspection (debugging by commenting out code) shows that some of the accesses to the above variables cause the illegal memory access.
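For illustration, here is a much reduced sketch of the kind of pattern that can lead to the OMP112 remark: a per-iteration local whose address is handed to another function. Whether this is the exact trigger in the reported code is an assumption; `kLength`, `fill`, and `last_elements` are hypothetical names and are not taken from the godbolt example.

```cpp
constexpr int kLength = 6;  // hypothetical stand-in for `length`

// Hypothetical helper that receives the address of a per-thread buffer.
void fill(double *v, int x) {
  for (int i = 0; i < kLength; ++i) v[i] = x + i;
}

// Each iteration owns its own v, so no thread reads another thread's data.
// Still, the address of v leaves the loop body through the call to fill();
// if the compiler cannot see through that call, it may treat v as shared.
void last_elements(double *out, int n) {
#pragma omp target teams distribute parallel for map(from : out[0 : n])
  for (int x = 0; x < n; ++x) {
    double v[kLength];
    fill(v, x);
    out[x] = v[kLength - 1];
  }
}
```

Compiled without OpenMP the pragma is ignored and the loop runs serially on the host, which makes it easy to confirm that the logic itself is free of sharing.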