llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

nvptx64-nvidia-cuda::bug49334.cpp failing again: Assertion `std::fabs(bm_value - m_value) < std::numeric_limits<float>::epsilon()' failed. #54081

Open Meinersbur opened 2 years ago

Meinersbur commented 2 years ago

First appearing in https://lab.llvm.org/staging/#/builders/155/builds/2424 https://lab.llvm.org/staging/#/builders/154/builds/1838

******************** TEST 'libomptarget :: nvptx64-nvidia-cuda :: offloading/bug49334.cpp' FAILED ********************
Script:
--
: 'RUN: at line 1';   /home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/./bin/clang++ -fopenmp  -fno-experimental-isel --cuda-path=/opt/cuda  -I /home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.src/openmp/libomptarget/test -I /home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -L /home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/libomptarget -L /home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/runtime/src  -fno-openmp-implicit-rpath -Wl,-rpath,/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/libomptarget -Wl,-rpath,/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/runtime/src -Wl,-rpath,/opt/cuda/lib64 --libomptarget-nvptx-bc-path=/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/libomptarget -fopenmp-targets=nvptx64-nvidia-cuda /home/buildbot-worker/src/llvm-project/openmp/libomptarget/test/offloading/bug49334.cpp -o /home/buildbot-worker/minipc-1050ti-linux/openmp-offload-cuda-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget/test/nvptx64-nvidia-cuda/offloading/Output/bug49334.cpp.tmp && /home/buildbot-worker/minipc-1050ti-linux/openmp-offload-cuda-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget/test/nvptx64-nvidia-cuda/offloading/Output/bug49334.cpp.tmp | /home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/./bin/FileCheck /home/buildbot-worker/src/llvm-project/openmp/libomptarget/test/offloading/bug49334.cpp
--
Exit Code: 2
Command Output (stdout):
--
$ ":" "RUN: at line 1"
$ "/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/./bin/clang++" "-fopenmp" "-fno-experimental-isel" "--cuda-path=/opt/cuda" "-I" "/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.src/openmp/libomptarget/test" "-I" "/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/runtime/src" "-L" "/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/libomptarget" "-L" "/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/runtime/src" "-fno-openmp-implicit-rpath" "-Wl,-rpath,/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/libomptarget" "-Wl,-rpath,/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/runtime/src" "-Wl,-rpath,/opt/cuda/lib64" "--libomptarget-nvptx-bc-path=/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/runtimes/runtimes-bins/openmp/libomptarget" "-fopenmp-targets=nvptx64-nvidia-cuda" "/home/buildbot-worker/src/llvm-project/openmp/libomptarget/test/offloading/bug49334.cpp" "-o" "/home/buildbot-worker/minipc-1050ti-linux/openmp-offload-cuda-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget/test/nvptx64-nvidia-cuda/offloading/Output/bug49334.cpp.tmp"
$ "/home/buildbot-worker/minipc-1050ti-linux/openmp-offload-cuda-runtime/llvm.build/runtimes/runtimes-bins/openmp/libomptarget/test/nvptx64-nvidia-cuda/offloading/Output/bug49334.cpp.tmp"
# command stderr:
bug49334.cpp.tmp: /home/buildbot-worker/src/llvm-project/openmp/libomptarget/test/offloading/bug49334.cpp:64: void BlockMatrix::Compare(const std::vector<float> &) const: Assertion `std::fabs(bm_value - m_value) < std::numeric_limits<float>::epsilon()' failed.
error: command failed with exit status: -6
$ "/home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/./bin/FileCheck" "/home/buildbot-worker/src/llvm-project/openmp/libomptarget/test/offloading/bug49334.cpp"
# command stderr:
FileCheck error: '<stdin>' is empty.
FileCheck command line:  /home/buildbot-worker/minipc-1050ti-linux/rundir/llvm.build/./bin/FileCheck /home/buildbot-worker/src/llvm-project/openmp/libomptarget/test/offloading/bug49334.cpp
error: command failed with exit status: 2
--
********************

The failure is flaky, so the problem might have been introduced earlier already.

Previous failures of bug49334: #53730 #50083 #49284
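
For readers unfamiliar with the check that aborts in the log above, here is a minimal sketch of an element-wise comparison using the same absolute tolerance as the assertion text. `NearlyEqual` and `FirstMismatch` are hypothetical helpers introduced only for illustration; they are not part of bug49334.cpp, whose `BlockMatrix::Compare` asserts instead of returning an index.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Same condition as the failing assertion: the absolute difference must be
// strictly below the float machine epsilon.
bool NearlyEqual(float a, float b) {
  return std::fabs(a - b) < std::numeric_limits<float>::epsilon();
}

// Hypothetical helper (not in the test): returns the index of the first
// element pair that fails the check, or -1 if every compared pair passes.
// Only the common prefix is compared if the sizes differ.
long FirstMismatch(const std::vector<float> &lhs,
                   const std::vector<float> &rhs) {
  const std::size_t n = lhs.size() < rhs.size() ? lhs.size() : rhs.size();
  for (std::size_t i = 0; i < n; ++i)
    if (!NearlyEqual(lhs[i], rhs[i]))
      return static_cast<long>(i);
  return -1;
}
```

Reporting an index rather than aborting might make it easier to see which block a flaky run got wrong. That is a debugging suggestion, not something this issue proposes.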

llvmbot commented 2 years ago

@llvm/issue-subscribers-openmp

shiltian commented 2 years ago

I get about one failure in 100+ runs. I'll defer the fix until we get other reports that are easier to reproduce.

shiltian commented 2 years ago

I think this problem is more likely caused by bugs in the existing tasking implementation. I made some changes to the function BlockMatMul_TargetNowait:

int BlockMatMul_TargetNowait(BlockMatrix &A, BlockMatrix &B, BlockMatrix &C) {
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    if (tid == 0) {
      for (int i = 0; i < N / BS; ++i)
        for (int j = 0; j < N / BS; ++j) {
          int *BlockC = C.GetBlock(i, j);
          for (int k = 0; k < N / BS; ++k) {
            int *BlockA = A.GetBlock(i, k);
            int *BlockB = B.GetBlock(k, j);
#pragma omp task depend(in : BlockA[0], BlockB[0]) depend(inout : BlockC[0])
            {
// clang-format off
#pragma omp target map(to: BlockA[:BS * BS], BlockB[:BS * BS])                 \
                   map(tofrom: BlockC[:BS * BS])
// clang-format on
#pragma omp parallel for
              for (int ii = 0; ii < BS; ii++)
                for (int jj = 0; jj < BS; jj++)
                  for (int kk = 0; kk < BS; ++kk)
                    BlockC[ii + jj * BS] +=
                        BlockA[ii + kk * BS] * BlockB[kk + jj * BS];
            }
          }
        }
    }
  }
  return 0;
}

No nowait is used, which means no hidden helper task is involved. These are now pure regular OpenMP tasks, and I can still observe the failure.
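
To separate tasking from offloading further, one could drop the target region as well and compare a task-based host run against a serial reference. The following is a minimal sketch of that idea, not code from the test: `HostBlockMatrix`, `Fill`, `MultiplyBlock`, `BlockMatMulSerial`, `BlockMatMulHostTasks`, and the sizes `kN` and `kBS` are all hypothetical names chosen for illustration. It assumes compilation with `-fopenmp`; without that flag the pragmas are ignored and both functions run serially.

```cpp
#include <vector>

// Hypothetical sizes for illustration; the real test defines its own N and BS.
constexpr int kN = 8;
constexpr int kBS = 4;
constexpr int kNB = kN / kBS; // blocks per row and per column

using Block = std::vector<int>; // kBS * kBS entries

// Hypothetical stand-in for the test's BlockMatrix: a kNB x kNB grid of blocks.
struct HostBlockMatrix {
  std::vector<Block> blocks;
  HostBlockMatrix() : blocks(kNB * kNB, Block(kBS * kBS, 0)) {}
  int *GetBlock(int i, int j) { return blocks[i * kNB + j].data(); }
};

// Fill with small deterministic values so results are exactly comparable.
void Fill(HostBlockMatrix &M, int seed) {
  int v = seed;
  for (Block &b : M.blocks)
    for (int &x : b)
      x = (v++ % 7) - 3;
}

// C += A * B for one block, using the same indexing as the loop in the issue.
void MultiplyBlock(const int *A, const int *B, int *C) {
  for (int ii = 0; ii < kBS; ++ii)
    for (int jj = 0; jj < kBS; ++jj)
      for (int kk = 0; kk < kBS; ++kk)
        C[ii + jj * kBS] += A[ii + kk * kBS] * B[kk + jj * kBS];
}

// Serial reference: no tasks at all.
void BlockMatMulSerial(HostBlockMatrix &A, HostBlockMatrix &B,
                       HostBlockMatrix &C) {
  for (int i = 0; i < kNB; ++i)
    for (int j = 0; j < kNB; ++j)
      for (int k = 0; k < kNB; ++k)
        MultiplyBlock(A.GetBlock(i, k), B.GetBlock(k, j), C.GetBlock(i, j));
}

// Host-only tasks: one thread creates the tasks; the inout dependence on
// BlockC[0] orders all updates to the same C block.
void BlockMatMulHostTasks(HostBlockMatrix &A, HostBlockMatrix &B,
                          HostBlockMatrix &C) {
#pragma omp parallel
  {
#pragma omp single
    for (int i = 0; i < kNB; ++i)
      for (int j = 0; j < kNB; ++j) {
        int *BlockC = C.GetBlock(i, j);
        for (int k = 0; k < kNB; ++k) {
          int *BlockA = A.GetBlock(i, k);
          int *BlockB = B.GetBlock(k, j);
#pragma omp task depend(in : BlockA[0], BlockB[0]) depend(inout : BlockC[0])
          MultiplyBlock(BlockA, BlockB, BlockC);
        }
      }
  } // all tasks have completed by the end of the parallel region
}
```

If the two result matrices ever differ on a host-only build, the problem would lie in the tasking runtime independently of the CUDA plugin; if they never differ, the interaction between tasks and target regions remains the more likely suspect.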

Meinersbur commented 2 years ago

> I get about one failure in 100+ runs. I'll defer the fix until we get other reports that are easier to reproduce.

With previous failures of bug49334, the following increased how often the problem occurred:

  1. Reducing N.
  2. Running lit with parallelism, i.e. running multiple other offloading tests at the same time.
  3. Running the same bug49334 executable in parallel. Launching 3 of them was usually sufficient.
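
The third item can be sketched as a small POSIX launcher that starts several copies of a command at once and counts how many of them fail. `CountConcurrentFailures` is a hypothetical helper written for illustration, not part of lit or libomptarget; the command string passed to it (for example the path to the built bug49334 executable) is an assumption about the local setup.

```cpp
#include <string>
#include <vector>

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs `copies` instances of `command` concurrently via
// /bin/sh and returns how many did not exit with status 0. A process killed
// by a signal (such as SIGABRT from a failed assert) counts as a failure.
int CountConcurrentFailures(const std::string &command, int copies) {
  std::vector<pid_t> children;
  for (int c = 0; c < copies; ++c) {
    pid_t pid = fork();
    if (pid == 0) {
      // Child: replace this process with the shell running the command.
      execl("/bin/sh", "sh", "-c", command.c_str(),
            static_cast<char *>(nullptr));
      _exit(127); // only reached if exec itself failed
    }
    if (pid > 0)
      children.push_back(pid);
  }

  // Copies that could not even be forked are counted as failures.
  int failures = copies - static_cast<int>(children.size());
  for (pid_t pid : children) {
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
      ++failures;
  }
  return failures;
}
```

Calling it in a loop with `copies` set to 3 and stopping at the first nonzero return would approximate the manual procedure described above. Whether 3 copies is enough on a given machine is something to determine by experiment.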

The builder ran without failure for ~6 days (http://meinersbur.de:8011/#/builders/11 and now https://lab.llvm.org/staging/#/builders/155/builds/2424) after the last problem was fixed, and is now failing regularly. A recent commit might have introduced the problem.

Meinersbur commented 2 years ago

I bisected the problem, which pointed to 75812e7704fcb7cf89e107cc773c5a2118f380ab. Cherry-picking it on top of 092a5bb72ba8cc5a6cec02cfe61f70130a2c1282 also shows the error, so the issue was actually always there.

shiltian commented 2 years ago

As I mentioned in https://github.com/llvm/llvm-project/issues/54081#issuecomment-1051093259, it is possible that the issue is in the existing tasking implementation.

shiltian commented 10 months ago

The test is disabled for now. I am removing myself in case anyone else is interested.