ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

Ginkgo 1.7.0 tests capture stderr and fail due to different number of mpirun warnings #1567

Open lahwaacz opened 6 months ago

lahwaacz commented 6 months ago

Hi,

I'm creating a stable ginkgo-hpc package for Arch Linux and I'm getting some issues. Besides #1564, #1566 and #1143, there are some tests that fail with the following error:

281/285 Test #283: benchmark_multi_vector_distributed .......................***Failed    1.27 sec
TEST: '/usr/bin/mpiexec' '-n' '3' '/build/ginkgo-hpc/src/build/benchmark/blas/distributed/multi_vector_distributed' '-input' '[{"n": 100}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,6 @@

+[arch-nspawn-268570:99043] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99045] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99044] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

282/285 Test #284: benchmark_spmv_distributed ...............................***Failed    1.27 sec
TEST: '/usr/bin/mpiexec' '-n' '3' '/build/ginkgo-hpc/src/build/benchmark/spmv/distributed/spmv_distributed' '-input' '[{"size": 100, "stencil": "7pt", "comm_pattern": "stencil"}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,6 @@

+[arch-nspawn-268570:99066] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99065] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99064] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

283/285 Test #285: benchmark_solver_distributed .............................***Failed    1.21 sec
TEST: '/build/ginkgo-hpc/src/build/benchmark/solver/distributed/solver_distributed' '-input' '[{"size": 100, "stencil": "7pt", "comm_pattern": "stencil", "optimal": {"spmv": "csr-csr"}}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,4 @@

+[arch-nspawn-268570:99060] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

The build system has no GPU, but ROCm/HIP is installed for building the -hip variant of the package. These tests, however, are built with -DGINKGO_BUILD_HIP=OFF (I know it is pointless to run HIP tests without a GPU).

Arch Linux ships a ROCm-aware OpenMPI 5.0, which is what prints the "No HIP capabale device found. Disabling component." message from each rank. Hence, if you compare the output of a serial test with the output of the same test run through mpirun, there will necessarily be a difference. The tests should be designed better: assuming that the MPI library itself does not print anything is rather naive.
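To illustrate the kind of change I mean, here is a minimal sketch of a comparison that ignores MPI runtime diagnostics before diffing stderr. It is not taken from the Ginkgo test framework: the function names are hypothetical, and the `[hostname:pid] ` prefix pattern is an assumption based only on the lines in the logs above.

```python
import difflib
import re

# Assumption: MPI runtime diagnostics start with "[hostname:pid] ",
# as in the log lines quoted in this issue.
MPI_PREFIX = re.compile(r"^\[[^\]\s]+:\d+\] ")


def strip_mpi_noise(stderr_text):
    """Return the lines of stderr_text that do not look like MPI diagnostics."""
    return [line for line in stderr_text.splitlines()
            if not MPI_PREFIX.match(line)]


def stderr_diff(expected, actual):
    """Unified diff of the filtered outputs; an empty list means they match."""
    return list(difflib.unified_diff(strip_mpi_noise(expected),
                                     strip_mpi_noise(actual),
                                     lineterm=""))


expected = (
    "This is Ginkgo 1.7.0 (master)\n"
    "    running with core module 1.7.0 (master)\n"
    "Running on reference(0)\n"
)
actual = (
    "[arch-nspawn-268570:99043] No HIP capabale device found. Disabling component.\n"
    "[arch-nspawn-268570:99045] No HIP capabale device found. Disabling component.\n"
    + expected
)

print(stderr_diff(expected, actual))  # prints []
```

With such a filter the number of ranks (and therefore the number of warnings) would no longer affect the comparison, at the cost of also hiding any genuine output that happens to share the same prefix.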

upsj commented 6 months ago

In the short term, I would suggest disabling the corresponding tests using ctest -E benchmark_.*_distributed. Changing this behavior would require some refactoring of the benchmark code that we can't prioritize immediately. The benchmarks are not designed for easy testability; the tests were added after the fact to enable some refactoring, so they are mainly intended for us developers.
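As a quick sanity check of that exclusion pattern, the sketch below uses Python's `re` module as a stand-in for CTest's own regex engine, on the assumption that the two agree for a pattern this simple. The last test name is hypothetical and only there to show a non-match. When running the actual command, it is probably safest to quote the pattern (`ctest -E 'benchmark_.*_distributed'`) so the shell cannot expand the `*`.

```python
import re

# The pattern passed to `ctest -E` in the suggestion above.
EXCLUDE = re.compile(r"benchmark_.*_distributed")

tests = [
    "benchmark_multi_vector_distributed",
    "benchmark_spmv_distributed",
    "benchmark_solver_distributed",
    "benchmark_spmv",  # hypothetical non-distributed test name
]

# Like `ctest -E`, a test is excluded if the pattern matches anywhere in its name.
excluded = [name for name in tests if EXCLUDE.search(name)]
kept = [name for name in tests if not EXCLUDE.search(name)]

print(excluded)
print(kept)
```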