Closed ndellingwood closed 2 months ago
Automatic mention of the @trilinos/muelu team
I'm also seeing failures of MueLu_UnitTestsTpetra_kokkos_MPI_1 and MueLu_UnitTestsTpetra_kokkos_MPI_4 with cuda/11.2.2 + gcc/8.5.0 CUDA builds (non-UVM), for example on the Weaver rhel8 queue (Power9 + Volta70):
MueLu_UnitTestsTpetra_kokkos_MPI_1 summary
The following tests FAILED:
132. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ...
136. Regression_std_complex0double0_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ...
143. SaPFactory_kokkos_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_ConstrainRowOptimalScalarPDE_UnitTest ...
147. SaPFactory_kokkos_double_int_longlong_Tpetra_KokkosCompat_KokkosSerialWrapperNode_ConstrainRowOptimalScalarPDE_UnitTest ...
Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest:
...
Smoother (level 1) pre : KLU2 solver interface
Smoother (level 1) post : no smoother
=======================================================================================================================
TimeMonitor results over 1 processor
Timer Name Global time (num calls)
-----------------------------------------------------------------------------------------------------------------------
MueLu setup time (Laplace1D) 0.0223 (1)
=======================================================================================================================
2 = 2 == H->GetGlobalNumLevels() = 2 : passed
Tpetra::Details::DeepCopyCounter::get_count_different_space() = 42 == targetNumDeepCopies = 34 : FAILED ==> /home/ndellin/trilinos/Trilinos-pristine/packages/muelu/test/unit_tests_kokkos/Regression.cpp:98
Tpetra::Details::DeepCopyCounter::get_count_different_space() = 2 == 2 = 2 : passed
...
I am not seeing failures with MueLu_CreateOperatorTpetra_MPI_1 in the cuda/11.2 build.
@cwpearson It looks like these builds are seeing the same deep_copy counts as before #13052. Is it possible that the logic in #13052 is not quite correct, or that the TPL is not actually used for the spgemm?
I think this might be the issue: https://github.com/trilinos/Trilinos/blob/5eb4f1e73faf3127aced7b0f8712f499488a7aed/packages/muelu/test/unit_tests_kokkos/Regression.cpp#L97
The MueLu_UnitTestsTpetra_kokkos_MPI_1 and MueLu_UnitTestsTpetra_kokkos_MPI_4 tests also fail with cuda/11.8 on Weaver.
@cgcgcg the MueLu_UnitTestsTpetra_kokkos_MPI_1* tests are passing now with #13313, thank you!
@ndellingwood Is that all of them? Or is the CreateOperator one still failing?
@cgcgcg the CreateOperator failure is not consistent; it may be an artifact of the way I ran the tests. I'll monitor it.
The MueLu_UnitTestsTpetra_kokkos_MPI_1 failure is still showing up in the cuda/11.2 build with kokkos@develop. I need to check whether it is reproducible on Trilinos@develop without an updated Kokkos version.
@cgcgcg I was able to reproduce the MueLu_UnitTestsTpetra_kokkos_MPI_1 failure with a6da8e51257f082621c65682b4793a70ca9163c8 on Trilinos@develop with cuda/11.2 (no Kokkos updates). This was on Weaver.
Same failures as https://github.com/trilinos/Trilinos/issues/13310#issuecomment-2261598735
This is a reproducer setup for Weaver (rhel8 queue):
# Interactive compute node
bsub -Is -n 1 -q rhel8 -gpu "num=4" bash
export TRILINOS_DIR=<your-path-to-source>
export KOKKOS_PATH=$TRILINOS_DIR/packages/kokkos
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX="$KOKKOS_PATH/bin/nvcc_wrapper"
cmake \
-D CMAKE_CXX_STANDARD="17" \
-D CMAKE_INSTALL_PREFIX=$PWD/install \
-D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-D TPL_ENABLE_CUSPARSE:BOOL=ON \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_ENABLE_ALL_PACKAGES=ON \
-DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
\
-D Trilinos_ENABLE_Kokkos=ON \
-D Kokkos_ARCH_VOLTA70=ON \
-D Kokkos_ARCH_POWER9=ON \
-D Kokkos_ENABLE_CUDA=ON \
-D Kokkos_ENABLE_CUDA_LAMBDA=ON \
-D Kokkos_ENABLE_CUDA_UVM=OFF \
-D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF \
-DTrilinos_ENABLE_Tpetra=ON \
-D Tpetra_ENABLE_TESTS=ON \
-DTrilinos_ENABLE_Stokhos=ON \
-D MueLu_ENABLE_TESTS=ON \
\
$TRILINOS_DIR
make -j16
ctest -R MueLu_UnitTestsTpetra_kokkos_MPI_1
Edit: MueLu_UnitTestsTpetra_kokkos_MPI_4 exhibits a similar failure.
This Weaver failure is almost certainly because Kokkos Kernels doesn't use cuSPARSE SpGEMM for CUDA >= 11 and < 11.4, so there are more deep copies than we'd otherwise expect.
The regression test logic should be updated to reflect this.
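To make the suggestion concrete, here is a minimal sketch of what a version-dependent expectation could look like. It is not the change that landed in the referenced PRs. The helper names `spgemm_falls_back` and `expected_deep_copies` are hypothetical, the counts 34 and 42 are simply the target and observed values from the cuda/11.2.2 log above, and using 34 for every other CUDA version is an assumption. The version argument follows the `CUDA_VERSION` encoding (major * 1000 + minor * 10, so 11.2 is 11020).

```cpp
// Hypothetical helper, not part of MueLu or Kokkos Kernels.
// True for the CUDA range in which, per this thread, Kokkos Kernels does not
// use cuSPARSE SpGEMM: CUDA >= 11.0 and < 11.4.
constexpr bool spgemm_falls_back(int cuda_version) {
  return cuda_version >= 11000 && cuda_version < 11040;
}

// Hypothetical expected deep-copy count for the H2D regression test.
// 42 is the count observed with cuda/11.2.2 in the log above and 34 is the
// target reported there; treating 34 as correct elsewhere is an assumption.
constexpr int expected_deep_copies(int cuda_version) {
  return spgemm_falls_back(cuda_version) ? 42 : 34;
}

// Compile-time checks of the range boundaries.
static_assert(expected_deep_copies(11020) == 42, "CUDA 11.2: fallback path");
static_assert(expected_deep_copies(11040) == 34, "CUDA 11.4: cuSPARSE path");
```

In the test itself, something like `expected_deep_copies(CUDA_VERSION)` could then stand in for the hard-coded `targetNumDeepCopies`, guarded so that it only applies to CUDA builds.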
MueLu_UnitTestsTpetra_kokkos_MPI_1 and MueLu_UnitTestsTpetra_kokkos_MPI_4 have been passing since #13356, thanks @cwpearson
Bug Report
@trilinos/muelu
Description
The following tests fail in HIP builds of Trilinos with rocm/5.6.1:
MueLu_UnitTestsTpetra_kokkos_MPI_1 (the Tpetra::Details::DeepCopyCounter::get_count_different_space() check indicates this might be related to #13292?)
MueLu_CreateOperatorTpetra_MPI_4
Steps to Reproduce
module load python rocm/5.6.1 cmake openmpi/4.1.5 openblas/0.3.23 ninja/1.11.1
module list
export OMPI_CXX=$ROCM_PATH/bin/hipcc
export TPETRA_ASSUME_GPU_AWARE_MPI=1
CMake configuration
cmake \
-G"Ninja" \
-DCMAKE_INSTALL_PREFIX=$PWD/install \
-DCMAKE_CXX_STANDARD="17" \
-DCMAKE_CXX_COMPILER="`which mpicxx`" \
-DCMAKE_C_COMPILER="`which mpicc`" \
-DCMAKE_FORTRAN_COMPILER="`which mpifort`" \
-DCMAKE_BUILD_TYPE="RELEASE" \
-DBUILD_SHARED_LIBS=OFF \
\
-DTrilinos_ENABLE_ALL_PACKAGES=OFF \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON \
-DTrilinos_ASSERT_MISSING_PACKAGES=OFF \
-DTrilinos_ALLOW_NO_PACKAGES=OFF \
-DTrilinos_ENABLE_OpenMP=OFF \
-DTrilinos_ENABLE_TESTS=ON \
\
-DTrilinos_ENABLE_Amesos2=ON \
-DAmesos2_ENABLE_SuperLU=OFF \
-DAmesos2_ENABLE_KLU2=ON \
-DTrilinos_ENABLE_Belos=ON \
-DTrilinos_ENABLE_Ifpack2=ON \
-DTrilinos_ENABLE_Kokkos=ON \
-DKokkos_ARCH_VEGA90A=ON \
-DKokkos_ENABLE_CUDA=OFF \
-DKokkos_ENABLE_HIP=ON \
-DKokkos_ENABLE_OPENMP=OFF \
-DTrilinos_ENABLE_KokkosKernels=ON \
-DTrilinos_ENABLE_MueLu=ON \
-DTrilinos_ENABLE_Tpetra=ON \
-DTpetra_ENABLE_CUDA=OFF \
-DTpetra_INST_HIP=ON \
-DTpetra_INST_SERIAL=OFF \
-DTpetra_INST_OPENMP=OFF \
-DTpetra_INST_DOUBLE=ON \
-DTrilinos_ENABLE_Gtest=ON \
-DTrilinos_ENABLE_Teuchos=ON \
-DTrilinos_ENABLE_Xpetra=ON \
-DTrilinos_ENABLE_Zoltan2=ON \
-DTrilinos_ENABLE_Panzer=ON \
-DTPL_ENABLE_BLAS=ON \
-D BLAS_LIBRARY_DIRS:FILEPATH="${OPENBLAS_ROOT}/lib" \
-D BLAS_LIBRARY_NAMES:STRING="openblas" \
-DTPL_ENABLE_LAPACK=ON \
-D LAPACK_INCLUDE_DIRS:FILEPATH="${OPENBLAS_ROOT}/include" \
-D LAPACK_LIBRARY_DIRS:FILEPATH="${OPENBLAS_ROOT}/lib" \
-D LAPACK_LIBRARY_NAMES:STRING="openblas" \
-DTPL_ENABLE_Netcdf=OFF \
-DTPL_ENABLE_MPI=ON \
-DMPI_USE_COMPILER_WRAPPERS=ON \
-DMPI_EXEC="mpirun" \
-DMPI_EXEC_NUMPROCS_FLAG="-np" \
-DMPI_EXEC_POST_NUMPROCS_FLAGS:STRING="-bind-to;none" \
\
$TRILINOS_DIR
make -j16
ctest