trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

MueLu: MueLu_UnitTestsTpetra_kokkos_MPI_1,4 (cuda/11.2 Cuda and rocm/5.6.1 Hip builds) and MueLu_CreateOperatorTpetra_MPI_4 test failures (rocm/5.6.1 Hip builds only) #13310

Closed ndellingwood closed 2 months ago

ndellingwood commented 3 months ago

Bug Report

@trilinos/muelu

Description

The following tests fail in Hip builds of Trilinos with rocm/5.6.1:

MueLu_UnitTestsTpetra_kokkos_MPI_1

...
23:30:32  Smoother (level 1) pre  : KLU2 solver interface
23:30:32  Smoother (level 1) post : no smoother
23:30:32  
23:30:32  =======================================================================================================================
23:30:32  
23:30:32                                           TimeMonitor results over 1 processor
23:30:32  
23:30:32  Timer Name                      Global time (num calls)    
23:30:32  -----------------------------------------------------------------------------------------------------------------------
23:30:32  MueLu setup time (Laplace1D)    0.0339 (1)                 
23:30:32  =======================================================================================================================
23:30:32  2 = 2 == H->GetGlobalNumLevels() = 2 : passed
23:30:32  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 42 == targetNumDeepCopies = 19 : FAILED ==> /home/jenkins/caraway-new/workspace/Trilinos_Caraway_Hip_Serial_Rocm5_6_1_MI210/Trilinos/packages/muelu/test/unit_tests_kokkos/Regression.cpp:98
23:30:32  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 2 == 2 = 2 : passed
23:30:32  *** Teuchos::StackedTimer::report() - Remainder for a level will be ***
23:30:32  *** incorrect if a timer in the level does not exist on every rank  ***
23:30:32  *** of the MPI Communicator.                                        ***
23:30:32  H2D: 0.0370237 [1]
...
23:30:32  [FAILED]  (0.0367 sec) Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosHIPWrapperNode_H2D_UnitTest
23:30:32  Location: /home/jenkins/caraway-new/workspace/Trilinos_Caraway_Hip_Serial_Rocm5_6_1_MI210/Trilinos/packages/muelu/test/unit_tests_kokkos/Regression.cpp:31

The failing Tpetra::Details::DeepCopyCounter::get_count_different_space() check (42 deep copies counted against a target of 19) suggests this might be related to #13292.

MueLu_CreateOperatorTpetra_MPI_4

...
23:18:51 ===================================== Solve 5: LevelWrap, Fast Way, P, R =====================================
23:18:51 --- kokkos/Output/operator_solve_1_np4_tpetra.gold_filtered    2024-07-30 23:18:46.070868930 -0600
23:18:51 +++ kokkos/Output/operator_solve_1_np4_tpetra.out_filtered 2024-07-30 23:18:46.074218077 -0600
23:18:51 @@ -114,7 +114,7 @@
23:18:51  --------------------------------------------------------------------------------
23:18:51  Number of levels    = 4
23:18:51  Operator complexity = 1.36
23:18:51 -Smoother complexity = <ignored>
23:18:51 +Smoother complexity = 1.61
23:18:51  Cycle type          = V
23:18:51  
23:18:51  level  rows   nnz    nnz/row  c ratio  procs
23:18:51 kokkos/Output/operator_solve_1_np4_tpetra: failed

Steps to Reproduce

  1. SHA1: [9764ffbb4dbf6fa3187b4896d1f971fe39db86f2]
  2. Configure script: caraway MI210 queue
    
    export TRILINOS_DIR=<path-to-your-repo>

    module load python rocm/5.6.1 cmake openmpi/4.1.5 openblas/0.3.23 ninja/1.11.1
    module list
    export OMPI_CXX=$ROCM_PATH/bin/hipcc
    export TPETRA_ASSUME_GPU_AWARE_MPI=1

CMake configuration

cmake \
      -G"Ninja" \
      -DCMAKE_INSTALL_PREFIX=$PWD/install \
      -DCMAKE_CXX_STANDARD="17" \
      -DCMAKE_CXX_COMPILER="$(which mpicxx)" \
      -DCMAKE_C_COMPILER="$(which mpicc)" \
      -DCMAKE_FORTRAN_COMPILER="$(which mpifort)" \
      -DCMAKE_BUILD_TYPE="RELEASE" \
      -DBUILD_SHARED_LIBS=OFF \
      \
      -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
      -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
      -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON \
      -DTrilinos_ASSERT_MISSING_PACKAGES=OFF \
      -DTrilinos_ALLOW_NO_PACKAGES=OFF \
      -DTrilinos_ENABLE_OpenMP=OFF \
      -DTrilinos_ENABLE_TESTS=ON \
      \
      -DTrilinos_ENABLE_Amesos2=ON \
      -DAmesos2_ENABLE_SuperLU=OFF \
      -DAmesos2_ENABLE_KLU2=ON \
      -DTrilinos_ENABLE_Belos=ON \
      -DTrilinos_ENABLE_Ifpack2=ON \
      -DTrilinos_ENABLE_Kokkos=ON \
      -DKokkos_ARCH_VEGA90A=ON \
      -DKokkos_ENABLE_CUDA=OFF \
      -DKokkos_ENABLE_HIP=ON \
      -DKokkos_ENABLE_OPENMP=OFF \
      -DTrilinos_ENABLE_KokkosKernels=ON \
      -DTrilinos_ENABLE_MueLu=ON \
      -DTrilinos_ENABLE_Tpetra=ON \
      -DTpetra_ENABLE_CUDA=OFF \
      -DTpetra_INST_HIP=ON \
      -DTpetra_INST_SERIAL=OFF \
      -DTpetra_INST_OPENMP=OFF \
      -DTpetra_INST_DOUBLE=ON \
      -DTrilinos_ENABLE_Gtest=ON \
      -DTrilinos_ENABLE_Teuchos=ON \
      -DTrilinos_ENABLE_Xpetra=ON \
      -DTrilinos_ENABLE_Zoltan2=ON \
      -DTrilinos_ENABLE_Panzer=ON \
      -DTPL_ENABLE_BLAS=ON \
      -D BLAS_LIBRARY_DIRS:FILEPATH="${OPENBLAS_ROOT}/lib" \
      -D BLAS_LIBRARY_NAMES:STRING="openblas" \
      -DTPL_ENABLE_LAPACK=ON \
      -D LAPACK_INCLUDE_DIRS:FILEPATH="${OPENBLAS_ROOT}/include" \
      -D LAPACK_LIBRARY_DIRS:FILEPATH="${OPENBLAS_ROOT}/lib" \
      -D LAPACK_LIBRARY_NAMES:STRING="openblas" \
      -DTPL_ENABLE_Netcdf=OFF \
      -DTPL_ENABLE_MPI=ON \
      -DMPI_USE_COMPILER_WRAPPERS=ON \
      -DMPI_EXEC="mpirun" \
      -DMPI_EXEC_NUMPROCS_FLAG="-np" \
      -DMPI_EXEC_POST_NUMPROCS_FLAGS:STRING="-bind-to;none" \
      \
$TRILINOS_DIR

make -j16

ctest

github-actions[bot] commented 3 months ago

Automatic mention of the @trilinos/muelu team

ndellingwood commented 3 months ago

I'm also seeing failures of MueLu_UnitTestsTpetra_kokkos_MPI_1 and MueLu_UnitTestsTpetra_kokkos_MPI_4 in non-UVM Cuda builds with cuda/11.2.2 + gcc/8.5.0 (for example on the Weaver rhel8 queue, Power9+Volta70):

MueLu_UnitTestsTpetra_kokkos_MPI_1 summary

The following tests FAILED:
    132. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ...
    136. Regression_std_complex0double0_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ...
    143. SaPFactory_kokkos_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_ConstrainRowOptimalScalarPDE_UnitTest ...
    147. SaPFactory_kokkos_double_int_longlong_Tpetra_KokkosCompat_KokkosSerialWrapperNode_ConstrainRowOptimalScalarPDE_UnitTest ...

Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest:

...
 Smoother (level 1) pre  : KLU2 solver interface
 Smoother (level 1) post : no smoother

 =======================================================================================================================

                                          TimeMonitor results over 1 processor

 Timer Name                      Global time (num calls)
 -----------------------------------------------------------------------------------------------------------------------
 MueLu setup time (Laplace1D)    0.0223 (1)
 =======================================================================================================================
 2 = 2 == H->GetGlobalNumLevels() = 2 : passed
 Tpetra::Details::DeepCopyCounter::get_count_different_space() = 42 == targetNumDeepCopies = 34 : FAILED ==> /home/ndellin/trilinos/Trilinos-pristine/packages/muelu/test/unit_tests_kokkos/Regression.cpp:98
 Tpetra::Details::DeepCopyCounter::get_count_different_space() = 2 == 2 = 2 : passed
...

I am not seeing failures with MueLu_CreateOperatorTpetra_MPI_1 in the cuda/11.2 build.

cgcgcg commented 3 months ago

@cwpearson It looks like these builds are seeing the same deep_copy counts as before #13052. Is it possible that the logic in #13052 is not quite correct, or that the TPL is not actually used for the spgemm?

I think this might be the issue: https://github.com/trilinos/Trilinos/blob/5eb4f1e73faf3127aced7b0f8712f499488a7aed/packages/muelu/test/unit_tests_kokkos/Regression.cpp#L97

ndellingwood commented 3 months ago

The MueLu_UnitTestsTpetra_kokkos_MPI_1,4 tests also fail with cuda/11.8 on Weaver.

ndellingwood commented 3 months ago

@cgcgcg the MueLu_UnitTestsTpetra_kokkos_MPI_1* tests are passing now with #13313, thank you!

cgcgcg commented 3 months ago

@ndellingwood Is that all of them? Or is the CreateOperator one still failing?

ndellingwood commented 3 months ago

@cgcgcg the CreateOperator failure is not consistent and may be an artifact of the way I ran the tests; I'll monitor it. The MueLu_UnitTestsTpetra_kokkos_MPI_1 failure still shows up in the cuda/11.2 build with kokkos@develop. I need to check whether it is reproducible on Trilinos@develop without an updated Kokkos version.

ndellingwood commented 3 months ago

@cgcgcg I was able to reproduce the MueLu_UnitTestsTpetra_kokkos_MPI_1 failure on Weaver with cuda/11.2, using Trilinos@develop at a6da8e51257f082621c65682b4793a70ca9163c8 (no Kokkos updates).

Same failures as https://github.com/trilinos/Trilinos/issues/13310#issuecomment-2261598735

This is a reproducer setup for weaver (rhel8 queue):

# Interactive compute node
bsub -Is -n 1 -q rhel8 -gpu "num=4" bash

export TRILINOS_DIR=<your-path-to-source>
export KOKKOS_PATH=$TRILINOS_DIR/packages/kokkos

export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX="$KOKKOS_PATH/bin/nvcc_wrapper"

cmake \
      -D CMAKE_CXX_STANDARD="17" \
      -D CMAKE_INSTALL_PREFIX=$PWD/install \
      -D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
      -D TPL_ENABLE_CUSPARSE:BOOL=ON \
      -DTrilinos_ENABLE_TESTS=ON \
      -DTrilinos_ENABLE_ALL_PACKAGES=ON \
      -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
      \
      -D Trilinos_ENABLE_Kokkos=ON \
      -D Kokkos_ARCH_VOLTA70=ON \
      -D Kokkos_ARCH_POWER9=ON \
      -D Kokkos_ENABLE_CUDA=ON \
      -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
      -D Kokkos_ENABLE_CUDA_UVM=OFF \
      -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF \
      -DTrilinos_ENABLE_Tpetra=ON \
      -D Tpetra_ENABLE_TESTS=ON \
      -DTrilinos_ENABLE_Stokhos=ON \
      -D MueLu_ENABLE_TESTS=ON \
      \
$TRILINOS_DIR

make -j16

ctest -R MueLu_UnitTestsTpetra_kokkos_MPI_1

Edit: MueLu_UnitTestsTpetra_kokkos_MPI_4 exhibits similar failure

cwpearson commented 3 months ago

This Weaver failure is almost certainly because Kokkos Kernels doesn't use cuSPARSE SpGEMM for CUDA >= 11 and < 11.4, so there are more deep copies than we'd otherwise expect.

https://github.com/kokkos/kokkos-kernels/blob/fd919af1190dc0f39045585faa9c92cbb842dc19/sparse/tpls/KokkosSparse_spgemm_symbolic_tpl_spec_avail.hpp#L64-L69

The regression test logic should be updated to reflect this.
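
A minimal sketch of what such version-dependent logic could look like. This is not the actual change to Regression.cpp. The helper names cusparseSpgemmAvailable and expectedDeepCopies are hypothetical, and the counts 34 and 42 are simply the target and observed values from the Weaver log earlier in this thread, used here for illustration. The only outside fact assumed is that CUDA encodes its version as 1000 * major + 10 * minor (so CUDA 11.2 is 11020).

```cpp
// Hypothetical helpers for illustration; not the code in Regression.cpp
// or in Kokkos Kernels.

// True when the cuSPARSE SpGEMM path is assumed to be used.
// Per the comment above, it is skipped for CUDA >= 11.0 and < 11.4.
// cudaVersion uses the CUDA_VERSION encoding: 1000 * major + 10 * minor.
constexpr bool cusparseSpgemmAvailable(int cudaVersion) {
  return !(cudaVersion >= 11000 && cudaVersion < 11040);
}

// Pick the deep-copy target for the H2D regression check.
// 34 (target) and 42 (observed) come from the Weaver log; illustrative only.
constexpr int expectedDeepCopies(int cudaVersion) {
  return cusparseSpgemmAvailable(cudaVersion) ? 34 : 42;
}

// Compile-time checks of the version boundaries.
static_assert(!cusparseSpgemmAvailable(11020), "CUDA 11.2 falls in the excluded range");
static_assert(cusparseSpgemmAvailable(11040), "CUDA 11.4 is outside the excluded range");
static_assert(expectedDeepCopies(11020) == 42, "matches the count observed with cuda/11.2");
```

In a real test the version would come from the CUDA_VERSION macro at compile time, and the targets would have to be re-measured per backend, since the Hip build above reports a different target (19).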

ndellingwood commented 2 months ago

MueLu_UnitTestsTpetra_kokkos_MPI_1 and MueLu_UnitTestsTpetra_kokkos_MPI_4 have been passing since #13356, thanks @cwpearson