trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

MueLu: MueLu_UnitTestsTpetra_MPI_1, MueLu_UnitTestsTpetra_kokkos_MPI_{1,4} fail in Cuda builds with UVM #12865

Open ndellingwood opened 7 months ago

ndellingwood commented 7 months ago

Bug Report

@trilinos/muelu

Description

The MueLu_UnitTestsTpetra_MPI_1 and MueLu_UnitTestsTpetra_kokkos_MPI_{1,4} tests are failing a couple of checks in cuda/11.2 builds with UVM enabled:

MueLu_UnitTestsTpetra_MPI_1 failing checks:

1: The following tests FAILED:
1:     678. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ... 
1:     680. Regression_std_complex0double0_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ... 

MueLu_UnitTestsTpetra_MPI_1 more details:

Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest
...
1:  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 20 == 37 = 37 : FAILED ==> /home/ndellin/trilinos/Trilinos/packages/muelu/test/unit_tests/Regression.cpp:109
...

Regression_std_complex0double0_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest
...
1:  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 20 == 37 = 37 : FAILED ==> /home/ndellin/trilinos/Trilinos/packages/muelu/test/unit_tests/Regression.cpp:109
...

MueLu_UnitTestsTpetra_kokkos_MPI_1 failing checks:

59: The following tests FAILED:
59:     128. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ...
59:     129. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_Aggregration_UnitTest ...
59:     132. Regression_std_complex0double0_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ...
59:     133. Regression_std_complex0double0_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_Aggregration_UnitTest ...

MueLu_UnitTestsTpetra_kokkos_MPI_4 failing checks:

60: The following tests FAILED:
60:     129. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_Aggregration_UnitTest ...
60:     133. Regression_std_complex0double0_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_Aggregration_UnitTest ...

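All of the subtest failures seem related to Tpetra::Details::DeepCopyCounter::get_count_different_space() discrepancies: the deep-copy count observed at run time does not match the count the test expects (for example, 20 observed versus 37 expected at Regression.cpp:109).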
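To make the failing check concrete, here is a minimal, self-contained sketch of how a counter-based equality check can become configuration dependent. It is a toy model only: `ToyDeepCopyCounter`, `MemSpace`, and `count_different_space` are hypothetical names invented for illustration and are not Tpetra's implementation, and the copy pattern is made up.

```cpp
#include <cstddef>

// Hypothetical memory spaces for the toy model (not Kokkos types).
enum class MemSpace { Host, Device, Shared };

// Toy stand-in for a deep-copy counter; NOT Tpetra's implementation.
struct ToyDeepCopyCounter {
  bool counting = false;
  std::size_t different_space = 0;
  std::size_t same_space = 0;

  void start() { counting = true; }
  void stop() { counting = false; }
  void reset() { different_space = 0; same_space = 0; }

  // Record one copy, classified by whether source and destination differ.
  void record(MemSpace dst, MemSpace src) {
    if (!counting) return;
    if (dst == src) {
      ++same_space;
    } else {
      ++different_space;
    }
  }
};

// Made-up workload: 3 copies from a staging space into the data space,
// plus 2 copies from plain host memory into the data space.
inline std::size_t count_different_space(MemSpace data, MemSpace staging) {
  ToyDeepCopyCounter counter;
  counter.reset();
  counter.start();
  for (int i = 0; i < 3; ++i) counter.record(data, staging);
  for (int i = 0; i < 2; ++i) counter.record(data, MemSpace::Host);
  counter.stop();
  return counter.different_space;
}
```

In this toy, `count_different_space(MemSpace::Device, MemSpace::Host)` yields 5, while `count_different_space(MemSpace::Shared, MemSpace::Shared)` yields 2, so an expected value measured in one configuration would fail an equality check in the other. Whether this is what happens in the UVM builds here is not established by the logs.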

Steps to Reproduce

  1. SHA1: 70242e00d7dd5a4a44792e1670a4b9edb74269aa
  2. Configuration: Weaver rhel8 queue

Interactive node

bsub -Is -n 1 -q rhel8 -gpu "num=4" bash

Environment

export TRILINOS_DIR=
export KOKKOS_PATH=$TRILINOS_DIR/packages/kokkos
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX="$KOKKOS_PATH/bin/nvcc_wrapper"

Configure

cmake \
  -DCMAKE_CXX_FLAGS='-g' \
  -DCMAKE_CXX_STANDARD="17" \
  -DCMAKE_INSTALL_PREFIX=$PWD/install \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
  -DTrilinos_ENABLE_TESTS=OFF \
  -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
  -DTPL_ENABLE_CUSPARSE:BOOL=ON \
  \
  -D Trilinos_ENABLE_Kokkos=ON \
  -D Kokkos_ARCH_VOLTA70=ON \
  -D Kokkos_ARCH_POWER9=ON \
  -D Kokkos_ENABLE_CUDA=ON \
  -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
  -D Kokkos_ENABLE_CUDA_UVM=ON \
  -D KokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=ON \
  -D Tpetra_ALLOCATE_IN_SHARED_SPACE=ON \
  -D Trilinos_ENABLE_Sacado=ON \
  -D Trilinos_ENABLE_Phalanx=ON \
  -D Trilinos_ENABLE_Ifpack2=ON \
  -D Trilinos_ENABLE_MueLu=ON \
  -D MueLu_ENABLE_TESTS=ON \
  \
  $TRILINOS_DIR

github-actions[bot] commented 7 months ago

Automatic mention of the @trilinos/muelu team

ndellingwood commented 7 months ago

Updated to add MueLu_UnitTestsTpetra_kokkos_MPI_{1,4} tests; all the subtest failures seem related to Tpetra::Details::DeepCopyCounter::get_count_different_space() discrepancies

ndellingwood commented 7 months ago

Hi @cgcgcg, unfortunately I am still seeing some failures in Cuda w/UVM builds after the merge of #12866:

MueLu_UnitTestsTpetra_MPI_1

....
20:22:35  2 = 2 == H->GetGlobalNumLevels() = 2 : passed
20:22:35  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 20 == 37 = 37 : FAILED ==> /home/jenkins/weaver/workspace/KokkosEco_Trilinos_Weaver_CUDA112_opt-uvm/Trilinos/packages/muelu/test/unit_tests/Regression.cpp:109
20:22:35  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 2 == 2 = 2 : passed
...
20:22:35 The following tests FAILED:
20:22:35     170. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ... 

MueLu_UnitTestsTpetra_kokkos_MPI_1

...
20:22:45  2 = 2 == H->GetGlobalNumLevels() = 2 : passed
20:22:45  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 27 == targetNumDeepCopies = 31 : FAILED ==> /home/jenkins/weaver/workspace/KokkosEco_Trilinos_Weaver_CUDA112_opt-uvm/Trilinos/packages/muelu/test/unit_tests_kokkos/Regression.cpp:119
20:22:45  Tpetra::Details::DeepCopyCounter::get_count_different_space() = 2 == 2 = 2 : passed
...
20:22:46 The following tests FAILED:
20:22:46     32. Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest ...

cgcgcg commented 7 months ago

@ndellingwood Sorry about that. I only updated the counts for one of the regression tests. See #12874. However, it seems that we are still getting different deep_copy counts. This could either be a difference in how Trilinos is configured that's not being taken into account in the tests, or the new Kokkos release somehow uses fewer deep_copies.
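If the configuration turns out to be the cause, one option is to select the expected count from the build configuration instead of hard-coding a single value. The sketch below is only an illustration of that idea: `BuildConfig`, `ExpectedCounts`, and `target_num_deep_copies` are hypothetical names, and nothing here reflects how Regression.cpp or #12874 actually computes its target.

```cpp
// Hypothetical description of the build options relevant to the check.
struct BuildConfig {
  bool cuda_uvm;                  // e.g. Kokkos_ENABLE_CUDA_UVM=ON
  bool allocate_in_shared_space;  // e.g. Tpetra_ALLOCATE_IN_SHARED_SPACE=ON
};

// Expected cross-space deep-copy counts, one per kind of build.
// The caller supplies the values; none are assumed here.
struct ExpectedCounts {
  int default_build;
  int shared_space_build;
};

// Pick the expected count that matches the current configuration.
constexpr int target_num_deep_copies(BuildConfig cfg, ExpectedCounts expected) {
  return (cfg.cuda_uvm || cfg.allocate_in_shared_space)
             ? expected.shared_space_build
             : expected.default_build;
}
```

For instance, with the two values from the log above used purely as placeholders, `target_num_deep_copies(BuildConfig{true, true}, ExpectedCounts{37, 20})` selects 20 and `target_num_deep_copies(BuildConfig{false, false}, ExpectedCounts{37, 20})` selects 37. The correct per-configuration values would have to be measured, not inferred from a failing run.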

ndellingwood commented 6 months ago

@cgcgcg thanks for the update

This could either be a difference in how Trilinos is configured that's not being taken into account in the tests, or the new Kokkos release somehow uses fewer deep_copies.

The failures occurred with the existing (4.2.01) kokkos and kokkos-kernels packages in Trilinos; they were not unique to the release candidates.