trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Stokhos: Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 unit test failure in Cuda build #12427

Open ndellingwood opened 11 months ago

ndellingwood commented 11 months ago

Bug Report

@trilinos/stokhos @etphipp

Description

The Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 unit test fails in cuda/11.4.2 builds with the following output:

...
317: Cuda Runtime Configuration:
317: macro  KOKKOS_ENABLE_CUDA      : defined
317: macro  CUDA_VERSION          = 11040 = version 11.4
317: Kokkos::Cuda[ 0 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K : Selected
317: Kokkos::Cuda[ 1 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K
317:   Kokkos Version: 4.1.0
317: Compiler:
317:   KOKKOS_COMPILER_GNU: 850
317:   KOKKOS_COMPILER_NVCC: 1140
317: Architecture:
317:   CPU architecture: none
317:   Default Device: N6Kokkos4CudaE
317:   GPU architecture: VOLTA70
317:   platform: 64bit
317: Atomics:
317: Vectorization:
317:   KOKKOS_ENABLE_PRAGMA_IVDEP: no
317:   KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
317:   KOKKOS_ENABLE_PRAGMA_UNROLL: no
317:   KOKKOS_ENABLE_PRAGMA_VECTOR: no
317: Memory:
317:   KOKKOS_ENABLE_HBWSPACE: no
317:   KOKKOS_ENABLE_INTEL_MM_ALLOC: no
317: Options:
317:   KOKKOS_ENABLE_ASM: yes
317:   KOKKOS_ENABLE_CXX17: yes
317:   KOKKOS_ENABLE_CXX20: no
317:   KOKKOS_ENABLE_CXX23: no
317:   KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK: no
317:   KOKKOS_ENABLE_HWLOC: no
317:   KOKKOS_ENABLE_LIBDL: yes
317:   KOKKOS_ENABLE_LIBRT: no
317: Host Serial Execution Space:
317:   KOKKOS_ENABLE_SERIAL: yes
317: 
317: Serial Runtime Configuration:
317: Device Execution Space:
317:   KOKKOS_ENABLE_CUDA: yes
317: Cuda Options:
317:   KOKKOS_ENABLE_CUDA_LAMBDA: yes
317:   KOKKOS_ENABLE_CUDA_LDG_INTRINSIC: yes
317:   KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
317:   KOKKOS_ENABLE_CUDA_UVM: no
317:   KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA: yes
317: 
317: Cuda Runtime Configuration:
317: macro  KOKKOS_ENABLE_CUDA      : defined
317: macro  CUDA_VERSION          = 11040 = version 11.4
317: Kokkos::Cuda[ 0 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K : Selected
317: Kokkos::Cuda[ 1 ] Tesla V100-PCIE-16GB capability 7.0, Total Global Memory: 15.77 G, Shared Memory per Block: 48 K
317: 1. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_VectorDot_UnitTest ... [Passed] (0.00822 sec)
317: 2. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_MultiVectorAdd_UnitTest ... [Passed] (0.0094 sec)
317: 3. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_MultiVectorDot_UnitTest ... [Passed] (0.00905 sec)
317: 4. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_MultiVectorDotSub_UnitTest ... [Passed] (0.00836 sec)
317: 5. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_MatrixVectorMultiply_UnitTest ... [Passed] (0.0266 sec)
317: 6. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_MatrixMultiVectorMultiply_UnitTest ... [Passed] (0.0231 sec)
317: 7. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_Flatten_UnitTest ... [Passed] (0.409 sec)
317: 8. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_SimpleCG_UnitTest ... [Passed] (0.439 sec)
317: 9. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_SimplePCG_Muelu_UnitTest ... [Passed] (1.75e-06 sec)
317: 10. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_BelosGMRES_UnitTest ... [Passed] (0.872 sec)
317: 11. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_BelosGMRES_RILUK_UnitTest ... [Passed] (0.619 sec)
317: 12. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_BelosCG_Muelu_UnitTest ... [Passed] (1.74e-06 sec)
317: 13. Tpetra_CrsMatrix_PCE_DS_default_local_ordinal_type_default_global_ordinal_type_CudaWrapperNode_Amesos2_UnitTest ... [kokkos-dev-2:238548] *** An error occurred in MPI_Allreduce
317: [kokkos-dev-2:238548] *** reported by process [3896508417,2]
317: [kokkos-dev-2:238548] *** on communicator MPI_COMM_WORLD
317: [kokkos-dev-2:238548] *** MPI_ERR_TRUNCATE: message truncated
317: [kokkos-dev-2:238548] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
317: [kokkos-dev-2:238548] ***    and potentially your MPI job)
317: [kokkos-dev-2:238542] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
317: [kokkos-dev-2:238542] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
1/1 Test #317: Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 ...***Failed  Required regular expression not found. Regex=[End Result: TEST PASSED
]  6.52 sec

0% tests passed, 1 tests failed out of 1

Subproject Time Summary:
Stokhos    =  26.09 sec*proc (1 test)

Total Test time (real) =   6.58 sec

The following tests FAILED:
    317 - Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4 (Failed)

Steps to Reproduce

  1. SHA1: 22445c8febb4ba87bb6f2819e3b0ea8f4c79599b
  2. Configuration (sems modules)
    
    module purge
    module load sems-gcc/8.5.0 sems-cuda/11.4.2 sems-openmpi/4.1.4 sems-cmake sems-git sems-ninja
    KOKKOS_DIR=$TRILINOS_DIR/packages//kokkos
    export OMPI_CXX=$KOKKOS_DIR/bin/nvcc_wrapper

    cmake \
      -GNinja \
      -D CMAKE_INSTALL_PREFIX="${TRILINOS_INSTALL_DIR}" \
      -D CMAKE_BUILD_TYPE:STRING=RELEASE \
      -D BUILD_SHARED_LIBS:BOOL=OFF \
      -DTPL_ENABLE_MPI=ON \
      -DTPL_ENABLE_BLAS:BOOL=ON \
      -DTPL_ENABLE_LAPACK:BOOL=ON \
      -DTPL_ENABLE_CUSPARSE:BOOL=ON \
      -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
      -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
      -DTrilinos_ENABLE_TESTS=OFF \
      -DTrilinos_MUST_FIND_ALL_TPL_LIBS=TRUE \
      -DTrilinos_ENABLE_COMPLEX=ON \
      -DTrilinos_ENABLE_OpenMP=OFF \
      -DTrilinos_ENABLE_Kokkos=ON \
      -D Kokkos_ENABLE_SERIAL=ON \
      -D Kokkos_ENABLE_CUDA=ON \
      -D Kokkos_ARCH_VOLTA70=ON \
      -DTrilinos_ENABLE_KokkosKernels=ON \
      -DTrilinos_ENABLE_Tpetra=ON \
      -D Tpetra_ENABLE_TESTS=ON \
      -DTrilinos_ENABLE_Sacado=ON \
      -DTrilinos_ENABLE_Stokhos=ON \
      -D Stokhos_ENABLE_TESTS=ON \
      -DTrilinos_ENABLE_Ifpack2=ON \
      -DTrilinos_ENABLE_Amesos2=ON \
      -DTPL_ENABLE_Matio=OFF \
      $TRILINOS_DIR
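The report does not spell out the build-and-run step, so the following is only a sketch of how the single failing test might be rerun after configuring. The helper name `rerun_failing_test` and its build-directory argument are hypothetical. It assumes the build directory is the one where the `cmake` command above was run with the Ninja generator.

```shell
# Hypothetical helper (not part of the original report): build, then rerun
# only the failing Stokhos test from an already-configured build directory.
rerun_failing_test() {
  build_dir="$1"
  (
    # The subshell keeps the caller's working directory unchanged.
    cd "$build_dir" || exit 1
    ninja || exit 1
    # -R selects tests by regex; the anchors restrict the match to this one test.
    ctest -R '^Stokhos_TpetraCrsMatrixUQPCEUnitTest_Cuda_MPI_4$' --output-on-failure
  )
}
```

It could be invoked as `rerun_failing_test /path/to/build`, where the path is a placeholder for the actual build directory. The function returns non-zero if the directory is missing, the build fails, or the test fails.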

etphipp commented 10 months ago

Thanks for reporting. I am strongly considering getting rid of the PCE scalar type, which this test uses. It would make everyone's life much easier!

brian-kelley commented 6 months ago

While testing #12818, I saw this failure on the develop branch, and the MPVector counterpart (Stokhos_TpetraCrsMatrixMPVectorUnitTest_Cuda_MPI_4) fails as well. I used the same environment and configuration that @ndellingwood used here.