kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

Nightly Trilinos failure with Cuda/11.2.2 non-UVM builds, MueLu, Panzer unit tests #2015

Closed: ndellingwood closed this issue 5 months ago

ndellingwood commented 11 months ago

Nightly cuda/11.2.2 builds (no UVM) are failing in the following unit tests with kokkos-kernels@develop:

03:05:58 The following tests FAILED:
03:05:58    1784 - MueLu_UnitTestsBlockedTpetra_MPI_1 (Failed)
03:05:58    1785 - MueLu_UnitTestsBlockedTpetra_MPI_4 (Failed)
03:05:58    1838 - MueLu_MeshTyingBlocked_SimpleSmoother_MPI_4 (Failed)
03:05:58    1841 - MueLu_MeshTyingBlocked_SimpleSmoother_2dof_medium_MPI_4 (Failed)
03:05:58    2286 - PanzerAdaptersSTK_tDOFManager2_SimpleTests_MPI_4 (Failed)
03:05:58    2371 - PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_reuse_MPI_4 (Failed)
03:05:58    2372 - PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell2D_MPI_4 (Failed)
03:05:58    2373 - PanzerMiniEM_MiniEM-BlockPrec_MueLu_highOrder_0_MPI_4 (Failed)

https://jenkins-son.sandia.gov/job/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/257

The PanzerMiniEM_MiniEM-BlockPrec_MueLu_highOrder_0_MPI_4 failure was previously reported in #2010 and also occurs with release-candidate-4.2.00. The other tests began failing after the merge of the following commits:

Sparse: fix cusparse spgemm hang properly
Sparse: fix logic for bad cursparse spgemm version.
Improvements on the unification attempt logic for axpby(), including new tests
Addressing feedbacks from Luc, plus some small changes here and there:
Formatting
Using 'ifdef HAVE_KOKKOSKERNELS_DEBUG', per Luc's suggestion
Addressing feedbacks from Luc
Correcting compilation errors in my Mac
Backup
CUDA 11.0.1 / cuSPARSE 11.0.0 changed SpMM enums
CUDA 11.2.1 / cuSPARSE 11.4.0 changed SpMV

Reproducer (weaver rhel8):

# Repos
git clone -b kokkos-promotion https://github.com/trilinos/Trilinos.git
git clone -b develop https://github.com/kokkos/kokkos.git
git clone -b develop https://github.com/kokkos/kokkos-kernels.git

# Symbolic link to external kokkos and kokkos-kernels repos in Trilinos source directory for source override
cd Trilinos
ln -s <path-to-your-repo>/kokkos kokkos
ln -s <path-to-your-repo>/kokkos-kernels kokkos-kernels
cd ..

# Create build and local tmp directories
mkdir -p build
cd build

export TEMPDIR=$PWD/tmp_cuda
export TMPDIR=$TEMPDIR
mkdir -p $TMPDIR

# Interactive node
bsub -Is -n 1 -q rhel8 -gpu "num=4" bash

# Environment setup
source /etc/profile.d/modules.sh
module purge
source /projects/ppc64le-pwr9-rhel8/legacy-env.sh

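# TRILINOS_DIR and KOKKOS_DIR are not set anywhere above; point them at the checkouts cloned earlier
export TRILINOS_DIR=<path-to-your-repo>/Trilinos
export KOKKOS_DIR=<path-to-your-repo>/kokkos
export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver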
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX="$KOKKOS_DIR/bin/nvcc_wrapper"

# Cmake config
cmake \
      -D CMAKE_CXX_FLAGS='-g' \
      -D CMAKE_CXX_STANDARD="17" \
      -D CMAKE_INSTALL_PREFIX=$PWD/install \
      -D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
      -DTrilinos_ENABLE_TESTS=OFF \
      -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
      -D Trilinos_ENABLE_Kokkos=ON \
      -D Kokkos_ARCH_VOLTA70=ON \
      -D Kokkos_ARCH_POWER9=ON \
      -D Kokkos_ENABLE_CUDA=ON \
      -D Kokkos_ENABLE_CUDA_LAMBDA=ON \
      -D Kokkos_ENABLE_CUDA_UVM=OFF \
      -D Kokkos_ENABLE_IMPL_CUDA_MALLOC_ASYNC=OFF \
      -DTrilinos_ENABLE_Stokhos=ON \
      -D TPL_ENABLE_CUSPARSE:BOOL=ON \
      -DTrilinos_ENABLE_COMPLEX_DOUBLE=ON \
      -DTrilinos_ENABLE_MueLu=ON \
      -D MueLu_ENABLE_TESTS=ON \
      -DTrilinos_ENABLE_Panzer=ON \
      -D Panzer_ENABLE_TESTS=ON \
      -D Panzer_ENABLE_EXAMPLES=ON \
      -DKokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \
      -DKokkosKernels_SOURCE_DIR_OVERRIDE:STRING=kokkos-kernels \
$TRILINOS_DIR

# Build
make -j16

# Failing test
ctest
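
A bare `ctest` runs every enabled MueLu and Panzer test. To rerun only the tests from the failure list, a name filter can be passed with `ctest -R`. The sketch below is not part of the original reproducer: the names `failing_tests`, `regex`, and `rerun_failing` are made up for illustration, and it assumes the helper is called from the build directory.

```shell
# Test names copied from the nightly failure list in this issue.
failing_tests="MueLu_UnitTestsBlockedTpetra_MPI_1
MueLu_UnitTestsBlockedTpetra_MPI_4
MueLu_MeshTyingBlocked_SimpleSmoother_MPI_4
MueLu_MeshTyingBlocked_SimpleSmoother_2dof_medium_MPI_4
PanzerAdaptersSTK_tDOFManager2_SimpleTests_MPI_4
PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell_reuse_MPI_4
PanzerMiniEM_MiniEM-BlockPrec_RefMaxwell2D_MPI_4
PanzerMiniEM_MiniEM-BlockPrec_MueLu_highOrder_0_MPI_4"

# Join the names with '|' and anchor the pattern so only exact names match.
regex="^($(printf '%s\n' "$failing_tests" | paste -sd '|' -))\$"

# Hypothetical helper: rerun just these tests and show output for any that fail.
rerun_failing() {
  ctest -R "$regex" --output-on-failure
}
```

Calling `rerun_failing` after `make -j16` should then limit the run to the eight tests listed above.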

lucbv commented 11 months ago

"Blocked" seems to be a theme in the failing unit tests, but I'm not sure these are the small blocks of a BsrMatrix.

ndellingwood commented 11 months ago

PRs corresponding to the commit list:

@lucbv since the failures are block-related, I'll start triage with a revert of #2008 to see the impact on the tests. MueLu builds take a while with CUDA, so it'll be a while before I have the breaking change pinpointed.

ndellingwood commented 11 months ago

@lucbv a revert of #2008 did not resolve the MueLu failures. Rebuilding with a revert of #2011 to retest.

lucbv commented 11 months ago

Okay, I'm glad #2008 did not cause the issues, but unfortunately that means it will take you a little longer to get to the bottom of it. Except for #1895, the other 3 PRs are fairly light in terms of changes, so if one of them triggers the problem it should still be easy to fix.

ndellingwood commented 11 months ago

Reverting #2011 and #2012 did not help with the MueLu tests; they still failed. Rebuilding with a revert of #1895.

ndellingwood commented 11 months ago

Reverting #1895 returned the MueLu tests to passing.
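
For anyone repeating this kind of triage, one revert-and-retest step could be sketched as below. This is an illustration rather than the exact commands used here: it assumes each suspect PR landed on `develop` as a merge commit, and the helper name `try_revert` and the `triage-revert-` branch prefix are hypothetical. The merge commit hash is a placeholder to be looked up in the kokkos-kernels history.

```shell
# Hypothetical helper: revert one PR's merge commit on a scratch branch.
# $1 = hash of the merge commit for the PR under suspicion (placeholder, not from this issue)
try_revert() {
  merge_sha="$1"
  # Start a scratch branch from develop so develop itself stays untouched.
  git checkout -q -b "triage-revert-${merge_sha}" develop || return 1
  # -m 1 treats the first parent (the develop side of the merge) as the mainline.
  git revert --no-edit -m 1 "$merge_sha"
}
```

Run from the kokkos-kernels checkout that Trilinos picks up through the symbolic link, this would be followed by a rebuild (`make -j16`) and a rerun of the failing tests.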

ndellingwood commented 10 months ago

Addressed by #2039, thanks @eeprude!