trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Tpetra: Sycl test failures on Ponte Vecchio #12295

Open ndellingwood opened 9 months ago

ndellingwood commented 9 months ago

Bug Report

@trilinos/tpetra

Description

I tested a SYCL configuration on the new Blake Ponte Vecchio GPUs. With the updates from Daniel's PR #12294, the following tests failed with segfaults:

The following tests FAILED:
  127 - TpetraCore_TpetraUtils_WrappedDualView (SEGFAULT)
  139 - TpetraCore_getEntryOnHost (SEGFAULT)
  157 - TpetraCore_BlockCrsPerfTest (SEGFAULT)
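
To rerun only these three tests with their full output, something like the sketch below could be used from the top of the build tree. The `failed_tests` variable and the guard around the `ctest` call are illustrative and not part of the original report; `-R` and `--output-on-failure` are standard `ctest` options.

```shell
# Regex covering the three failing tests listed above (illustrative variable name).
failed_tests='TpetraCore_(TpetraUtils_WrappedDualView|getEntryOnHost|BlockCrsPerfTest)'

# Assumption: the current directory is the top of the Trilinos build tree.
# The call is skipped when ctest is missing or no tests have been generated.
if command -v ctest >/dev/null 2>&1 && [ -f CTestTestfile.cmake ]; then
  ctest -R "$failed_tests" --output-on-failure
fi
```

Since `-R` does a regex search on test names, any test whose name contains one of these strings would also be selected.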

Steps to Reproduce

Use the changes from #12294.

Configuration (New) Blake PV queue:

# Interactive node
salloc -N 1 -p PV

# Environment
module load cmake intel-oneapi-compilers/2023.1.0 intel-oneapi-dpl/2022.1.0 git intel-oneapi-mkl/2023.1.0

# Configuration
cmake \
  -D CMAKE_CXX_COMPILER="/projects/x86-64-icelake-rocky8/compilers/intel-oneapi-compilers/2023.1.0/gcc/8.5.0/base/6g2jkiv/compiler/2023.1.0/linux/bin-llvm/clang++" \
  -D CMAKE_C_COMPILER="/projects/x86-64-icelake-rocky8/compilers/intel-oneapi-compilers/2023.1.0/gcc/8.5.0/base/6g2jkiv/compiler/2023.1.0/linux/bin-llvm/clang" \
  -D CMAKE_Fortran_COMPILER="`which gfortran`" \
  -D CMAKE_CXX_FLAGS="-g -fp-model=precise" \
  -D CMAKE_C_FLAGS="-g" \
  -D BUILD_SHARED_LIBS=ON \
  -DTPL_ENABLE_MPI=OFF \
  -DTPL_ENABLE_BLAS:BOOL=ON \
   -DBLAS_LIBRARY_DIRS=$MKLROOT/lib/intel64 \
   -DBLAS_LIBRARY_NAMES=mkl_rt \
  -DTPL_ENABLE_LAPACK:BOOL=ON \
   -DLAPACK_LIBRARY_DIRS=$MKLROOT/lib/intel64 \
   -DLAPACK_LIBRARY_NAMES=mkl_rt \
  -DTPL_ENABLE_MKL:BOOL=ON \
   -DMKL_INCLUDE_DIRS=$MKLROOT/include \
   -DMKL_LIBRARY_DIRS=$MKLROOT/lib/intel64 \
   -DMKL_LIBRARY_NAMES=mkl_rt \
  -DTrilinos_ENABLE_ALL_PACKAGES=OFF \
  -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
  -DTrilinos_ENABLE_TESTS=ON \
  -DTrilinos_MUST_FIND_ALL_TPL_LIBS=TRUE \
  -DTrilinos_ENABLE_OpenMP=OFF \
  -DTrilinos_ENABLE_Kokkos=ON \
  -D Kokkos_ENABLE_SYCL=ON \
   -D Kokkos_ENABLE_TESTS=OFF \
   -D Kokkos_ENABLE_ONEDPL=OFF \
  -D Kokkos_ARCH_INTEL_PVC=ON \
  -DTrilinos_ENABLE_KokkosKernels=ON \
   -D KokkosKernels_ENABLE_TESTS=OFF \
  -DTrilinos_ENABLE_Tpetra=ON \
  -D Tpetra_INST_SYCL=ON \
  -D Tpetra_INST_SERIAL=ON \
   -D Tpetra_ENABLE_TESTS=ON \
\
  -DTPL_ENABLE_Matio=OFF \
\
$TRILINOS_DIR

csiefer2 commented 9 months ago

@ndellingwood Blake compilers won't build squat. Both 2023.1 and 2023.2 are missing ocloc and LevelZero. @fryeguy52 is fixing the compilers. Will try to reproduce once he's done.

ndellingwood commented 9 months ago

Thanks for the info @csiefer2, and sorry for any added noise with this issue.

masterleinad commented 8 months ago

@ndellingwood Feel free to add me to SYCL issues in Trilinos.

ndellingwood commented 8 months ago

@masterleinad sure thing. I'm putting in a printf fix shortly (just in case you're standing up a build and run into it). Regarding this issue, I should also point out a mistaken assumption in my configuration: I assumed Tpetra would enable SYCL based on Kokkos_ENABLE_SYCL=ON, but the configure output shows that I needed to enable SYCL for Tpetra explicitly. Rebuilding for a retest.
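
A quick way to catch this kind of mismatch before building is to compare the two options in the CMake cache. The sketch below is only an illustration: `check_tpetra_sycl` is a hypothetical helper, not something shipped with Trilinos, and it assumes the options were set to the literal value `ON` (not `TRUE` or `1`) and that `CMakeCache.txt` sits in the build directory.

```shell
# Hypothetical helper: report when Kokkos has SYCL enabled but Tpetra does not.
# Assumes both options, if set, carry the literal value ON in the CMake cache.
check_tpetra_sycl() {
  cache="${1:-CMakeCache.txt}"
  # Cache entries look like NAME:TYPE=VALUE; keep only the value.
  kokkos=$(grep -E '^Kokkos_ENABLE_SYCL(:[A-Z]+)?=' "$cache" | tail -n 1 | cut -d= -f2)
  tpetra=$(grep -E '^Tpetra_INST_SYCL(:[A-Z]+)?=' "$cache" | tail -n 1 | cut -d= -f2)
  echo "Kokkos_ENABLE_SYCL=${kokkos:-unset} Tpetra_INST_SYCL=${tpetra:-unset}"
  if [ "$kokkos" = "ON" ] && [ "$tpetra" != "ON" ]; then
    echo "mismatch: SYCL is enabled in Kokkos but not instantiated in Tpetra" >&2
    return 1
  fi
  return 0
}
```

Running `check_tpetra_sycl path/to/build/CMakeCache.txt` after configuring prints both values and returns a non-zero status when only Kokkos has SYCL enabled.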

ndellingwood commented 8 months ago

With local changes in PR #12471 and setting Tpetra_INST_SYCL=ON, this is the set of test failures:

The following tests FAILED:
     19 - TpetraCore_BlockCrsMatrix (Failed)
     82 - TpetraCore_ImportExport2_UnitTests_Send (Failed)
     83 - TpetraCore_ImportExport2_UnitTests_ISend (Failed)
     84 - TpetraCore_ImportExport2_UnitTests_Alltoall (Failed)
    140 - TpetraCore_getEntryOnHost (Failed)
Errors while running CTest

masterleinad commented 8 months ago

All of these tests are passing for me on the Intel testbeds.

ndellingwood commented 8 months ago

@masterleinad which version of intel/oneapi and which architecture did you test?

masterleinad commented 8 months ago

> @masterleinad which version of intel/oneapi and which architecture did you test?

oneapi/eng-compiler/2023.10.15.002 with Kokkos_ENABLE_SERIAL=ON, Kokkos_ENABLE_SYCL=ON and Kokkos_ARCH_INTEL_PVC=ON.

masterleinad commented 8 months ago

That compiler is tagged as 2024.0.0.

ndellingwood commented 8 months ago

@masterleinad did you add Tpetra_INST_SYCL=ON explicitly? If not, can you look over the configure output to confirm that SYCL was enabled for Tpetra?

For reference, I initially had not set that and had this warning in the configure output:

-- NOTE: Kokkos::SYCL is ON (the CMake option Kokkos_ENABLE_SYCL is ON), but the corresponding Tpetra Node type is disabled.  If you want to enable instantiation and use of Kokkos::SYCL in Tpetra, please also set the CMake option Tpetra_INST_SYCL:BOOL=ON.  If you use the Kokkos::SYCL version of Tpetra without doing this, you will get link errors!
-- Determine whether Tpetra will assume that MPI is GPU aware:
--   - Tpetra_INST_CUDA, Tpetra_INST_HIP and Tpetra_INST_SYCL atre OFF, so Tpetra will assume that MPI is not GPU aware.
-- Tpetra execution space availability (ON means available): 
--   - Serial:  ON 
--   - Threads: OFF
--   - OpenMP:  OFF
--   - Cuda:    OFF
--   - HIP:     OFF
--   - SYCL:    OFF

masterleinad commented 8 months ago

> @masterleinad did you add Tpetra_INST_SYCL=ON explicitly? If not, can you look over the configure output to confirm that SYCL was enabled for Tpetra?

Yes, it was set and I am seeing

[...]
-- Tpetra: Using internal Kokkos
-- Tpetra: Enabling deprecated code
-- Determine whether Tpetra will assume that MPI is GPU aware:
--   - TPL_ENABLE_MPI is OFF, so we assume that (nonexistent) MPI is not GPU aware.
-- Tpetra execution space availability (ON means available): 
--   - Serial:  ON
--   - Threads: OFF
--   - OpenMP:  OFF
--   - Cuda:    OFF
--   - HIP:     OFF
--   - SYCL:    ON
-- Tpetra: Tpetra_INST_INT_LONG_LONG is enabled by default.
-- Tpetra: Tpetra_INST_INT_UNSIGNED is disabled by default.
-- Tpetra: Tpetra_INST_INT_UNSIGNED_LONG is disabled by default.
-- Tpetra: Tpetra_INST_INT_INT is disabled by default.
-- Tpetra: Tpetra_INST_INT_LONG is disabled by default.
-- 
-- Tpetra: Validate global ordinal setting ...
-- Tpetra: global ordinal setting is OK
[...]

ndellingwood commented 8 months ago

@masterleinad thanks! Can you post your configuration as well? I'd like to compare them to see whether I misconfigured something but still happened to get a complete build.

masterleinad commented 8 months ago

I tried again with the configuration posted in the issue description (https://github.com/trilinos/Trilinos/issues/12295#issue-1905712874) and see

TpetraCore_TpetraUtils_WrappedDualView (Failed)
TpetraCore_getEntryOnHost (Failed)

with MKL and see

TpetraCore_CrsMatrix_2DRandomDist

timing out. Previously (https://github.com/trilinos/Trilinos/issues/12295#issuecomment-1791383088) when I saw all tests passing, I was also pulling in Kokkos develop.