Closed: ndellingwood closed this issue 6 years ago.
Does it change based on the Cuda version? We have nightlies on White that don't have this problem.
@srajama1 I'll check with cuda/9.0 and hop over to the pascal queue and do some testing there.
@srajama1 same error using cuda/9.0.176 + gcc/6.4.0 (gcc/7.2.0 is not compatible with our cuda/9.0.x compilers).
[ RUN ] cuda.gemm_double
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.
[ FAILED ] cuda.gemm_double (110 ms)
[ RUN ] cuda.sparse_spgemm_double_int_int_TestExecSpace
Segmentation fault (core dumped)
@srajama1 The description:
C++ exception with description "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
possibly suggests that a team_size is hard-coded too large for the hardware. Previously an oversized team_size would be silently adjusted down to the maximum possible team_size; with deprecated code disabled, that behavior is no longer allowed. However, I did not pass a flag to disable deprecated code, so maybe the switch was flipped in the kokkos develop branch to disable deprecated code by default. If so I can start looking at that as the cause, though I thought I caught all these cases in KokkosKernels a while back...
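To make the suspected behavior change concrete, here is a small standalone model (my own sketch, not the Kokkos source) of the two behaviors: silently clamping an oversized team_size versus throwing once deprecated code is disabled:

```cpp
#include <stdexcept>

// Hypothetical model of the behavior described above (not the Kokkos
// implementation): with deprecated code enabled an oversized team_size is
// silently reduced to the device maximum; with it disabled, Kokkos throws.
int resolve_team_size(int requested, int device_max, bool deprecated_code_enabled) {
  if (requested <= device_max) return requested;
  if (deprecated_code_enabled) return device_max;  // old: silent adjustment
  throw std::runtime_error(
      "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.");
}
```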
Reference output using the kernel-logger from kokkos-tools to help pinpoint where the error occurs:
KokkosP: Entering profiling region: KokkosBlas::gemm[ETI]
KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 18
KokkosP: KokkosBlas::Test::gemm_double
KokkosP: KokkosBlas::gemm[ETI]
KokkosP: KokkosBlas::gemm[NN]
KokkosP: Deallocate<Cuda> name: Kokkos::Random_XorShift64::state pointer: 0x8304820080 size: 1048576
KokkosP: Deallocate<Cuda> name: Kokkos::Random_XorShift64::locks pointer: 0x8304720080 size: 524288
KokkosP: Deallocate<Cuda> name: C pointer: 0x8304522880 size: 1560
KokkosP: Deallocate<Cuda> name: C pointer: 0x8304522080 size: 1560
KokkosP: Deallocate<Cuda> name: B pointer: 0x8304521680 size: 2040
KokkosP: Deallocate<Cuda> name: A pointer: 0x8304520e80 size: 1768
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.
[ FAILED ] cuda.gemm_double (9 ms)
Edit: More reference notes
KokkosP: KokkosBlas::gemm[NN]
may indicate this is dying somewhere in src/blas/impl/KokkosBlas3_gemm_impl.hpp.
The unit-test impl code itself uses parallel_* patterns with a team_policy that uses Kokkos::AUTO for the team_size, so in this location my guess about behavior changes with team_size and disabled deprecated code is wrong.
Update:
In src/blas/impl/KokkosBlas3_gemm_impl.hpp there is a GEMMImpl<...>::run(...) method that takes team_size as a runtime argument and then uses it to construct a team_policy.
src/blas/impl/KokkosBlas3_gemm_spec.hpp contains GEMM<...>::gemm(...), which calls the above run method; the following info is used to set the team_size:
line 138:
static constexpr int blockA0 = 24;
...
line 154:
// Figure out Team Sizes
int team_size = 1;
#if defined(KOKKOS_ENABLE_CUDA)
if(std::is_same<typename CViewType::execution_space,Kokkos::Cuda>::value)
team_size = blockA0;
#endif
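Putting the snippets together: with team_size = blockA0 = 24 and the vector_length = 16 that appears in the cuda-gdb trace below, the launch asks for 24*16 = 384 threads per block, which matches the Kokkos::LaunchBounds<384,2> on the policy. A trivial standalone check of that arithmetic (values copied from this issue, nothing else assumed):

```cpp
#include <cassert>

// Values copied from the snippets in this issue.
constexpr int blockA0 = 24;        // hard-coded in KokkosBlas3_gemm_spec.hpp
constexpr int vector_length = 16;  // from the cuda-gdb trace
constexpr int threads_per_block = blockA0 * vector_length;
static_assert(threads_per_block == 384,
              "matches the 384 in Kokkos::LaunchBounds<384,2>");
```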
@srajama1 @crtrott does this algorithm require that the team_size be hard coded to a particular value, or is this value chosen based on performance? I don't want to blindly replace the team size with Kokkos::AUTO
without knowing better how it is used by the algorithm.
Edit: TeamPolicy construction plus parallel_for call in run of src/blas/impl/KokkosBlas3_gemm_impl.hpp, beginning at line 454:
Kokkos::TeamPolicy<ExecSpace,Kokkos::LaunchBounds<384,2>> policy(num_blocks_0*num_blocks_1,team_size,vector_length);
Kokkos::parallel_for(impl_gemm_label<TransposeA,TransposeB>::label,policy.set_scratch_size(scratch_level,Kokkos::PerTeam(scratch_memory_size)),*this);
This is a verbose message, but for reference here is the step info from cuda-gdb, from the last point in GEMMImpl until the code fails:
(cuda-gdb)
Kokkos::Impl::CudaGetMaxBlockSize<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u>, false>::get_block_size (shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Internal.hpp:176
176 sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
(cuda-gdb)
181 cudaOccupancyMaxActiveBlocksPerMultiprocessor(
(cuda-gdb)
cudaOccupancyMaxActiveBlocksPerMultiprocessor<void (*)(Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>)> (dynamicSMemSize=23552,
blockSize=256,
func=0x10349b40 <Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, 384u, 2u>(Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>)>, numBlocks=0x7fffffffc4f8)
at /home/projects/ppc64le-pwr8-nvidia/cuda/9.0.176/include/cuda_runtime.h:1411
1411 return ::cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags(numBlocks, (const void*)func, blockSize, dynamicSMemSize, cudaOccupancyDefault);
(cuda-gdb)
Kokkos::Impl::CudaGetMaxBlockSize<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u>, false>::get_block_size (shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Internal.hpp:174
174 while (blockSize*2<=int(MaxThreadsPerBlock) && numBlocks>0) {
(cuda-gdb)
KokkosBlas::Impl::GEMM<Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, false, true>::gemm (transA=0x104509f0 <._567+2164> "NN",
transB=0x104509f1 <._567+2165> "N", alpha=@0x7fffffffce50: 5, A=..., B=..., beta=@0x7fffffffce58: 3, C=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>::run (scr_level=0,
vector_length=16, team_size=24, this=0x7fffffffc378) at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (functor=..., policy=..., str=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (str=..., functor=..., policy=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>::ParallelFor (arg_policy=..., arg_functor=..., this=0x7fffffffc2b0)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::cuda_get_max_block_size<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u> > (
shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::CudaGetMaxBlockSize<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u>, false>::get_block_size (shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Internal.hpp:187
187 if( numBlocks >= int(MinBlocksPerSM) && blockSize <= int(MaxThreadsPerBlock) ) return blockSize;
(cuda-gdb)
Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>::ParallelFor (arg_policy=..., arg_functor=..., this=0x7fffffffc2b0)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Parallel.hpp:729
729 if ( int(m_team_size) >
(cuda-gdb)
KokkosBlas::Impl::GEMM<Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, false, true>::gemm (transA=0x104509f0 <._567+2164> "NN",
transB=0x104509f1 <._567+2165> "N", alpha=@0x7fffffffce50: 5, A=..., B=..., beta=@0x7fffffffce58: 3, C=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>::run (scr_level=0,
vector_length=16, team_size=24, this=0x7fffffffc378) at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (functor=..., policy=..., str=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (str=..., functor=..., policy=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>::ParallelFor (arg_policy=..., arg_functor=..., this=0x7fffffffc2b0)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Parallel.hpp:732
732 Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelFor< Cuda > requested too large team size."));
There seem to be two errors: the spgemm segfault (is this even a problem?) and the gemm team size. Let us look at this together.
@srajama1 the spgemm segfault was an artifact of the gemm test failure; probably the GPU memory was corrupted:
bash-4.2$ ./KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.sparse_spgemm_double_int_int_TestExecSpace
Note: Google Test filter = cuda.sparse_spgemm_double_int_int_TestExecSpace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN ] cuda.sparse_spgemm_double_int_int_TestExecSpace
[ OK ] cuda.sparse_spgemm_double_int_int_TestExecSpace (2792 ms)
[----------] 1 test from cuda (2792 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (2792 ms total)
[ PASSED ] 1 test.
So we can just focus on the gemm test.
@ndellingwood: Thanks for checking. @crtrott: What is with this team size of 24?
I think it was an optimization choice. Need to check what valid options are.
FWIW if I replace the team_size with Kokkos::AUTO in the team policy used by GEMMImpl<...>::run(team_size,vector_length,scr_level), the unit test passes.
@crtrott why is the team_size hard-coded to 24 at line 138 of src/blas/impl/KokkosBlas3_gemm_spec.hpp? I'm assuming that, prior to disabling deprecated code in Kokkos, this worked because the team_size was silently reduced to the smaller maximum value that was allowed.
Is it okay if I use Kokkos::AUTO as above as the fix?
Alternatively, we could do a check similar to Kokkos' check during construction of the ParallelFor and reset the team_size if it is too large.
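That alternative could look something like the following sketch (hypothetical code, not the KokkosKernels source; I'm assuming hardware_max would come from a Kokkos query such as the policy's maximum team size for the functor):

```cpp
#include <algorithm>

// Hypothetical "check and reset" helper: clamp the requested team_size to
// the maximum the occupancy calculation allows, instead of letting the
// ParallelFor constructor throw.
int checked_team_size(int requested, int hardware_max) {
  return std::min(requested, hardware_max);
}
```

For the failing case above this would turn the requested 24 into the 16 that Kokkos currently reports as the maximum, rather than throwing.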
For reference, from src/blas/impl/KokkosBlas3_gemm_spec.hpp:
The team_size is set to blockA0 https://github.com/kokkos/kokkos-kernels/blob/1a7b524ba38fdfab6c1058065af06cbcb4a2ce6f/src/blas/impl/KokkosBlas3_gemm_spec.hpp#L157
where blockA0 is hard-coded to 24 https://github.com/kokkos/kokkos-kernels/blob/1a7b524ba38fdfab6c1058065af06cbcb4a2ce6f/src/blas/impl/KokkosBlas3_gemm_spec.hpp#L138
and is then used to determine the size of blockA1.
I added this info to track two possible changes that get the unit tests passing:
1. Replace team_size with Kokkos::AUTO when creating the team policy used in the run method of GEMMImpl.
2. Reduce the blockA0 size from 24 to 16.
I'm assuming that prior to disabling deprecated code Kokkos acted in a way similar to option 1, but I wonder if option 2 is more important, since the algorithm may need the team_size to stay in sync with the blockA0 and blockA1 values.
What does Kokkos::AUTO actually pick for this case ? Is this the recommended way to do this ?
@crtrott Says....
Try to change launchbounds<384,2> to <384,1>. Also we need to check performance by changing this.
Kokkos::TeamPolicy<ExecSpace,Kokkos::LaunchBounds<384,1>> policy(num_blocks_0*num_blocks_1,team_size,vector_length);
@kyungjoo-kim thanks for the suggestion. I tried changing LaunchBounds<384,2> to <384,1>, but it still results in the same runtime test failure as before.
@crtrott since the LaunchBounds change did not work, is changing blockA0 to 16 (when CUDA is enabled; it can stay 24 otherwise via preprocessor macros) an acceptable alternative?
@crtrott I think that CUDA 9.2 is very problematic. @mperego trilinos/Trilinos#3340 also reports some cuda issues which did not happen with previous versions of CUDA.
Ok, this is actually an issue in Kokkos, since it only tests block sizes that are powers of two. For this kernel a block size of 32 per thread is too large but 24 would work; Kokkos, however, reported 16 as the max team size. I improved the Kokkos max-block-size functionality to make this work again.
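To make the power-of-two issue concrete, here is a small standalone model (my own sketch, not the Kokkos implementation; it ignores the occupancy and shared-memory constraints the real search also applies). Probing only powers of two under the 384-thread launch bound tops out at a 256-thread block, i.e. team size 256/16 = 16, even though a 384-thread block (team size 24) is itself valid:

```cpp
// Launch parameters taken from this issue.
const int max_threads = 384;  // from Kokkos::LaunchBounds<384,2>
const int vector_len  = 16;   // from the cuda-gdb trace

// Old-style search: only power-of-two block sizes are considered,
// so the largest candidate that fits under 384 is 256.
int max_team_size_pow2() {
  int block = 32, best = 0;
  while (block <= max_threads) { best = block; block *= 2; }  // 32,64,128,256
  return best / vector_len;
}

// Finer search: also try non-power-of-two multiples of the vector length,
// which lets the valid 384-thread block (team size 24) be found.
int max_team_size_fine() {
  int best = 0;
  for (int block = vector_len; block <= max_threads; block += vector_len)
    best = block;
  return best / vector_len;
}
```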
Yeah so now the answer for max team size is 24 ...
Closing it as it is fixed in Kokkos.
On White with a Cuda/9.2 build the following test failure occurs:
Modules loaded:
Queue and Arch: rhel7F queue (Kepler K80 GPUs) Power8,Kepler37 archs
Edit: Adding SHAs and script for generating makefiles
Kokkos SHA: kokkos/kokkos@5fe980f4b6d672cd58aab0e8685d8e1dfc9f809e KokkosKernels SHA: 9a9845e14234eb8f8edb7c1331c9a228eb24048d
Script:
Reproducing instructions:
1. Create a testing sub-directory within KokkosKernels: kokkos-kernels/testing/White-Cuda9.2
2. make install-lib -j16
3. cd unit_test
4. make -j32
5. Run the tests: ./KokkosKernels_UnitTest_Cuda