Closed: ndellingwood closed this issue 6 years ago.
Does it change based on the Cuda version? We have nightlies on White that don't have this problem.
@srajama1 I'll check with cuda/9.0 and hop over to the pascal queue and do some testing there.
@srajama1 same error using cuda/9.0.176 + gcc/6.4.0 (gcc/7.2.0 is not compatible with our cuda/9.0.x compilers).
[ RUN ] cuda.gemm_double
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.
[ FAILED ] cuda.gemm_double (110 ms)
[ RUN ] cuda.sparse_spgemm_double_int_int_TestExecSpace
Segmentation fault (core dumped)
@srajama1 The description:
C++ exception with description "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
possibly suggests that a team_size is hard-coded too large for the hardware. Previously an oversized team_size would be silently adjusted down to the maximum possible team_size; with deprecated code disabled, that behavior is no longer allowed. However, I did not pass a flag to disable deprecated code, so maybe the switch was flipped in the kokkos develop branch to disable deprecated code by default. If so I can start looking at that as the cause, though I thought I caught all these cases in KokkosKernels a while back...
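To make the suspected behavior change concrete, here is a small standalone model (my own sketch, not the Kokkos source) of the two behaviors: silently clamping an oversized team_size versus throwing once deprecated code is disabled:

```cpp
#include <stdexcept>

// Hypothetical model of the behavior described above (not the Kokkos
// implementation): with deprecated code enabled an oversized team_size is
// silently reduced to the device maximum; with it disabled, Kokkos throws.
int resolve_team_size(int requested, int device_max, bool deprecated_code_enabled) {
  if (requested <= device_max) return requested;
  if (deprecated_code_enabled) return device_max;  // old: silent adjustment
  throw std::runtime_error(
      "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.");
}
```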
Reference output using the kernel-logger from kokkos-tools to help pinpoint where the error occurs:
KokkosP: Entering profiling region: KokkosBlas::gemm[ETI]
KokkosP: Executing parallel-for kernel on device 0 with unique execution identifier 18
KokkosP: KokkosBlas::Test::gemm_double
KokkosP: KokkosBlas::gemm[ETI]
KokkosP: KokkosBlas::gemm[NN]
KokkosP: Deallocate<Cuda> name: Kokkos::Random_XorShift64::state pointer: 0x8304820080 size: 1048576
KokkosP: Deallocate<Cuda> name: Kokkos::Random_XorShift64::locks pointer: 0x8304720080 size: 524288
KokkosP: Deallocate<Cuda> name: C pointer: 0x8304522880 size: 1560
KokkosP: Deallocate<Cuda> name: C pointer: 0x8304522080 size: 1560
KokkosP: Deallocate<Cuda> name: B pointer: 0x8304521680 size: 2040
KokkosP: Deallocate<Cuda> name: A pointer: 0x8304520e80 size: 1768
unknown file: Failure
C++ exception with description "Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
Traceback functionality not available
" thrown in the test body.
[ FAILED ] cuda.gemm_double (9 ms)
Edit: More reference notes
KokkosP: KokkosBlas::gemm[NN]
may indicate this is dying somewhere in src/blas/impl/KokkosBlas3_gemm_impl.hpp.
The unit-test impl code itself uses parallel_* patterns with a team_policy that uses Kokkos::AUTO for the team_size, so in this location my guess about behavior changes with team_size and disabled deprecated code is wrong.
Update:
In src/blas/impl/KokkosBlas3_gemm_impl.hpp there is a GEMMImpl<...>::run(...) method that takes team_size as a runtime argument and then uses it to construct a team_policy.
src/blas/impl/KokkosBlas3_gemm_spec.hpp contains GEMM<...>::gemm(...), which calls the above run method; the following info is used to set the team_size:
line 138:
static constexpr int blockA0 = 24;
...
line 154:
// Figure out Team Sizes
int team_size = 1;
#if defined(KOKKOS_ENABLE_CUDA)
if(std::is_same<typename CViewType::execution_space,Kokkos::Cuda>::value)
team_size = blockA0;
#endif
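Putting the snippets together: with team_size = blockA0 = 24 and the vector_length = 16 that appears in the cuda-gdb trace below, the launch asks for 24*16 = 384 threads per block, which matches the Kokkos::LaunchBounds<384,2> on the policy. A trivial standalone check of that arithmetic (values copied from this issue, nothing else assumed):

```cpp
#include <cassert>

// Values copied from the snippets in this issue.
constexpr int blockA0 = 24;        // hard-coded in KokkosBlas3_gemm_spec.hpp
constexpr int vector_length = 16;  // from the cuda-gdb trace
constexpr int threads_per_block = blockA0 * vector_length;
static_assert(threads_per_block == 384,
              "matches the 384 in Kokkos::LaunchBounds<384,2>");
```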
@srajama1 @crtrott does this algorithm require that the team_size be hard coded to a particular value, or is this value chosen based on performance? I don't want to blindly replace the team size with Kokkos::AUTO
without knowing better how it is used by the algorithm.
Edit: TeamPolicy construction plus parallel_for call in run of src/blas/impl/KokkosBlas3_gemm_impl.hpp, beginning at line 454:
Kokkos::TeamPolicy<ExecSpace,Kokkos::LaunchBounds<384,2>> policy(num_blocks_0*num_blocks_1,team_size,vector_length);
Kokkos::parallel_for(impl_gemm_label<TransposeA,TransposeB>::label,policy.set_scratch_size(scratch_level,Kokkos::PerTeam(scratch_memory_size)),*this);
This is a verbose message, but for reference here is the step info from cuda-gdb, from the last point in GEMMImpl until the code fails:
(cuda-gdb)
Kokkos::Impl::CudaGetMaxBlockSize<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u>, false>::get_block_size (shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Internal.hpp:176
176 sharedmem = shmem_extra_block + shmem_extra_thread*(blockSize/vector_length) +
(cuda-gdb)
181 cudaOccupancyMaxActiveBlocksPerMultiprocessor(
(cuda-gdb)
cudaOccupancyMaxActiveBlocksPerMultiprocessor<void (*)(Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>)> (dynamicSMemSize=23552,
blockSize=256,
func=0x10349b40 <Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, 384u, 2u>(Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>)>, numBlocks=0x7fffffffc4f8)
at /home/projects/ppc64le-pwr8-nvidia/cuda/9.0.176/include/cuda_runtime.h:1411
1411 return ::cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags(numBlocks, (const void*)func, blockSize, dynamicSMemSize, cudaOccupancyDefault);
(cuda-gdb)
Kokkos::Impl::CudaGetMaxBlockSize<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u>, false>::get_block_size (shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Internal.hpp:174
174 while (blockSize*2<=int(MaxThreadsPerBlock) && numBlocks>0) {
(cuda-gdb)
KokkosBlas::Impl::GEMM<Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, false, true>::gemm (transA=0x104509f0 <._567+2164> "NN",
transB=0x104509f1 <._567+2165> "N", alpha=@0x7fffffffce50: 5, A=..., B=..., beta=@0x7fffffffce58: 3, C=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>::run (scr_level=0,
vector_length=16, team_size=24, this=0x7fffffffc378) at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (functor=..., policy=..., str=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (str=..., functor=..., policy=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>::ParallelFor (arg_policy=..., arg_functor=..., this=0x7fffffffc2b0)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::cuda_get_max_block_size<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u> > (
shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::CudaGetMaxBlockSize<Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>, Kokkos::LaunchBounds<384u, 2u>, false>::get_block_size (shmem_extra_thread=<optimized out>, shmem_extra_block=<optimized out>, vector_length=16, f=...)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Internal.hpp:187
187 if( numBlocks >= int(MinBlocksPerSM) && blockSize <= int(MaxThreadsPerBlock) ) return blockSize;
(cuda-gdb)
Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>::ParallelFor (arg_policy=..., arg_functor=..., this=0x7fffffffc2b0)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Parallel.hpp:729
729 if ( int(m_team_size) >
(cuda-gdb)
KokkosBlas::Impl::GEMM<Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, false, true>::gemm (transA=0x104509f0 <._567+2164> "NN",
transB=0x104509f1 <._567+2165> "N", alpha=@0x7fffffffce50: 5, A=..., B=..., beta=@0x7fffffffce58: 3, C=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>::run (scr_level=0,
vector_length=16, team_size=24, this=0x7fffffffc378) at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (functor=..., policy=..., str=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0> > (str=..., functor=..., policy=...)
at /ascldap/users/ndellin/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_spec.hpp:167
167 gemm.run(team_size,vector_length,scratch_level);
(cuda-gdb)
Kokkos::Impl::ParallelFor<KokkosBlas::Impl::GEMMImpl<Kokkos::Cuda, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double const**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, Kokkos::View<double**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<1u> >, 24, 16, 64, 0, 0>, Kokkos::TeamPolicy<Kokkos::Cuda, Kokkos::LaunchBounds<384u, 2u> >, Kokkos::Cuda>::ParallelFor (arg_policy=..., arg_functor=..., this=0x7fffffffc2b0)
at /ascldap/users/ndellin/kokkos-kernels/testing/White-Cuda9.0-Kepler/kokkos/install/include/Cuda/Kokkos_Cuda_Parallel.hpp:732
732 Kokkos::Impl::throw_runtime_exception(std::string("Kokkos::Impl::ParallelFor< Cuda > requested too large team size."));
There seem to be two errors: the spgemm segfault (is this even a problem?) and the gemm team size. Let us look at this together.
@srajama1 the spgemm segfault was an artifact of the gemm test failure; probably the GPU memory was corrupted:
bash-4.2$ ./KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.sparse_spgemm_double_int_int_TestExecSpace
Note: Google Test filter = cuda.sparse_spgemm_double_int_int_TestExecSpace
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from cuda
[ RUN ] cuda.sparse_spgemm_double_int_int_TestExecSpace
[ OK ] cuda.sparse_spgemm_double_int_int_TestExecSpace (2792 ms)
[----------] 1 test from cuda (2792 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (2792 ms total)
[ PASSED ] 1 test.
So we can just focus on the gemm test.
@ndellingwood: Thanks for checking. @crtrott: What is with this team size of 24?
I think it was an optimization choice. Need to check what valid options are.
FWIW if I replace the team_size with Kokkos::AUTO in the team policy used by GEMMImpl<...>::run(team_size,vector_length,scr_level), the unit test passes.
@crtrott why is the team_size hard-coded to 24 at line 138 of src/blas/impl/KokkosBlas3_gemm_spec.hpp? I'm assuming that, prior to disabling deprecated code in Kokkos, this worked because the team_size was silently reduced to the smaller maximum value that was allowed.
Is it okay if I use Kokkos::AUTO as above as the fix?
Alternatively, we could do a check similar to Kokkos' check during construction of the ParallelFor and reset the team_size if it is too large.
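That alternative could look something like the following sketch (hypothetical code, not the KokkosKernels source; I'm assuming hardware_max would come from a Kokkos query such as the policy's maximum team size for the functor):

```cpp
#include <algorithm>

// Hypothetical "check and reset" helper: clamp the requested team_size to
// the maximum the occupancy calculation allows, instead of letting the
// ParallelFor constructor throw.
int checked_team_size(int requested, int hardware_max) {
  return std::min(requested, hardware_max);
}
```

For the failing case above this would turn the requested 24 into the 16 that Kokkos currently reports as the maximum, rather than throwing.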
For reference, from src/blas/impl/KokkosBlas3_gemm_spec.hpp:
The team_size is set to blockA0 https://github.com/kokkos/kokkos-kernels/blob/1a7b524ba38fdfab6c1058065af06cbcb4a2ce6f/src/blas/impl/KokkosBlas3_gemm_spec.hpp#L157
where blockA0 is hard-coded to 24 https://github.com/kokkos/kokkos-kernels/blob/1a7b524ba38fdfab6c1058065af06cbcb4a2ce6f/src/blas/impl/KokkosBlas3_gemm_spec.hpp#L138
and is then used to determine the size of blockA1.
I added this info to track two possible changes that get the unit tests passing:
1. Replace team_size with Kokkos::AUTO when creating the team policy used in the run method of GEMMImpl.
2. Reduce the blockA0 size from 24 to 16.
I'm assuming that prior to disabling deprecated code Kokkos acted in a way similar to option 1, but I wonder if option 2 is more important, since the algorithm may need the team_size to stay in sync with the blockA0 and blockA1 values.
What does Kokkos::AUTO actually pick for this case ? Is this the recommended way to do this ?
@crtrott Says....
Try to change launchbounds<384,2> to <384,1>. Also we need to check performance by changing this.
Kokkos::TeamPolicy<ExecSpace,Kokkos::LaunchBounds<384,1>> policy(num_blocks_0*num_blocks_1,team_size,vector_length);
@kyungjoo-kim thanks for the suggestion. I tried changing LaunchBounds<384,2> to <384,1>, but it still results in the same runtime test failure as before.
@crtrott since the LaunchBounds change did not work, is changing blockA0 to 16 (when CUDA is enabled; it can stay 24 otherwise via preprocessor macros) an acceptable alternative?
@crtrott I think that CUDA 9.2 is very problematic. @mperego trilinos/Trilinos#3340 also reports some cuda issues which did not happen with previous versions of CUDA.
Ok, this is actually an issue in Kokkos, since it only tests block sizes that are powers of two. For this kernel a block size of 32 per thread is too large but 24 would work; Kokkos, however, reported 16 as the max team size. I improved the Kokkos max-block-size functionality to make this work again.
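To make the power-of-two issue concrete, here is a small standalone model (my own sketch, not the Kokkos implementation; it ignores the occupancy and shared-memory constraints the real search also applies). Probing only powers of two under the 384-thread launch bound tops out at a 256-thread block, i.e. team size 256/16 = 16, even though a 384-thread block (team size 24) is itself valid:

```cpp
// Launch parameters taken from this issue.
const int max_threads = 384;  // from Kokkos::LaunchBounds<384,2>
const int vector_len  = 16;   // from the cuda-gdb trace

// Old-style search: only power-of-two block sizes are considered,
// so the largest candidate that fits under 384 is 256.
int max_team_size_pow2() {
  int block = 32, best = 0;
  while (block <= max_threads) { best = block; block *= 2; }  // 32,64,128,256
  return best / vector_len;
}

// Finer search: also try non-power-of-two multiples of the vector length,
// which lets the valid 384-thread block (team size 24) be found.
int max_team_size_fine() {
  int best = 0;
  for (int block = vector_len; block <= max_threads; block += vector_len)
    best = block;
  return best / vector_len;
}
```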
Yeah so now the answer for max team size is 24 ...
Closing it as it is fixed in Kokkos.
On White with a Cuda/9.2 build the following test failure occurs:
Modules loaded:
Queue and Arch: rhel7F queue (Kepler K80 GPUs) Power8,Kepler37 archs
Edit: Adding SHAs and script for generating makefiles
Kokkos SHA: kokkos/kokkos@5fe980f4b6d672cd58aab0e8685d8e1dfc9f809e KokkosKernels SHA: 9a9845e14234eb8f8edb7c1331c9a228eb24048d
Script:
Reproducing instructions:
1. Create a testing sub-directory within KokkosKernels: kokkos-kernels/testing/White-Cuda9.2
2. make install-lib -j16
3. cd unit_test
4. make -j32
5. Run the tests: ./KokkosKernels_UnitTest_Cuda