Cuda Sparse Tests Failing on Turing GPU

euphoricpoptarts commented 4 years ago

I am using kokkos-kernels on an Ubuntu 20.04 machine that has one Nvidia RTX 2080 Ti. I have compiled the kokkos-kernels source and run the tests with 'ctest --verbose' and I get the following failures from the Cuda-Sparse testset: 2: [----------] Global test environment tear-down 2: [==========] 47 tests from 1 test case ran. (13867 ms total) 2: [ PASSED ] 31 tests. 2: [ FAILED ] 16 tests, listed below: 2: [ FAILED ] cuda.sparse_spgemm_double_int_int_TestExecSpace 2: [ FAILED ] cuda.sparse_spgemm_double_int_size_t_TestExecSpace 2: [ FAILED ] cuda.sparse_spgemm_jacobi_double_int_int_TestExecSpace 2: [ FAILED ] cuda.sparse_spgemm_jacobi_double_int_size_t_TestExecSpace 2: [ FAILED ] cuda.sparse_spiluk_double_int_int_TestExecSpace 2: [ FAILED ] cuda.sparse_spiluk_double_int_size_t_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_double_int_int_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_struct_double_int_int_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_double_int_size_t_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_struct_double_int_size_t_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_mv_double_int_int_LayoutLeft_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_mv_struct_double_int_int_LayoutLeft_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_mv_double_int_size_t_LayoutLeft_TestExecSpace 2: [ FAILED ] cuda.sparse_spmv_mv_struct_double_int_size_t_LayoutLeft_TestExecSpace 2: [ FAILED ] cuda.sparse_sptrsv_double_int_int_TestExecSpace 2: [ FAILED ] cuda.sparse_sptrsv_double_int_size_t_TestExecSpace

I looked through the code used by the Cuda implementation of SpGemm and found two files with the following preprocessor macro: #if defined(KOKKOS_ARCH_VOLTA) || defined(KOKKOS_ARCH_VOLTA70) || defined(KOKKOS_ARCH_VOLTA72)
In the comments below this line it references an architecture change in Volta GPUs which requires the code to have unique behavior in the case of Volta GPUs. I figured that the same architecture differences might be present in Turing GPUs, so I changed the occurrences of this preprocessor macro in KokkosKernels_HashmapAccumulator.hpp and KokkosSparse_spgemm_impl_compression.hpp to: #if defined(KOKKOS_ARCH_VOLTA) || defined(KOKKOS_ARCH_VOLTA70) || defined(KOKKOS_ARCH_VOLTA72) || defined(KOKKOS_ARCH_TURING) || defined(KOKKOS_ARCH_TURING75) After making these changes and recompiling, the previously failed tests pass.

srajama1 commented 4 years ago

@crtrott : Can you check this ? I am somewhat worried about doing special things for the specific versions. It would be nice to abstract this somehow. How would one do it ?

brian-kelley commented 4 years ago

@srajama1,

@euphoricpoptarts has the right fix. that #if condition should be true for all Volta and later architectures. The issue was that pre-Volta, a ThreadVectorRange loop was guaranteed to have all vector lanes executing the same instruction at any given time, but with Volta that's no longer true (the warps can split into 2 independent groups now).

brian-kelley commented 4 years ago

I suggested to Christian in the kokkos slack to add a macro KOKKOS_CUDA_ARCH_VERSION or something similar that would let us replace this line with just #if (KOKKOS_CUDA_ARCH_VERSION >= 700), rather than checking for every possible arch enable macro.

euphoricpoptarts commented 4 years ago

Do you think it might also be reasonable to remove the macro entirely and execute the additional instruction regardless of architecture?

brian-kelley commented 4 years ago

@euphoricpoptarts I think that would cause a measurable slowdown on architectures where it's not necessary.

euphoricpoptarts commented 4 years ago

@brian-kelley I see, in that case I agree that a numeric ARCH_VERSION macro is the best solution.

srajama1 commented 4 years ago

@brian-kelley I saw this discussion in slack. If @crtrott adds the flag to Kokkos we can use it.

kokkos / kokkos-kernels

Cuda Sparse Tests Failing on Turing GPU #765