spgemm unit test failing with cusparse

ndellingwood commented 6 years ago

The test KokkosKernels_UnitTest_Cuda --gtest_filter=cuda.sparse_spgemm_double_int_int_TestExecSpace fails at runtime with the following output:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-opt/SRC_AND_BUILD/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Impl.cpp:119
Traceback functionality not available

The first issue to address is updating the macro name used in the unit test to properly match that enabled when the cusparse tpl is enabled, see e.g. PR . Following that, the test still fails with this output:


terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMemcpy( dst , src , n , cudaMemcpyDefault ) error( cudaErrorLaunchFailure): unspecified launch failure /ascldap/users/ndellin/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:89
Traceback functionality not available

First reported in trilinos/Trilinos#3438

ndellingwood commented 6 years ago

Looks like this test began failing by v2.7.00. Including some debugging info as reference to start chasing this issue down:

From cuda-gdb this is failing in the cusparseXcsrgemmNnz call at this line: kokkos-kernels/src/sparse/impl/KokkosSparse_spgemm_cuSPARSE_impl.hpp:114

With this runtime message

CUDA Exception: Warp Out-of-range Address

Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 1, grid 75, block (29,0,0), thread (0,0,0), device 0, sm 0, warp 7, lane 0]
0x0000000008d81a18 in void csrgemmNnz_kernel2<128, 32, 2, 4>(csrgemmNnz_params)<<<(2500,1,1),(128,1,1)>>> ()

From debug step-through it appears that the cuSparseHandle stuff is properly set.

srajama1 commented 6 years ago

Is it possible to cuda-memcheck on the failing test ? Apparently that EXCEPTION is a catch all for out of bounds in memory.

ndellingwood commented 5 years ago

I'm pretty sure the issue is that cusparse can't handle matrix that is being generated for the spgemm test.

Note: The spgemm test is testing A*A using two copies of a single generated sparse matrix.

Here's a list of things I've tried before reporting this:

Checked for any changes in the cusparse API related to the spgemm usage here - no changes. Fixed a couple macro bugs though they didn't have impact on the code.
Copied relevant parts of the unit test to create a standalone test outside of kokkos-kernels, same error.
From the standalone test, I removed all kokkos-kernels calls except the sparse matrix generator, I removed all the kokkos-kernels handle stuff, calls to spgemm_symbolic etc. and replaced with Cuda and CUSPARSE code. Same error.
I output the sparse matrix to matrix market format, and rewrote a raw Cuda code that read the matrix and tested it. Same error.
I tested the output matrix in Octave to see if it could handle the spgemm, no problems reading it or multiplying to get the result A*A.
I tested the code posted on this link by Robert Crovella as a sanity check, unaltered - it passed.
I took the code from (6) and replaced the internally generated matrices with the output matrix from (4) in case I made an error in my own code. Same error.

srajama1 commented 5 years ago

@ndellingwood : Thanks for checking this thoroughly. Does the matrix generated change in anyway from 2.6 and 2.7 in our code. A git log on that file file might be able to help. Also, cany you attach the matrices and the sample codes you did. I am going to forward this to NVIDIA.

crtrott commented 5 years ago

One thing you can do is check with CUDA 10 on kokkos-dev. Do module use /home/projects/modulefiles to load the module. That should have a new CuSparse version. Check your raw CUDA code.

ndellingwood commented 5 years ago

I'll test these things when I get a chance to login to White and kokkos-dev, thanks for the feedback!

srajama1 commented 5 years ago

@ndellingwood : Just a ping that we might want to test this on 10.1

seheracer commented 4 years ago

I just tested cusparse spgemm on 10.1. It is still failing with the same error. @srajama1 @ndellingwood @brian-kelley @lucbv

seheracer commented 4 years ago

I also tried running it with a smaller matrix of size 100 x 100. Scalar=Double failed with the same error without producing an output matrix, whereas Scalar=float failed after producing an output matrix. I dumped this output matrix (of cuSPARSE spgemm) to a file and compared it against the output matrices of SPGEMM_KK_MEMORY, SPGEMM_DEBUG, and Octave: 1) The number of nonzeros in the cuSPARSE version is wrong (It is 1262 but should be 1260). 2) Some of the column indices are off the limits. 3) Some entries match in their column indices but the value is wrong, some entries match in both column indices and values.

brian-kelley commented 4 years ago

@seheracer Could this a one-based vs zero-based indexing error?

brian-kelley commented 4 years ago

Hmm I don't think that's it, because cusparseCreateMatDescr will use CUSPARSE_INDEX_BASE_ZERO by default.

ndellingwood commented 4 years ago

When I tested this previously the issue seemed to come from CuSparse having issues with some sparse matrices that were generated for the test. I dumped the generated matrix to file and ran with a standalone CuSparse code (independent from Kokkos and KokkosKernels) and it died in that case as well. I tested the same matrix with either Octave or Matlab and had no issues. If I can find the standalone code I used I'll share.

lucbv commented 4 years ago

@ndellingwood I was debating trying to do a direct call to the Cuda kernel but if you already tried that then it's probably not worth it?

seheracer commented 4 years ago

@brian-kelley There are entries with column indices \in {101, 127, 128} in the first two rows of the output matrix. I checked the other rows as well, and couldn't see such off-the-limits entries in the rest of the matrix. Yet, mismatches in the column indices and values still occur in the rest of the matrix.

seheracer commented 4 years ago

When I tested this previously the issue seemed to come from CuSparse having issues with some sparse matrices that were generated for the test. I dumped the generated matrix to file and ran with a standalone CuSparse code (independent from Kokkos and KokkosKernels) and it died in that case as well. I tested the same matrix with either Octave or Matlab and had no issues. If I can find the standalone code I used I'll share.

I think we shouldn't spend more time on this until we have the Cuda 11 version.

brian-kelley commented 4 years ago

@seheracer But we'll still have to support Cuda 8-10.

kokkos / kokkos-kernels

spgemm unit test failing with cusparse #296