ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License

About an error encountered in running spgemm #1166

Closed xwentian2020 closed 2 years ago

xwentian2020 commented 2 years ago

I ran sparse matrix-matrix multiplication (spgemm) tests on an A100 80G, and the benchmark crashed with a segmentation fault in every spgemm mode (normal, transposed, sparse, dense), as shown below.

        This is Ginkgo 1.5.0 (develop)
          running with core module 1.5.0 (develop)
          the reference module is  1.5.0 (develop)
          the OpenMP    module is  not compiled
          the CUDA      module is  1.5.0 (develop)
          the HIP       module is  not compiled
          the DPCPP     module is  not compiled
        Running on cuda(0)
        Running with 2 warm iterations and 10 running iterations
        The random seed for right hand sides is 42
        The operations are spgemm,spgeam,transposeRunning test case: {
          "filename": "/nfs/site/home/xuwentia/spmv/suitesparse_intel/MM/Cunningham/m3plates/m3plates.mtx",
          "sparse_blas": {}
        }
        Matrix is of size (11107, 11107), 6639
        /localdisk/slurm/slurmd/job79761/slurm_script: line 67: 3444286 Segmentation fault      (core dumped) ${BENCHMARKING_EXE} ${SPARSE_BLAS_FLAGS} < ${MTXFILELIST} > ${DERIVEDJSON}

In the tests, the command was ./ginkgo/build/release_a100/benchmark/sparse_blas/sparse_blas_single --executor=cuda --gpu_timer=true --operations=spgemm,spgeam,transpose --strategies=classical --spgemm_mode=normal spgemm_results.json

upsj commented 2 years ago

Could you please attach a backtrace of the segfault, as well as the output of the CMake configure call, so we see what compilers and library versions you use? We call the cuSPARSE SpGEMM functions internally, so this might be either an issue on our side or a bug in cuSPARSE.

xwentian2020 commented 2 years ago

  ---------------------------------------------------------------------------------------------------------
  --
  --    Summary of Configuration for Ginkgo (version 1.5.0 with tag develop, shortrev 6a9e45912)
  --    Ginkgo configuration:
  --        CMAKE_BUILD_TYPE:                           Release
  --        BUILD_SHARED_LIBS:                          ON
  --        CMAKE_INSTALL_PREFIX:                       /nfs/site/home/xuwentia/spmv/ginkgo_20221015/release_a100
  --        PROJECT_SOURCE_DIR:                         /nfs/site/home/xuwentia/spmv/ginkgo_20221015/ginkgo
  --        PROJECT_BINARY_DIR:                         /nfs/site/home/xuwentia/spmv/ginkgo_20221015/ginkgo/build/release_a100
  --        CMAKE_CXX_COMPILER:                         GNU 11.2.0 on platform Linux x86_64
  --                                                    /usr/bin/c++
  --    User configuration:
  --      Enabled modules:
  --        GINKGO_BUILD_OMP:                           OFF
  --        GINKGO_BUILD_MPI:                           OFF
  --        GINKGO_BUILD_REFERENCE:                     ON
  --        GINKGO_BUILD_CUDA:                          ON
  --        GINKGO_BUILD_HIP:                           OFF
  --        GINKGO_BUILD_DPCPP:                         OFF
  --      Enabled features:
  --        GINKGO_MIXED_PRECISION:                     OFF
  --      Tests, benchmarks and examples:
  --        GINKGO_BUILD_TESTS:                         ON
  --        GINKGO_FAST_TESTS:                          OFF
  --        GINKGO_BUILD_EXAMPLES:                      ON
  --        GINKGO_EXTLIB_EXAMPLE:                      
  --        GINKGO_BUILD_BENCHMARKS:                    ON
  --        GINKGO_BENCHMARK_ENABLE_TUNING:             OFF
  --      Documentation:
  --        GINKGO_BUILD_DOC:                           OFF
  --        GINKGO_VERBOSE_LEVEL:                       1
  --    
  ---------------------------------------------------------------------------------------------------------
  --
  --    Compiled Modules
  --
  ---------------------------------------------------------------------------------------------------------
  --
  --    The Core module is being compiled.
  --
  --    CMake related Core module variables:
  --        BUILD_SHARED_LIBS:                          ON
  --        CMAKE_C_COMPILER:                           /usr/bin/gcc
  --        CMAKE_C_FLAGS_RELEASE:                      -O3 -DNDEBUG
  --        CMAKE_CXX_COMPILER:                         /usr/bin/c++
  --        CMAKE_CXX_FLAGS_RELEASE:                    -O3 -DNDEBUG
  --        CMAKE_GENERATOR:                            Unix Makefiles
  --    
  ---------------------------------------------------------------------------------------------------------
  --
  --    The Reference module is being compiled.
  --
  --    CMake related Reference module variables:
  --        GINKGO_BUILD_REFERENCE:                     ON
  --        GINKGO_COMPILER_FLAGS:                      -Wpedantic
  --    
  ---------------------------------------------------------------------------------------------------------
  --
  --    The CUDA module is being compiled.
  --
  --    CMake related CUDA module variables:
  --        GINKGO_CUDA_ARCHITECTURES:                  Ampere
  --        GINKGO_CUDA_COMPILER_FLAGS:                 <empty>
  --        GINKGO_CUDA_DEFAULT_HOST_COMPILER:          OFF
  --        GINKGO_CUDA_ARCH_FLAGS:                     --generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86
  --    CUDA variables:
  --        CMAKE_CUDA_COMPILER:                        /usr/local/cuda/bin/nvcc
  --        CMAKE_CUDA_COMPILER_VERSION:                11.1.105
  --        CMAKE_CUDA_FLAGS_RELEASE:                   -O3 -DNDEBUG
  --        CMAKE_CUDA_HOST_COMPILER:                   /usr/bin/c++
  --        CUDA_INCLUDE_DIRS:                          /usr/local/cuda/targets/x86_64-linux/include
  --    CUDA Libraries:
  --        CUBLAS:                                     /usr/local/cuda/targets/x86_64-linux/lib/stubs/libcublas.so
  --        CUDA_RUNTIME_LIBS:                          /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
  --        CUSPARSE:                                   /usr/local/cuda/targets/x86_64-linux/lib/stubs/libcusparse.so
  --    
  ---------------------------------------------------------------------------------------------------------
  --
  --      Developer Tools:
  --        GINKGO_DEVEL_TOOLS:                         OFF
  --        GINKGO_WITH_CLANG_TIDY:                     OFF
  --        GINKGO_WITH_IWYU:                           OFF
  --        GINKGO_CHECK_CIRCULAR_DEPS:                 OFF
  --        GINKGO_WITH_CCACHE:                         ON
  --      CCACHE:
  --        CCACHE_PROGRAM:                             CCACHE_PROGRAM-NOTFOUND
  --        CCACHE_DIR:                                 <empty>
  --        CCACHE_MAXSIZE:                             <empty>
  --      PATH of other tools:
  --        GINKGO_CLANG_TIDY_PATH:                     
  --        GINKGO_IWYU_PATH:                           
  --    
  ---------------------------------------------------------------------------------------------------------
  --
  --      Components:
  --        GINKGO_BUILD_HWLOC:                         OFF
  --        HWLOC_VERSION:                              
  --        HWLOC_LIBRARIES:                            
  --        HWLOC_INCLUDE_DIRS:                         
  --
  --  Now, run  cmake --build .  to compile Ginkgo!
  --
  ---------------------------------------------------------------------------------------------------------

upsj commented 2 years ago

I'm trying to reproduce this, but if it's possible for you, please attach a backtrace generated by loading the coredump into gdb and calling backtrace on it.
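For reference, a minimal sketch of that workflow. The helper name `backtrace_from_core` and the file name `core` are illustrative assumptions, not part of Ginkgo; where the core file is written, and under what name, depends on the system's core-dump settings.

```shell
# Hypothetical helper: print a backtrace from an existing core file.
#   $1 = path to the executable that crashed
#   $2 = path to the core file it produced
backtrace_from_core() {
    # -batch makes gdb exit after running the commands given with -ex.
    gdb -batch -ex backtrace "$1" "$2"
}

# Example use (paths are placeholders):
#   ulimit -c unlimited        # allow core files in the current shell
#   <rerun the crashing benchmark command>
#   backtrace_from_core ./sparse_blas_single core
```

A Release build has no debugging symbols, so the backtrace will show function names but no file or line information.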

upsj commented 2 years ago

There seems to be something off with your configuration: CUDA 11.1 doesn't support GCC 11.2, so which host compiler is nvcc using?
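A quick way to answer this is to print the versions of nvcc and the host compiler side by side. The sketch below uses a hypothetical helper, `report_version`; the tool names match the paths in the CMake summary above, and the `cmake` line in the comment is an assumption about how one might point nvcc at a different host compiler.

```shell
# Hypothetical helper: print a tool's version, or say that it is missing.
report_version() {
    if command -v "$1" >/dev/null 2>&1; then
        "$1" --version
    else
        echo "$1: not found"
    fi
}

report_version nvcc   # CUDA compiler version
report_version c++    # host compiler selected by CMake in the summary above

# If the host compiler is newer than the CUDA toolkit supports, one option
# is to configure a fresh build directory with a supported compiler, e.g.
#   cmake -DCMAKE_CUDA_HOST_COMPILER=/path/to/supported/g++ ..
# (the path is a placeholder).
```

The list of host compilers supported by a given CUDA release is in that release's installation guide.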

xwentian2020 commented 2 years ago

I collected some information on the machine equipped with A100 80G.

xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ 
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ ./sparse_blas_single --executor=cuda --gpu_timer=true --operations=spgemm --strategies=classical --spgemm_mode=dense </nfs/site/home/xuwentia/spmv/suitesparse_intel/mtxlist_vtune/mtxlist.json> sparse_blas_single_spgemm_dense.json
This is Ginkgo 1.5.0 (develop)
    running with core module 1.5.0 (develop)
    the reference module is  1.5.0 (develop)
    the OpenMP    module is  not compiled
    the CUDA      module is  1.5.0 (develop)
    the HIP       module is  not compiled
    the DPCPP     module is  not compiled
Running on cuda(0)
Running with 2 warm iterations and 10 running iterations
The random seed for right hand sides is 42
The operations are spgemmRunning test case: {
    "filename": "/nfs/site/home/xuwentia/spmv/suitesparse_intel/MM/Cunningham/m3plates/m3plates.mtx",
    "sparse_blas": {}
}
Matrix is of size (11107, 11107), 6639
Segmentation fault (core dumped)
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ 
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ whereis cuda-gdb
cuda-gdb: /usr/local/cuda-11.7/bin/cuda-gdb
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ 
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ cuda-gdb --version
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ 
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ 
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ cuda-gdb ./sparse_blas
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./sparse_blas...
(No debugging symbols found in ./sparse_blas)
(cuda-gdb) r --executor=cuda --gpu_timer=true --operations=spgemm --strategies=classical --spgemm_mode=dense </nfs/site/home/xuwentia/spmv/suitesparse_intel/mtxlist_vtune/mtxlist.json> sparse_blas_single_spgemm_dense.json
Starting program: /nfs/site/home/xuwentia/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas/sparse_blas --executor=cuda --gpu_timer=true --operations=spgemm --strategies=classical --spgemm_mode=dense </nfs/site/home/xuwentia/spmv/suitesparse_intel/mtxlist_vtune/mtxlist.json> sparse_blas_single_spgemm_dense.json
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 2297838]
[New Thread 0x155508da4000 (LWP 2297844)]
[New Thread 0x155508ba3000 (LWP 2297845)]
This is Ginkgo 1.5.0 (develop)
    running with core module 1.5.0 (develop)
    the reference module is  1.5.0 (develop)
    the OpenMP    module is  not compiled
    the CUDA      module is  1.5.0 (develop)
    the HIP       module is  not compiled
    the DPCPP     module is  not compiled
Running on cuda(0)
Running with 2 warm iterations and 10 running iterations
The random seed for right hand sides is 42
The operations are spgemmRunning test case: {
    "filename": "/nfs/site/home/xuwentia/spmv/suitesparse_intel/MM/Cunningham/m3plates/m3plates.mtx",
    "sparse_blas": {}
}
Matrix is of size (11107, 11107), 6639

Thread 1 "sparse_blas" received signal SIGSEGV, Segmentation fault.
0x000055555558f830 in SpgemmOperation::get_flops() const ()
(cuda-gdb) backtrace
#0  0x000055555558f830 in SpgemmOperation::get_flops() const ()
#1  0x0000555555575e28 in apply_sparse_blas(char const*, std::shared_ptr<gko::Executor>, gko::matrix::Csr<double, int> const*, rapidjson::GenericValue<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> >&, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>&) ()
#2  0x0000555555570bbf in main ()

upsj commented 2 years ago

Awesome, thanks. Like I thought, this is an issue in our own code. I'll try to reproduce it.

upsj commented 2 years ago

I managed to reproduce the issue, thanks for the report.