xwentian2020 closed this issue 2 years ago.
Could you please attach a backtrace of the segfault, as well as the output of the CMake configure call, so we can see which compilers and library versions you use? We call the cuSPARSE SpGEMM functions internally, so this might be either an issue on our side or a bug in cuSPARSE.
---------------------------------------------------------------------------------------------------------
--
-- Summary of Configuration for Ginkgo (version 1.5.0 with tag develop, shortrev 6a9e45912)
-- Ginkgo configuration:
-- CMAKE_BUILD_TYPE: Release
-- BUILD_SHARED_LIBS: ON
-- CMAKE_INSTALL_PREFIX: /nfs/site/home/xuwentia/spmv/ginkgo_20221015/release_a100
-- PROJECT_SOURCE_DIR: /nfs/site/home/xuwentia/spmv/ginkgo_20221015/ginkgo
-- PROJECT_BINARY_DIR: /nfs/site/home/xuwentia/spmv/ginkgo_20221015/ginkgo/build/release_a100
-- CMAKE_CXX_COMPILER: GNU 11.2.0 on platform Linux x86_64
-- /usr/bin/c++
-- User configuration:
-- Enabled modules:
-- GINKGO_BUILD_OMP: OFF
-- GINKGO_BUILD_MPI: OFF
-- GINKGO_BUILD_REFERENCE: ON
-- GINKGO_BUILD_CUDA: ON
-- GINKGO_BUILD_HIP: OFF
-- GINKGO_BUILD_DPCPP: OFF
-- Enabled features:
-- GINKGO_MIXED_PRECISION: OFF
-- Tests, benchmarks and examples:
-- GINKGO_BUILD_TESTS: ON
-- GINKGO_FAST_TESTS: OFF
-- GINKGO_BUILD_EXAMPLES: ON
-- GINKGO_EXTLIB_EXAMPLE:
-- GINKGO_BUILD_BENCHMARKS: ON
-- GINKGO_BENCHMARK_ENABLE_TUNING: OFF
-- Documentation:
-- GINKGO_BUILD_DOC: OFF
-- GINKGO_VERBOSE_LEVEL: 1
--
---------------------------------------------------------------------------------------------------------
--
-- Compiled Modules
--
---------------------------------------------------------------------------------------------------------
--
-- The Core module is being compiled.
--
-- CMake related Core module variables:
-- BUILD_SHARED_LIBS: ON
-- CMAKE_C_COMPILER: /usr/bin/gcc
-- CMAKE_C_FLAGS_RELEASE: -O3 -DNDEBUG
-- CMAKE_CXX_COMPILER: /usr/bin/c++
-- CMAKE_CXX_FLAGS_RELEASE: -O3 -DNDEBUG
-- CMAKE_GENERATOR: Unix Makefiles
--
---------------------------------------------------------------------------------------------------------
--
-- The Reference module is being compiled.
--
-- CMake related Reference module variables:
-- GINKGO_BUILD_REFERENCE: ON
-- GINKGO_COMPILER_FLAGS: -Wpedantic
--
---------------------------------------------------------------------------------------------------------
--
-- The CUDA module is being compiled.
--
-- CMake related CUDA module variables:
-- GINKGO_CUDA_ARCHITECTURES: Ampere
-- GINKGO_CUDA_COMPILER_FLAGS: <empty>
-- GINKGO_CUDA_DEFAULT_HOST_COMPILER: OFF
-- GINKGO_CUDA_ARCH_FLAGS: --generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86
-- CUDA variables:
-- CMAKE_CUDA_COMPILER: /usr/local/cuda/bin/nvcc
-- CMAKE_CUDA_COMPILER_VERSION: 11.1.105
-- CMAKE_CUDA_FLAGS_RELEASE: -O3 -DNDEBUG
-- CMAKE_CUDA_HOST_COMPILER: /usr/bin/c++
-- CUDA_INCLUDE_DIRS: /usr/local/cuda/targets/x86_64-linux/include
-- CUDA Libraries:
-- CUBLAS: /usr/local/cuda/targets/x86_64-linux/lib/stubs/libcublas.so
-- CUDA_RUNTIME_LIBS: /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
-- CUSPARSE: /usr/local/cuda/targets/x86_64-linux/lib/stubs/libcusparse.so
--
---------------------------------------------------------------------------------------------------------
--
-- Developer Tools:
-- GINKGO_DEVEL_TOOLS: OFF
-- GINKGO_WITH_CLANG_TIDY: OFF
-- GINKGO_WITH_IWYU: OFF
-- GINKGO_CHECK_CIRCULAR_DEPS: OFF
-- GINKGO_WITH_CCACHE: ON
-- CCACHE:
-- CCACHE_PROGRAM: CCACHE_PROGRAM-NOTFOUND
-- CCACHE_DIR: <empty>
-- CCACHE_MAXSIZE: <empty>
-- PATH of other tools:
-- GINKGO_CLANG_TIDY_PATH:
-- GINKGO_IWYU_PATH:
--
---------------------------------------------------------------------------------------------------------
--
-- Components:
-- GINKGO_BUILD_HWLOC: OFF
-- HWLOC_VERSION:
-- HWLOC_LIBRARIES:
-- HWLOC_INCLUDE_DIRS:
--
-- Now, run cmake --build . to compile Ginkgo!
--
---------------------------------------------------------------------------------------------------------
I'm trying to reproduce this, but if it's possible for you, please also attach a backtrace generated by loading the coredump into gdb and running the backtrace command on it.
There seems to be something off with your configuration: CUDA 11.1 doesn't support GCC 11.2, so which host compiler is nvcc using?
I collected some information on the machine equipped with an A100 80G.
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ ./sparse_blas_single --executor=cuda --gpu_timer=true --operations=spgemm --strategies=classical --spgemm_mode=dense </nfs/site/home/xuwentia/spmv/suitesparse_intel/mtxlist_vtune/mtxlist.json> sparse_blas_single_spgemm_dense.json
This is Ginkgo 1.5.0 (develop)
running with core module 1.5.0 (develop)
the reference module is 1.5.0 (develop)
the OpenMP module is not compiled
the CUDA module is 1.5.0 (develop)
the HIP module is not compiled
the DPCPP module is not compiled
Running on cuda(0)
Running with 2 warm iterations and 10 running iterations
The random seed for right hand sides is 42
The operations are spgemmRunning test case: {
"filename": "/nfs/site/home/xuwentia/spmv/suitesparse_intel/MM/Cunningham/m3plates/m3plates.mtx",
"sparse_blas": {}
}
Matrix is of size (11107, 11107), 6639
Segmentation fault (core dumped)
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ whereis cuda-gdb
cuda-gdb: /usr/local/cuda-11.7/bin/cuda-gdb
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ cuda-gdb --version
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$
xuwentia@ortce-a100-80G2:~/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas$ cuda-gdb ./sparse_blas
NVIDIA (R) CUDA Debugger
11.7 release
Portions Copyright (C) 2007-2022 NVIDIA Corporation
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./sparse_blas...
(No debugging symbols found in ./sparse_blas)
(cuda-gdb) r --executor=cuda --gpu_timer=true --operations=spgemm --strategies=classical --spgemm_mode=dense </nfs/site/home/xuwentia/spmv/suitesparse_intel/mtxlist_vtune/mtxlist.json> sparse_blas_single_spgemm_dense.json
Starting program: /nfs/site/home/xuwentia/spmv/ginkgo_20221015/ginkgo/build/release_a100/benchmark/sparse_blas/sparse_blas --executor=cuda --gpu_timer=true --operations=spgemm --strategies=classical --spgemm_mode=dense </nfs/site/home/xuwentia/spmv/suitesparse_intel/mtxlist_vtune/mtxlist.json> sparse_blas_single_spgemm_dense.json
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/x86_64-linux-gnu/libthread_db.so.1".
[Detaching after fork from child process 2297838]
[New Thread 0x155508da4000 (LWP 2297844)]
[New Thread 0x155508ba3000 (LWP 2297845)]
This is Ginkgo 1.5.0 (develop)
running with core module 1.5.0 (develop)
the reference module is 1.5.0 (develop)
the OpenMP module is not compiled
the CUDA module is 1.5.0 (develop)
the HIP module is not compiled
the DPCPP module is not compiled
Running on cuda(0)
Running with 2 warm iterations and 10 running iterations
The random seed for right hand sides is 42
The operations are spgemmRunning test case: {
"filename": "/nfs/site/home/xuwentia/spmv/suitesparse_intel/MM/Cunningham/m3plates/m3plates.mtx",
"sparse_blas": {}
}
Matrix is of size (11107, 11107), 6639
Thread 1 "sparse_blas" received signal SIGSEGV, Segmentation fault.
0x000055555558f830 in SpgemmOperation::get_flops() const ()
(cuda-gdb) backtrace
#0 0x000055555558f830 in SpgemmOperation::get_flops() const ()
#1 0x0000555555575e28 in apply_sparse_blas(char const*, std::shared_ptr<gko::Executor>, gko::matrix::Csr<double, int> const*, rapidjson::GenericValue<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> >&, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator>&) ()
#2 0x0000555555570bbf in main ()
Awesome, thanks. Like I thought, this is an issue in our own code. I'll try to reproduce it.
I managed to reproduce the issue, thanks for the report.
Tests were run on an A100 80G for sparse matrix-matrix multiplication, and segmentation faults occurred for all the spgemm modes (normal, transposed, sparse, dense), as shown in the output above.
In the tests, the command used was ./ginkgo/build/release_a100/benchmark/sparse_blas/sparse_blas_single --executor=cuda --gpu_timer=true --operations=spgemm,spgeam,transpose --strategies=classical --spgemm_mode=normal spgemm_results.json