kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

SPGEMM -- Segmentation fault #2291

Closed Sachitt closed 2 months ago

Sachitt commented 3 months ago

Hi,

I followed the build instructions for kokkos-kernels with OpenMP support and perf_tests enabled, then ran ./sparse_spgemm in kokkos-kernels/build/perf_test/sparse. For smaller datasets, such as cit-Patents, ca-GrQc, and roadnet-CA, this ran fine. However, when I ran it on the uk-2005 dataset, I got a segfault during the numeric phase (the symbolic phase completes fine).

For each dataset, I am computing A*A. I am running on aarch64 with OpenMP enabled.

Please let me know if you have any ideas about why this could be happening.

Here is the output with the --verbose flag:

Running on OpenMP backend.
B is not provided or is the same as A. Multiplying AxA.
m:39459925 n:39459925 k:39459925
SYMBOLIC PHASE
Original Max Row Flops:348499
Original overall_flops Flops:71150593524
tOriginal Max Row Flop Calc Time:1.7172
COMPRESS MATRIX-B PHASE
n:39459925 nnz:936364282 vector_size:1 team_size:1 chunk_size::16 shmem:16128
Compression Allocations:1.28e-07
COMPRESS -- thread_memory:16128 unit_memory:16 initial key size:1006
COMPRESS -- adjusted hashsize:512 shmem_key_size:1170
POOL chunksize:26810 num_chunks:18 min_hash_size:8192 max_row_nnz:5213
Pool Alloc MB:1.8409
Compression Count Kernel:0.865291
Compressed Max Row Flops:50700
Compressed Overall Row Flops:6963612531
Compressed Flops ratio:0.0978715 min_reduction:0.85
Compressed Max Row Flop Calc Time:1.70562
COMPRESS MATRIX-B overall time:3.74296

        Running SPGEMM_KK_MEMORY col_size:1233123 max_column_cut_off:250000
Pool Size (MB):19.4439 num_chunks:18 chunksize:283172
Pool Alloc Time:0.0321338
StructureC  thread_memory:16128 unit_memory:16 adjusted hashsize:512 adjusted shmem_key_size:1172 using 16124 of thread_memory: 16128
StructureC vector_size:1 team_size:1 chunk_size:16 shmem_size:16128
StructureC Kernel time:9.25698

C SIZE:382465606
Numeric PHASE
HASH MODE
initial PortableNumericCHASH -- thread_memory:16120 unit_memory:20 initial key size:805
initial PortableNumericCHASH -- team_memory:16128 unit_memory:20 initial team key size:805
Running SPGEMM_KK_MEMORY col_size:39459925 max_column_cut_off:250000
PortableNumericCHASH -- adjusted hashsize:512 thread_shmem_key_size:878
PortableNumericCHASH -- adjusted team hashsize:512 team_shmem_key_size:878
max_nnz: 128941 chunk_size:653229 min_hash_size:262144 concurrency:18 MyExecSpace().concurrency():18 numchunks:18
num_chunks:32 chunk_size:653229 overall_size:20903328 modular_num_chunks:31
Printing chunk_locks view

Printing data view
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ... ... ... -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
Pool Alloc Time:0.0244646
Pool Size(MB):44.8537
PortableNumericCHASH -- sizeof(scalar_t): 8 sizeof(nnz_lno_t): 4 suggested_team_size: 1
PortableNumericCHASH -- thread_memory:16128 unit_memory:20 initial key size:805
PortableNumericCHASH -- team shared_memory:16128 unit_memory:20 initial team key size:805
PortableNumericCHASH -- thread_memory:16128 unit_memory:20 resized key size:878
PortableNumericCHASH -- team shared_memory:16128 unit_memory:20 resized team key size:878
PortableNumericCHASH -- thread_memory:16128 unit_memory:20 initial key size:878
PortableNumericCHASH -- team_memory:16128 unit_memory:20 initial team key size:878
PortableNumericCHASH -- adjusted hashsize:512 thread_shmem_key_size:878
PortableNumericCHASH -- adjusted team hashsize:512 team_shmem_key_size:878
team_cuckoo_key_size:1024 team_cuckoo_hash_func:1023 max_first_level_hash_size:512 pow2_hash_size:262144 pow2_hash_func:262143
vector_size:1 chunk_size:16 suggested_team_size:1
Segmentation fault (core dumped)

cwpearson commented 3 months ago

Can you please provide the compiler used?

Sachitt commented 3 months ago

Hi, the compiler used is g++

brian-kelley commented 2 months ago

@Sachitt This matrix you're squaring has almost a billion entries, and by default KokkosKernels uses int (32-bit) to represent the row offsets in sparse matrices. So it's likely that the NNZ count of the C matrix, which here is printed as 382465606, has overflowed and so C's entries/values are allocated to the wrong size.

I'm not sure if the C result will fit in memory, but you can at least try

-DKokkosKernels_INST_OFFSET_SIZE_T=ON
-DKokkosKernels_INST_OFFSET_INT=OFF

to use size_t (64 bit) to represent row offsets and nonzero counts. If you do run out of memory you should get an informative message and not just a segfault.
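
As a rough sketch of where those two options would go, the snippet below assembles a kokkos-kernels configure command and prints it for review instead of running it. The paths and the other flags are borrowed from the reproducer script posted later in this thread, and the build directory name build-kernels-offset64 is a hypothetical choice; only the two INST_OFFSET options come from the suggestion above.

```shell
#!/bin/sh
# Paths follow the reproducer in this thread; adjust them to your checkout.
ROOT="${ROOT:-$HOME/proj/kk-issue-2291}"
KERNELS_SRC="$ROOT/kernels"
KOKKOS_INSTALL="$ROOT/install-kokkos"
# Hypothetical: a separate build directory so the 32-bit-offset build is kept.
KERNELS_BUILD="$ROOT/build-kernels-offset64"

# Print the configure command with 64-bit (size_t) row offsets enabled
# and 32-bit (int) row offsets disabled.
configure_cmd() {
    printf '%s ' cmake -S "$KERNELS_SRC" -B "$KERNELS_BUILD" \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        -DKokkos_ROOT="$KOKKOS_INSTALL" \
        -DKokkosKernels_ENABLE_PERFTESTS=ON \
        -DKokkosKernels_INST_OFFSET_SIZE_T=ON \
        -DKokkosKernels_INST_OFFSET_INT=OFF
    printf '\n'
}

# Dry run: show the command so it can be checked before it is executed.
configure_cmd
```

Once the printed command looks right for your setup, it can be pasted into the terminal, followed by a rebuild of the sparse_spgemm target.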

cwpearson commented 2 months ago

@brian-kelley is correct: for uk-2005, the correct number of nonzeros in C is 8972400198, which overflows a 32-bit offset.
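
The two numbers in this thread can be checked against each other with a little arithmetic. The sketch below assumes the 32-bit count simply kept the low 32 bits of the true value (signed overflow is formally undefined in C++, so this is an interpretation of the log, not a guarantee), and that the shell evaluates `$(( ))` with at least 64-bit integers. The function name wrap32 is made up for this illustration.

```shell
#!/bin/sh
true_nnz=8972400198   # nonzeros in C = A*A for uk-2005, as stated in this thread
reported=382465606    # "C SIZE" printed in the verbose log

# Keep only the low 32 bits of a count, i.e. reduce it modulo 2^32.
wrap32() {
    echo $(( $1 % (1 << 32) ))
}

wrapped=$(wrap32 "$true_nnz")
echo "wrapped nnz: $wrapped"

if [ "$wrapped" -eq "$reported" ]; then
    echo "matches the C SIZE in the log"
fi
```

Since 8972400198 = 2 * 4294967296 + 382465606, the wrapped value is exactly the C SIZE reported before the segfault.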

The code linked here determines nnz, but it uses the row map value type, which is 32-bit by default. https://github.com/kokkos/kokkos-kernels/blob/b2210058826672c8de838541a36f7b946ecbb79a/sparse/impl/KokkosSparse_spgemm_impl_symbolic.hpp#L1954-L1955

This is causing some downstream effects where the resulting allocation for the product matrix isn't large enough, leading to an out of bounds access. I think we can't even catch this case without adding an allocation or expensive runtime checks in one way or another.
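
To get a feel for how far off the allocation is, here is a back-of-the-envelope estimate. It assumes C's entries and values arrays are each sized by the nonzero count and uses the element sizes printed in the verbose log (sizeof(scalar_t): 8, sizeof(nnz_lno_t): 4). It ignores the row map, the input matrix, and any workspace, so the real footprint would be larger. The function name entries_values_bytes is made up for this illustration.

```shell
#!/bin/sh
nnz_true=8972400198     # correct nonzero count of C, from this thread
nnz_wrapped=382465606   # "C SIZE" from the log, after 32-bit overflow
scalar_bytes=8          # sizeof(scalar_t) from the log
index_bytes=4           # sizeof(nnz_lno_t) from the log

# Bytes needed for the entries (column indices) plus values arrays
# of a CSR matrix with the given number of nonzeros.
entries_values_bytes() {
    echo $(( $1 * (scalar_bytes + index_bytes) ))
}

allocated=$(entries_values_bytes "$nnz_wrapped")
needed=$(entries_values_bytes "$nnz_true")
gib=$(( 1 << 30 ))

echo "allocated: $allocated bytes (about $(( allocated / gib )) GiB)"
echo "needed:    $needed bytes (about $(( needed / gib )) GiB)"
```

Under these assumptions the numeric phase writes into arrays sized for roughly 4 GiB when about 100 GiB is needed, which is consistent with the out-of-bounds access described above and with the earlier doubt about whether C fits in memory at all.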

cwpearson commented 2 months ago

This seems to work with the following CMake options:

-DKokkosKernels_INST_OFFSET_SIZE_T=ON
-DKokkosKernels_INST_OFFSET_INT=OFF

cwpearson commented 2 months ago

For posterity, here's a reproducer:

#! /bin/bash

set -euo pipefail

ROOT=$HOME/proj/kk-issue-2291
KOKKOS_SRC=$ROOT/kokkos
KOKKOS_BUILD=$ROOT/build-kokkos
KOKKOS_INSTALL=$ROOT/install-kokkos
KERNELS_SRC=$ROOT/kernels
KERNELS_BUILD=$ROOT/build-kernels

mkdir -p "$ROOT"
git clone https://github.com/kokkos/kokkos.git $KOKKOS_SRC || true
git clone git@github.com:cwpearson/kokkos-kernels.git $KERNELS_SRC || true

if [ ! -d $KOKKOS_INSTALL ]; then

    cmake -S $KOKKOS_SRC -B $KOKKOS_BUILD \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        -DKokkos_ENABLE_OPENMP=ON \
        -DCMAKE_INSTALL_PREFIX=$KOKKOS_INSTALL

    nice -n20 cmake --build $KOKKOS_BUILD --target install --parallel 52
fi

cmake -S $KERNELS_SRC -B $KERNELS_BUILD \
 -DCMAKE_BUILD_TYPE=RelWithDebInfo \
 -DKokkos_ROOT=$KOKKOS_INSTALL \
 -DKokkosKernels_ENABLE_TESTS=ON \
 -DKokkosKernels_ENABLE_BENCHMARK=ON \
 -DKokkosKernels_ENABLE_PERFTESTS=ON

nice -n20 cmake --build $KERNELS_BUILD/perf_test/sparse --target sparse_spgemm --parallel 52

if [ ! -d uk-2005 ]; then
    wget --continue https://suitesparse-collection-website.herokuapp.com/MM/LAW/uk-2005.tar.gz
    tar -xvf uk-2005.tar.gz
fi

$KERNELS_BUILD/perf_test/sparse/sparse_spgemm --amtx uk-2005/uk-2005.mtx --algorithm KKMEM --verbose --openmp 48