kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels
Other
294 stars 93 forks source link

Nightly cuda/12.0, cuda/11.8 unit test failures #1663

Open ndellingwood opened 1 year ago

ndellingwood commented 1 year ago

Sub-tests are failing in cuda/12.0 builds with the batched_dla_cuda and batched_gemm_cuda unit tests with error message cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address

batched_dla_cuda

00:37:42 3: [ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
00:37:42 3: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 3: Backtrace:
00:37:42 3:                         [0x675133]
00:37:42 3:                         [0x66d398]
00:37:42 3:                         [0x66d3cb]
00:37:42 3:                         [0x67b0fd]
00:37:42 3:                         [0x67bc91]
00:37:42 3:                         [0x651a65]
00:37:42 3:                         [0x5fd33a]
00:37:42 3:                         [0x43c787]
00:37:42 3:                         [0x64f34d]
00:37:42 3:                         [0x6432f3]
00:37:42 3:                         [0x6437a5]
00:37:42 3:                         [0x64398e]
00:37:42 3:                         [0x649a56]
00:37:42 3:                         [0x649cdb]
00:37:42 3:                         [0x413232]
00:37:42 3: __libc_start_main [0x7f7d99a18555]
00:37:42 3:                         [0x41acad]

batched_gemm_cuda

00:37:42 4: [ RUN      ] Cuda.batched_scalar_serial_gemm_nt_nt_dcomplex_dcomplex
00:37:42 4: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 4: Backtrace:
00:37:42 4:                         [0x7c22d3]
00:37:42 4:                         [0x7ba538]
00:37:42 4:                         [0x7ba56b]
00:37:42 4:                         [0x7c829d]
00:37:42 4:                         [0x7c8e31]
00:37:42 4:                         [0x79eb35]
00:37:42 4:                         [0x45e0b1]
00:37:42 4:                         [0x6fda73]
00:37:42 4:                         [0x6fe1d6]
00:37:42 4:                         [0x79c41d]
00:37:42 4:                         [0x7903c3]
00:37:42 4:                         [0x790875]
00:37:42 4:                         [0x790a5e]
00:37:42 4:                         [0x796b26]
00:37:42 4:                         [0x796dab]
00:37:42 4:                         [0x40e032]
00:37:42 4: __libc_start_main [0x7fb3be79b555]
00:37:42 4:                         [0x4170dd]

Reproducer (kokkos-dev-2):

source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh ; module use /home/projects/x86-64/modulefiles/local
module purge
module load sems-archive-env sems-archive-cmake/3.17.1 sems-archive-gcc/9.2.0 cuda/12.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="17" --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-cuda-options=enable_lambda   --no-examples
e10harvey commented 11 months ago

@lucbv: Do you have any notes on this so I can pickup from where you left off or do you want to pair up?

e10harvey commented 11 months ago

Notes:

e10harvey commented 11 months ago

Relevant snippet from memcheck:

========= Invalid __local__ read of size 16 bytes
=========     at 0xdeadbeef in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Test::SerialInverseLU::Functor_BatchedSerialGemm<Kokkos::Cuda, Kokkos::View<Kokkos::complex<double> ***, Kokkos::LayoutLeft, Kokkos::Cuda>, Kokkos::complex<double>, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>, KokkosBlas::Algo::Level3::Blocked>, Kokkos::RangePolicy<Kokkos::Cuda, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>>, Kokkos::Cuda>>(T1

Note that all inverselu invalid reads come from the Blocked algo type.

e10harvey commented 11 months ago

Note: Cuda/12 wants all addresses 16-byte aligned but, in the BatchedSerialGemm Blocked implementation, we de-reference a address that is 8-byte aligned.

TODO: Print out pointer scalar types and their size as well as the starting addresses of views/subviews.

e10harvey commented 11 months ago

After more debugging I have determined that the misalignment is stemming from Functor_BatchedSerialGemm in Test_Batched_SerialInverseLU.hpp of an address outside the control of the parallel_for caller.

e10harvey commented 11 months ago

Given that the functor in question does not use any addresses that are violating 16-byte alignment nor do locals (&_alpha or &_beta) violate 16-byte alignment, I believe this is either a Kokkos Core or a compiler bug. Regardless of where the bug stems from, we should ask someone from Cuda or Kokkos Core to investigate further.

e10harvey commented 11 months ago

Here are more triaging results. Note that local memory can only be allocated by the compiler.

  1. Christian and I tried moving Scalar _alpha, _beta above the declaration of the _a, _b, _c locals in the functor class definition:
    template <typename DeviceType, typename ViewType, typename ScalarType,
          typename ParamTagType, typename AlgoTagType>
    struct Functor_BatchedSerialGemm {
    ScalarType _alpha, _beta;
    ViewType _a, _b, _c;

    This change resulted in passing tests in cuda/12.0.

e10harvey commented 11 months ago

The (register allocation bug?) still persists in cuda/12.2.

KokkosKernels HEAD SHA: 6c06bd024bbcb48b1ca6bef165bd13e73a3c3b44 Kokkos HEAD SHA: 7e299b4e25c42528e105379c3aa9a318056545ba

Local changes in KokkosKernels: kk_local_changes.txt

Local change in Kokkos: none.

module load sems-archive-env sems-archive-cmake/3.17.1 gcc/11 nvhpc/23.7
make -j16 KokkosKernels_batched_dla_cuda
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
mark2
i:0
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a0e23] 
[0x69b148] 
[0x69b17b] 
[0x6a6ee7] 
[0x6a786d] 
[0x67f4a5] 
[0x65961a] 
[0x47829d] 
[0x67cf8d] 
[0x672b98] 
[0x67332d] 
[0x673544] 
[0x6773d4] 
[0x672316] 
[0x4193c2] 
[0x7fd63cc7d555] __libc_start_main
[0x420b6d] 
Aborted (core dumped)
  1. Here is some additional sizeof and aligof information using cuda/12.2:
    
    $ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
    Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
    [==========] Running 1 test from 1 test case.
    [----------] Global test environment set-up.
    [----------] 1 test from Cuda
    [ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    mark2
    i:0
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:1
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df36680
    c0:0x7fc68df3ea80
    w:0x7fc68df3a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:2
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df42680
    c0:0x7fc68df62a80
    w:0x7fc68df52880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:3
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df56680
    c0:0x7fc68df9ea80
    w:0x7fc68df7a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:4
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df72680
    c0:0x7fc67d400080
    w:0x7fc68dfb2880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:5
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df96680
    c0:0x7fc67d464280
    w:0x7fc67d400080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:6
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67d520480
    w:0x7fc67d490280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:7
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67dc00080
    w:0x7fc67d4c4280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:8
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:9
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    mark2
    i:0
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:1
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df36680
    c0:0x7fc68df3ea80
    w:0x7fc68df3a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:2
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df42680
    c0:0x7fc68df62a80
    w:0x7fc68df52880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:3
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df56680
    c0:0x7fc68df9ea80
    w:0x7fc68df7a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:4
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df72680
    c0:0x7fc67d400080
    w:0x7fc68dfb2880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:5
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df96680
    c0:0x7fc67d464280
    w:0x7fc67d400080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:6
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67d520480
    w:0x7fc67d490280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:7
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67dc00080
    w:0x7fc67d4c4280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:8
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:9
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    [       OK ] Cuda.batched_scalar_serial_inverselu_dcomplex (116 ms)
    [----------] 1 test from Cuda (116 ms total)

[----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (116 ms total) [ PASSED ] 1 test.


NOTE: You have to comment out the following prints in the operator to trigger misalignment:

KOKKOS_INLINE_FUNCTION void operator()(const ParamTagType &, const int k) const { auto aa = Kokkos::subview(_a, k, Kokkos::ALL(), Kokkos::ALL()); auto bb = Kokkos::subview(_b, k, Kokkos::ALL(), Kokkos::ALL()); auto cc = Kokkos::subview(_c, k, Kokkos::ALL(), Kokkos::ALL());

/* if (k == 0) {
  printf("In Operator: alignof(decltype(*this)):%lu\n", alignof(decltype(*this)));
  printf("In Operator: alignof(decltype(ViewType)):%lu\n", alignof(ViewType));
  printf("In Operator: alignof(decltype(ScalarType)):%lu\n", alignof(ScalarType));
} */

$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='Cuda.batched_scalar_serial_inverselu_dcomplex' Note: Google Test filter = Cuda.batched_scalar_serial_inverselu_dcomplex [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from Cuda [ RUN ] Cuda.batched_scalar_serial_inverselu_dcomplex mark0 mark1 a0:0x7ff6a7f32480 a1:0x7ff6a7f32680 c0:0x7ff6a7f32a80 w:0x7ff6a7f32880 sizeof(AViewType::value_type):16 sizeof(ViewType):40 alignof(decltype(this)):16 alignof(decltype(ViewType)):8 alignof(decltype(ScalarType)):16 &_alpha:0x7ffd8a1ea7b0 &_beta:0x7ffd8a1ea7c0 mark2 i:0 mark0 mark1 a0:0x7ff6a7f32480 a1:0x7ff6a7f32680 c0:0x7ff6a7f32a80 w:0x7ff6a7f32880 sizeof(AViewType::value_type):16 sizeof(ViewType):40 alignof(decltype(this)):16 alignof(decltype(ViewType)):8 alignof(decltype(ScalarType)):16 &_alpha:0x7ffd8a1ea7b0 &_beta:0x7ffd8a1ea7c0 (CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153 Backtrace: [0x6a11b3] [0x69b4d8] [0x69b50b] [0x6a7277] [0x6a7bfd] [0x67f835] [0x65996e] [0x47828d] [0x67d31d] [0x672f28] [0x6736bd] [0x6738d4] [0x677764] [0x6726a6] [0x419382] [0x7ff6d284e555] __libc_start_main [0x420b2d] Aborted (core dumped)

matt-stack commented 10 months ago

Hello, I am looking into this bug, and came across something I found strange. If you keep all the source for the test the same, but take out one Kokkos::abort, then it seems to not hit this error message. Does anyone have an idea why that would be?

change the abort here to just return 0; or comment it out entirely.

  if (!(m <= 2 && n <= 2))
   Kokkos::abort(
        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

to

  if (!(m <= 2 && n <= 2)) return 0;
//    Kokkos::abort(
//        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

And on my machine I get no error. Because of the lack of abort, am I just missing a cudaCheckLastError call or something like that? I cant tell yet if the Kokkos::abort is an issue here, or its causing me to miss the trigger for the bug, or its not printing the Cuda error. Though when I searched through the src for cuda_abort, it looks like it just prints the message you give it. @crtrott for vis

ndellingwood commented 7 months ago

Just to update, these two tests fail with cd8f77c1c61c45bd8071bc7870b55bd045a727c9 when enabling complex_double types in builds with c++20 enabled as well using for example cuda/12.0.0 + gcc/11.3.0

ndellingwood commented 7 months ago

If I configure with the option -DKokkos_ENABLE_COMPLEX_ALIGN=OFF then the tests posted above pass Adding @crtrott @dalg24 @masterleinad to the loop

ndellingwood commented 3 months ago

The same tests fail with cuda/11.8.0 when testing with cusparse and magma tpls enabled

ndellingwood commented 1 month ago

Updating the issue to confirm the same tests still fail with cuda/11.8.0, cuda/12.0 +/- c++20 on Weaver (Volta70+Power9) with SHA 32aa75a8f20ca88df64bde421c335b9fa6f68397

Configuration (Weaver, cuda/12.0 w/ c++20):

bsub -Is -n 1 -q rhel8 -gpu "num=1" bash

source /etc/profile.d/modules.sh
module load cmake git gcc/11.3.0 cuda/12.0.0

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --cxxstandard=20

Test failures:

16:17:09 The following tests FAILED:
16:17:09      3 - batched_dla_cuda (Subprocess aborted)
16:17:09      4 - batched_gemm_cuda (Subprocess aborted)
ndellingwood commented 1 month ago

The tests above passed on kokkos-dev-2 with sems-cuda/12.4 + sems-gcc/13.2.0

lucbv commented 1 month ago

@ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

ndellingwood commented 1 month ago

@ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

@lucbv on kokkos-dev-2 the configuration here (with Power9 dropped), using sems-cuda/12.4, the tests passed 100%