kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

Nightly cuda/12.0, cuda/11.8 unit test failures #1663

Open ndellingwood opened 1 year ago

ndellingwood commented 1 year ago

Sub-tests of the batched_dla_cuda and batched_gemm_cuda unit tests are failing in cuda/12.0 builds with the error message cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address

batched_dla_cuda

00:37:42 3: [ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
00:37:42 3: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 3: Backtrace:
00:37:42 3:                         [0x675133]
00:37:42 3:                         [0x66d398]
00:37:42 3:                         [0x66d3cb]
00:37:42 3:                         [0x67b0fd]
00:37:42 3:                         [0x67bc91]
00:37:42 3:                         [0x651a65]
00:37:42 3:                         [0x5fd33a]
00:37:42 3:                         [0x43c787]
00:37:42 3:                         [0x64f34d]
00:37:42 3:                         [0x6432f3]
00:37:42 3:                         [0x6437a5]
00:37:42 3:                         [0x64398e]
00:37:42 3:                         [0x649a56]
00:37:42 3:                         [0x649cdb]
00:37:42 3:                         [0x413232]
00:37:42 3: __libc_start_main [0x7f7d99a18555]
00:37:42 3:                         [0x41acad]

batched_gemm_cuda

00:37:42 4: [ RUN      ] Cuda.batched_scalar_serial_gemm_nt_nt_dcomplex_dcomplex
00:37:42 4: cudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address /home/jenkins/jenkins-new/workspace/KokkosKernels_KokkosDev2_CUDA120_GCC92_cpp17/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:132
00:37:42 4: Backtrace:
00:37:42 4:                         [0x7c22d3]
00:37:42 4:                         [0x7ba538]
00:37:42 4:                         [0x7ba56b]
00:37:42 4:                         [0x7c829d]
00:37:42 4:                         [0x7c8e31]
00:37:42 4:                         [0x79eb35]
00:37:42 4:                         [0x45e0b1]
00:37:42 4:                         [0x6fda73]
00:37:42 4:                         [0x6fe1d6]
00:37:42 4:                         [0x79c41d]
00:37:42 4:                         [0x7903c3]
00:37:42 4:                         [0x790875]
00:37:42 4:                         [0x790a5e]
00:37:42 4:                         [0x796b26]
00:37:42 4:                         [0x796dab]
00:37:42 4:                         [0x40e032]
00:37:42 4: __libc_start_main [0x7fb3be79b555]
00:37:42 4:                         [0x4170dd]

Reproducer (kokkos-dev-2):

source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh ; module use /home/projects/x86-64/modulefiles/local
module purge
module load sems-archive-env sems-archive-cmake/3.17.1 sems-archive-gcc/9.2.0 cuda/12.0

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-devices=Cuda,Serial --arch=Volta70 --compiler=$KOKKOS_PATH/bin/nvcc_wrapper --cxxflags="-O3 -Wall -Wunused-parameter -Wshadow -pedantic -Werror -Wsign-compare -Wtype-limits -Wuninitialized " --cxxstandard="17" --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --with-layouts=LayoutLeft --with-cuda-options=enable_lambda   --no-examples

e10harvey commented 1 year ago

@lucbv: Do you have any notes on this so I can pick up from where you left off, or do you want to pair up?

e10harvey commented 1 year ago

Notes:

e10harvey commented 1 year ago

Relevant snippet from memcheck:

========= Invalid __local__ read of size 16 bytes
=========     at 0xdeadbeef in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Test::SerialInverseLU::Functor_BatchedSerialGemm<Kokkos::Cuda, Kokkos::View<Kokkos::complex<double> ***, Kokkos::LayoutLeft, Kokkos::Cuda>, Kokkos::complex<double>, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>, KokkosBlas::Algo::Level3::Blocked>, Kokkos::RangePolicy<Kokkos::Cuda, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>>, Kokkos::Cuda>>(T1

Note that all inverselu invalid reads come from the Blocked algo type.

e10harvey commented 1 year ago

Note: Cuda/12 wants all addresses 16-byte aligned but, in the BatchedSerialGemm Blocked implementation, we dereference an address that is only 8-byte aligned.

TODO: Print out pointer scalar types and their size as well as the starting addresses of views/subviews.

e10harvey commented 1 year ago

After more debugging, I have determined that the misalignment stems from Functor_BatchedSerialGemm in Test_Batched_SerialInverseLU.hpp and involves an address outside the control of the parallel_for caller.

e10harvey commented 1 year ago

Given that the functor in question does not use any addresses that are violating 16-byte alignment nor do locals (&_alpha or &_beta) violate 16-byte alignment, I believe this is either a Kokkos Core or a compiler bug. Regardless of where the bug stems from, we should ask someone from Cuda or Kokkos Core to investigate further.

e10harvey commented 1 year ago

Here are more triaging results. Note that local memory can only be allocated by the compiler.

  1. Christian and I tried moving the ScalarType members _alpha, _beta above the declaration of the _a, _b, _c members in the functor class definition:
    template <typename DeviceType, typename ViewType, typename ScalarType,
              typename ParamTagType, typename AlgoTagType>
    struct Functor_BatchedSerialGemm {
      ScalarType _alpha, _beta;
      ViewType _a, _b, _c;

    This change resulted in passing tests in cuda/12.0.

e10harvey commented 1 year ago

The problem (a register allocation bug?) still persists in cuda/12.2.

KokkosKernels HEAD SHA: 6c06bd024bbcb48b1ca6bef165bd13e73a3c3b44

Kokkos HEAD SHA: 7e299b4e25c42528e105379c3aa9a318056545ba

Local changes in KokkosKernels: kk_local_changes.txt

Local change in Kokkos: none.

module load sems-archive-env sems-archive-cmake/3.17.1 gcc/11 nvhpc/23.7
make -j16 KokkosKernels_batched_dla_cuda
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
mark2
i:0
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a0e23] 
[0x69b148] 
[0x69b17b] 
[0x6a6ee7] 
[0x6a786d] 
[0x67f4a5] 
[0x65961a] 
[0x47829d] 
[0x67cf8d] 
[0x672b98] 
[0x67332d] 
[0x673544] 
[0x6773d4] 
[0x672316] 
[0x4193c2] 
[0x7fd63cc7d555] __libc_start_main
[0x420b6d] 
Aborted (core dumped)
  1. Here is some additional sizeof and alignof information using cuda/12.2:
    
    $ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
    Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
    [==========] Running 1 test from 1 test case.
    [----------] Global test environment set-up.
    [----------] 1 test from Cuda
    [ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    mark2
    i:0
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:1
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df36680
    c0:0x7fc68df3ea80
    w:0x7fc68df3a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:2
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df42680
    c0:0x7fc68df62a80
    w:0x7fc68df52880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:3
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df56680
    c0:0x7fc68df9ea80
    w:0x7fc68df7a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:4
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df72680
    c0:0x7fc67d400080
    w:0x7fc68dfb2880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:5
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df96680
    c0:0x7fc67d464280
    w:0x7fc67d400080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:6
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67d520480
    w:0x7fc67d490280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:7
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67dc00080
    w:0x7fc67d4c4280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:8
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:9
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    mark2
    i:0
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df32680
    c0:0x7fc68df32a80
    w:0x7fc68df32880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:1
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df36680
    c0:0x7fc68df3ea80
    w:0x7fc68df3a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:2
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df42680
    c0:0x7fc68df62a80
    w:0x7fc68df52880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:3
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df56680
    c0:0x7fc68df9ea80
    w:0x7fc68df7a880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:4
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df72680
    c0:0x7fc67d400080
    w:0x7fc68dfb2880
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:5
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc68df96680
    c0:0x7fc67d464280
    w:0x7fc67d400080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:6
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67d520480
    w:0x7fc67d490280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:7
    mark0
    mark1
    a0:0x7fc68df32480
    a1:0x7fc67d400080
    c0:0x7fc67dc00080
    w:0x7fc67d4c4280
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:8
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    i:9
    mark0
    mark1
    a0:0x7fc67d400080
    a1:0x7fc67dc00080
    c0:0x7fc67e000080
    w:0x7fc67de00080
    sizeof(AViewType::value_type):16
    sizeof(ViewType):40
    alignof(decltype(*this)):16
    alignof(decltype(ViewType)):8
    alignof(decltype(ScalarType)):16
    &_alpha:0x7ffdfcd472f0
    &_beta:0x7ffdfcd47300
    In Operator: alignof(decltype(*this)):16
    In Operator: alignof(decltype(ViewType)):8
    In Operator: alignof(decltype(ScalarType)):16
    mark2
    [       OK ] Cuda.batched_scalar_serial_inverselu_dcomplex (116 ms)
    [----------] 1 test from Cuda (116 ms total)

    [----------] Global test environment tear-down
    [==========] 1 test from 1 test case ran. (116 ms total)
    [  PASSED  ] 1 test.


NOTE: You have to comment out the following prints in the operator to trigger misalignment:

KOKKOS_INLINE_FUNCTION
void operator()(const ParamTagType &, const int k) const {
  auto aa = Kokkos::subview(_a, k, Kokkos::ALL(), Kokkos::ALL());
  auto bb = Kokkos::subview(_b, k, Kokkos::ALL(), Kokkos::ALL());
  auto cc = Kokkos::subview(_c, k, Kokkos::ALL(), Kokkos::ALL());

/* if (k == 0) {
  printf("In Operator: alignof(decltype(*this)):%lu\n", alignof(decltype(*this)));
  printf("In Operator: alignof(decltype(ViewType)):%lu\n", alignof(ViewType));
  printf("In Operator: alignof(decltype(ScalarType)):%lu\n", alignof(ScalarType));
} */

$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN      ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7ff6a7f32480
a1:0x7ff6a7f32680
c0:0x7ff6a7f32a80
w:0x7ff6a7f32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffd8a1ea7b0
&_beta:0x7ffd8a1ea7c0
mark2
i:0
mark0
mark1
a0:0x7ff6a7f32480
a1:0x7ff6a7f32680
c0:0x7ff6a7f32a80
w:0x7ff6a7f32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffd8a1ea7b0
&_beta:0x7ffd8a1ea7c0
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a11b3]
[0x69b4d8]
[0x69b50b]
[0x6a7277]
[0x6a7bfd]
[0x67f835]
[0x65996e]
[0x47828d]
[0x67d31d]
[0x672f28]
[0x6736bd]
[0x6738d4]
[0x677764]
[0x6726a6]
[0x419382]
[0x7ff6d284e555] __libc_start_main
[0x420b2d]
Aborted (core dumped)

matt-stack commented 1 year ago

Hello, I am looking into this bug, and came across something I found strange. If you keep all the source for the test the same, but take out one Kokkos::abort, then it seems to not hit this error message. Does anyone have an idea why that would be?

Change the abort here to just return 0; or comment it out entirely.

  if (!(m <= 2 && n <= 2))
   Kokkos::abort(
        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

to

  if (!(m <= 2 && n <= 2)) return 0;
//    Kokkos::abort(
//        "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");

And on my machine I get no error. Because of the lack of abort, am I just missing a cudaCheckLastError call or something like that? I can't tell yet whether the Kokkos::abort is the issue here, whether removing it causes me to miss the trigger for the bug, or whether the Cuda error is simply not being printed. Though when I searched through the src for cuda_abort, it looks like it just prints the message you give it. @crtrott for vis

ndellingwood commented 11 months ago

Just to update: these two tests also fail with cd8f77c1c61c45bd8071bc7870b55bd045a727c9 when complex_double types are enabled in builds with c++20 enabled, for example using cuda/12.0.0 + gcc/11.3.0.

ndellingwood commented 11 months ago

If I configure with the option -DKokkos_ENABLE_COMPLEX_ALIGN=OFF, then the tests posted above pass.

Adding @crtrott @dalg24 @masterleinad to the loop.

ndellingwood commented 7 months ago

The same tests fail with cuda/11.8.0 when testing with the cusparse and magma TPLs enabled.

ndellingwood commented 5 months ago

Updating the issue to confirm the same tests still fail with cuda/11.8.0, cuda/12.0 +/- c++20 on Weaver (Volta70+Power9) with SHA 32aa75a8f20ca88df64bde421c335b9fa6f68397

Configuration (Weaver, cuda/12.0 w/ c++20):

bsub -Is -n 1 -q rhel8 -gpu "num=1" bash

source /etc/profile.d/modules.sh
module load cmake git gcc/11.3.0 cuda/12.0.0

${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --cxxstandard=20

Test failures:

16:17:09 The following tests FAILED:
16:17:09      3 - batched_dla_cuda (Subprocess aborted)
16:17:09      4 - batched_gemm_cuda (Subprocess aborted)

ndellingwood commented 5 months ago

The tests above passed on kokkos-dev-2 with sems-cuda/12.4 + sems-gcc/13.2.0

lucbv commented 5 months ago

@ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

ndellingwood commented 5 months ago

> @ndellingwood so with cuda 12.4 we have the batched_dla_cuda and batched_gemm_cuda working correctly? Anything else failing on that platform?

@lucbv on kokkos-dev-2, with the configuration here (Power9 dropped) and sems-cuda/12.4, the tests passed 100%.