Open ndellingwood opened 1 year ago
@lucbv: Do you have any notes on this so I can pickup from where you left off or do you want to pair up?
Notes:
issue1663
build directory.Relevant snippet from memcheck:
========= Invalid __local__ read of size 16 bytes
========= at 0xdeadbeef in void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<Test::SerialInverseLU::Functor_BatchedSerialGemm<Kokkos::Cuda, Kokkos::View<Kokkos::complex<double> ***, Kokkos::LayoutLeft, Kokkos::Cuda>, Kokkos::complex<double>, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>, KokkosBlas::Algo::Level3::Blocked>, Kokkos::RangePolicy<Kokkos::Cuda, Test::SerialInverseLU::ParamTag<KokkosBlas::Trans::NoTranspose, KokkosBlas::Trans::NoTranspose>>, Kokkos::Cuda>>(T1
Note that all inverselu invalid reads come from the Blocked algo type.
Note: Cuda/12 wants all addresses 16-byte aligned but, in the BatchedSerialGemm Blocked implementation, we de-reference a address that is 8-byte aligned.
TODO: Print out pointer scalar types and their size as well as the starting addresses of views/subviews.
After more debugging I have determined that the misalignment is stemming from Functor_BatchedSerialGemm
in Test_Batched_SerialInverseLU.hpp of an address outside the control of the parallel_for caller.
Given that the functor in question does not use any addresses that are violating 16-byte alignment nor do locals (&_alpha
or &_beta
) violate 16-byte alignment, I believe this is either a Kokkos Core or a compiler bug. Regardless of where the bug stems from, we should ask someone from Cuda or Kokkos Core to investigate further.
Here are more triaging results. Note that local memory can only be allocated by the compiler.
Scalar _alpha, _beta
above the declaration of the _a, _b, _c locals in the functor class definition:
template <typename DeviceType, typename ViewType, typename ScalarType,
typename ParamTagType, typename AlgoTagType>
struct Functor_BatchedSerialGemm {
ScalarType _alpha, _beta;
ViewType _a, _b, _c;
This change resulted in passing tests in cuda/12.0.
The (register allocation bug?) still persists in cuda/12.2.
KokkosKernels HEAD SHA: 6c06bd024bbcb48b1ca6bef165bd13e73a3c3b44 Kokkos HEAD SHA: 7e299b4e25c42528e105379c3aa9a318056545ba
Local changes in KokkosKernels: kk_local_changes.txt
Local change in Kokkos: none.
module load sems-archive-env sems-archive-cmake/3.17.1 gcc/11 nvhpc/23.7
make -j16 KokkosKernels_batched_dla_cuda
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
mark2
i:0
mark0
mark1
a0:0x7fd615f32480
a1:0x7fd615f32680
c0:0x7fd615f32a80
w:0x7fd615f32880
sizeof(AViewType::value_type):16
&_alpha:0x7ffc56298260
&_beta:0x7ffc56298270
(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153
Backtrace:
[0x6a0e23]
[0x69b148]
[0x69b17b]
[0x6a6ee7]
[0x6a786d]
[0x67f4a5]
[0x65961a]
[0x47829d]
[0x67cf8d]
[0x672b98]
[0x67332d]
[0x673544]
[0x6773d4]
[0x672316]
[0x4193c2]
[0x7fd63cc7d555] __libc_start_main
[0x420b6d]
Aborted (core dumped)
sizeof
and aligof
information using cuda/12.2:
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='*Cuda.batched_scalar_serial_inverselu_dcomplex*'
Note: Google Test filter = *Cuda.batched_scalar_serial_inverselu_dcomplex*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Cuda
[ RUN ] Cuda.batched_scalar_serial_inverselu_dcomplex
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
mark2
i:0
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:1
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df36680
c0:0x7fc68df3ea80
w:0x7fc68df3a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df42680
c0:0x7fc68df62a80
w:0x7fc68df52880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:3
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df56680
c0:0x7fc68df9ea80
w:0x7fc68df7a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:4
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df72680
c0:0x7fc67d400080
w:0x7fc68dfb2880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:5
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df96680
c0:0x7fc67d464280
w:0x7fc67d400080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:6
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67d520480
w:0x7fc67d490280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:7
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67dc00080
w:0x7fc67d4c4280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:8
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:9
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
mark2
i:0
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df32680
c0:0x7fc68df32a80
w:0x7fc68df32880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:1
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df36680
c0:0x7fc68df3ea80
w:0x7fc68df3a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:2
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df42680
c0:0x7fc68df62a80
w:0x7fc68df52880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:3
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df56680
c0:0x7fc68df9ea80
w:0x7fc68df7a880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:4
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df72680
c0:0x7fc67d400080
w:0x7fc68dfb2880
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:5
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc68df96680
c0:0x7fc67d464280
w:0x7fc67d400080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:6
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67d520480
w:0x7fc67d490280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:7
mark0
mark1
a0:0x7fc68df32480
a1:0x7fc67d400080
c0:0x7fc67dc00080
w:0x7fc67d4c4280
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:8
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
i:9
mark0
mark1
a0:0x7fc67d400080
a1:0x7fc67dc00080
c0:0x7fc67e000080
w:0x7fc67de00080
sizeof(AViewType::value_type):16
sizeof(ViewType):40
alignof(decltype(*this)):16
alignof(decltype(ViewType)):8
alignof(decltype(ScalarType)):16
&_alpha:0x7ffdfcd472f0
&_beta:0x7ffdfcd47300
In Operator: alignof(decltype(*this)):16
In Operator: alignof(decltype(ViewType)):8
In Operator: alignof(decltype(ScalarType)):16
mark2
[ OK ] Cuda.batched_scalar_serial_inverselu_dcomplex (116 ms)
[----------] 1 test from Cuda (116 ms total)
[----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (116 ms total) [ PASSED ] 1 test.
NOTE: You have to comment out the following prints in the operator to trigger misalignment:
KOKKOS_INLINE_FUNCTION void operator()(const ParamTagType &, const int k) const { auto aa = Kokkos::subview(_a, k, Kokkos::ALL(), Kokkos::ALL()); auto bb = Kokkos::subview(_b, k, Kokkos::ALL(), Kokkos::ALL()); auto cc = Kokkos::subview(_c, k, Kokkos::ALL(), Kokkos::ALL());
/* if (k == 0) {
printf("In Operator: alignof(decltype(*this)):%lu\n", alignof(decltype(*this)));
printf("In Operator: alignof(decltype(ViewType)):%lu\n", alignof(ViewType));
printf("In Operator: alignof(decltype(ScalarType)):%lu\n", alignof(ScalarType));
} */
$ ./batched/dense/unit_test/KokkosKernels_batched_dla_cuda --gtest_filter='Cuda.batched_scalar_serial_inverselu_dcomplex' Note: Google Test filter = Cuda.batched_scalar_serial_inverselu_dcomplex [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from Cuda [ RUN ] Cuda.batched_scalar_serial_inverselu_dcomplex mark0 mark1 a0:0x7ff6a7f32480 a1:0x7ff6a7f32680 c0:0x7ff6a7f32a80 w:0x7ff6a7f32880 sizeof(AViewType::value_type):16 sizeof(ViewType):40 alignof(decltype(this)):16 alignof(decltype(ViewType)):8 alignof(decltype(ScalarType)):16 &_alpha:0x7ffd8a1ea7b0 &_beta:0x7ffd8a1ea7c0 mark2 i:0 mark0 mark1 a0:0x7ff6a7f32480 a1:0x7ff6a7f32680 c0:0x7ff6a7f32a80 w:0x7ff6a7f32880 sizeof(AViewType::value_type):16 sizeof(ViewType):40 alignof(decltype(this)):16 alignof(decltype(ViewType)):8 alignof(decltype(ScalarType)):16 &_alpha:0x7ffd8a1ea7b0 &_beta:0x7ffd8a1ea7c0 (CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorMisalignedAddress): misaligned address /ascldap/users/eharvey/KOKKOS.base/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:153 Backtrace: [0x6a11b3] [0x69b4d8] [0x69b50b] [0x6a7277] [0x6a7bfd] [0x67f835] [0x65996e] [0x47828d] [0x67d31d] [0x672f28] [0x6736bd] [0x6738d4] [0x677764] [0x6726a6] [0x419382] [0x7ff6d284e555] __libc_start_main [0x420b2d] Aborted (core dumped)
Hello, I am looking into this bug, and came across something I found strange. If you keep all the source for the test the same, but take out one Kokkos::abort, then it seems to not hit this error message. Does anyone have an idea why that would be?
change the abort here to just return 0; or comment it out entirely.
if (!(m <= 2 && n <= 2))
Kokkos::abort(
"InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");
to
if (!(m <= 2 && n <= 2)) return 0;
// Kokkos::abort(
// "InnerGemmFixC<2,2>::serial_invoke, assert failure (m<=2 && n<=2)");
And on my machine I get no error. Because of the lack of abort, am I just missing a cudaCheckLastError call or something like that? I cant tell yet if the Kokkos::abort is an issue here, or its causing me to miss the trigger for the bug, or its not printing the Cuda error. Though when I searched through the src for cuda_abort, it looks like it just prints the message you give it. @crtrott for vis
Just to update, these two tests fail with cd8f77c1c61c45bd8071bc7870b55bd045a727c9 when enabling complex_double types in builds with c++20 enabled as well using for example cuda/12.0.0 + gcc/11.3.0
If I configure with the option -DKokkos_ENABLE_COMPLEX_ALIGN=OFF
then the tests posted above pass
Adding @crtrott @dalg24 @masterleinad to the loop
The same tests fail with cuda/11.8.0 when testing with cusparse and magma tpls enabled
Updating the issue to confirm the same tests still fail with cuda/11.8.0, cuda/12.0 +/- c++20 on Weaver (Volta70+Power9) with SHA 32aa75a8f20ca88df64bde421c335b9fa6f68397
Configuration (Weaver, cuda/12.0 w/ c++20):
bsub -Is -n 1 -q rhel8 -gpu "num=1" bash
source /etc/profile.d/modules.sh
module load cmake git gcc/11.3.0 cuda/12.0.0
${KOKKOSKERNELS_PATH}/cm_generate_makefile.bash --with-cuda --with-serial --compiler=${KOKKOS_PATH}/bin/nvcc_wrapper --arch=Volta70,Power9 --with-cuda-options=enable_lambda --kokkos-path=${KOKKOS_PATH} --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars='double,complex_double' --with-ordinals=int --with-offsets=int,size_t --cxxstandard=20
Test failures:
16:17:09 The following tests FAILED:
16:17:09 3 - batched_dla_cuda (Subprocess aborted)
16:17:09 4 - batched_gemm_cuda (Subprocess aborted)
The tests above passed on kokkos-dev-2 with sems-cuda/12.4 + sems-gcc/13.2.0
@ndellingwood so with cuda 12.4 we have the batched_dla_cuda
and batched_gemm_cuda
working correctly? Anything else failing on that platform?
@ndellingwood so with cuda 12.4 we have the
batched_dla_cuda
andbatched_gemm_cuda
working correctly? Anything else failing on that platform?
@lucbv on kokkos-dev-2 the configuration here (with Power9 dropped), using sems-cuda/12.4, the tests passed 100%
Sub-tests are failing in cuda/12.0 builds with the
batched_dla_cuda
andbatched_gemm_cuda
unit tests with error messagecudaDeviceSynchronize() error( cudaErrorMisalignedAddress): misaligned address
batched_dla_cuda
batched_gemm_cuda
Reproducer (kokkos-dev-2):