kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

Test failures in clang >= 10 + cuda builds #1485

Open ndellingwood opened 2 years ago

ndellingwood commented 2 years ago

In clang+cuda builds (clang/10+cuda/10.1 and clang/13+cuda/11.7 were tested), the following unit tests are failing on the develop and release-candidate-3.7.00 branches:

sparse_cuda:

[ RUN      ] cuda.sparse_spgemm_double_int_size_t_TestExecSpace
entries are different.
0 2 3 5 8 11 12 13 15 16 19 20 24 32 34 36 37 38 41 42 ... ... ... 9963 9966 9968 9969 9971 9973 9974 9975 9980 9981 9982 9983 9986 9987 9988 9991 9993 9994 9995 9999 
0 2 3 5 8 11 12 13 15 16 19 20 24 32 34 36 37 38 41 42 ... ... ... 9963 9966 9968 9969 9971 9973 9974 9975 9980 9981 9982 9983 9986 9987 9988 9991 9993 9994 9995 9999 
/ascldap/users/ndellin/kokkos-kernels/unit_test/sparse/Test_Sparse_spgemm.hpp:360: Failure
Value of: is_identical
  Actual: false
Expected: true
SPGEMM_KK
...
[ RUN      ] cuda.sparse_block_spgemm_double_int_size_t_TestExecSpace
entries are different.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 
/ascldap/users/ndellin/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:286: Failure
Value of: is_identical
  Actual: false
Expected: true
SPGEMM_KK
entries are different.
1 1 2 3 4 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... ... ... 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 
/ascldap/users/ndellin/kokkos-kernels/unit_test/sparse/Test_Sparse_bspgemm.hpp:286: Failure
Value of: is_identical
  Actual: false
Expected: true
SPGEMM_KK
...

[  FAILED  ] cuda.sparse_spgemm_double_int_size_t_TestExecSpace
[  FAILED  ] cuda.sparse_block_spgemm_double_int_size_t_TestExecSpace

batched_dla_cuda: timeout

[ RUN      ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_left
[       OK ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_left (104915 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_left
[       OK ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_left (105098 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_left
[       OK ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_left (104866 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_left
[       OK ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_left (105015 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_right
[       OK ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_right (115381 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_right
[       OK ] cuda.batched_scalar_batched_gemm_t_nt_bhalf_bhalf_right (115601 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_right
[       OK ] cuda.batched_scalar_batched_gemm_nt_t_bhalf_bhalf_right (115549 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_right
[       OK ] cuda.batched_scalar_batched_gemm_t_t_bhalf_bhalf_right (115463 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_nt_nt_half_half_left
[       OK ] cuda.batched_scalar_batched_gemm_nt_nt_half_half_left (165243 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_t_nt_half_half_left
[       OK ] cuda.batched_scalar_batched_gemm_t_nt_half_half_left (165299 ms)
[ RUN      ] cuda.batched_scalar_batched_gemm_nt_t_half_half_left
# Timeout here

Reproducer (kokkos-dev-2):

source /projects/sems/modulefiles/utils/sems-archive-modules-init.sh
module load sems-archive-env
module load sems-archive-gcc/7.3.0  sems-archive-clang/10.0.0 sems-archive-cuda/10.1 sems-archive-cmake/3.19.1

$KOKKOSKERNELS_PATH/cm_generate_makefile.bash --with-cuda --compiler=clang++ --arch=Volta70

e10harvey commented 2 years ago

The batched gemm tests do take more cycles than our other unit-tests; I suggest increasing the timeout.
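
To size a larger timeout, one option is to add up the per-test times that GoogleTest already prints. The snippet below is a minimal sketch, not part of the project's tooling: the three sample lines are copied from the log in the original post, and `total_ms` is a name chosen here for illustration. In practice the full test output would be piped in instead of the sample.

```shell
# Sum the wall times (in ms) from GoogleTest "[       OK ]" lines.
# Sample input is taken from the log in this issue; total_ms is an
# illustrative variable name, not something the test harness defines.
total_ms=$(awk '/\[ +OK \]/ {
  t = $(NF-1)          # second-to-last field, e.g. "(104915"
  gsub(/[()]/, "", t)  # strip the parenthesis
  sum += t
} END { print sum }' <<'EOF'
[       OK ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_left (104915 ms)
[       OK ] cuda.batched_scalar_batched_gemm_nt_nt_bhalf_bhalf_right (115381 ms)
[       OK ] cuda.batched_scalar_batched_gemm_nt_nt_half_half_left (165243 ms)
EOF
)
echo "total: ${total_ms} ms"
```

Assuming the tests are driven through CTest, the resulting total (plus some margin) could inform the value passed to `ctest --timeout <seconds>`.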

@ndellingwood: Are the batched gemm timeouts a recent regression or are you running these tests for the first time with clang >= 10 + cuda?

ndellingwood commented 2 years ago

> @ndellingwood: Are the batched gemm timeouts a recent regression or are you running these tests for the first time with clang >= 10 + cuda?

@e10harvey these tests had passed before, though I can't recall a date or SHA to give better info. I'm going to rerun the tests toggling Kokkos_ENABLE_COMPLEX_ALIGN; this had an impact on https://github.com/kokkos/kokkos/issues/5312, so this may be an underlying Kokkos issue. Will post back once I finish testing.

ndellingwood commented 2 years ago

I tested builds with -DKokkos_ENABLE_COMPLEX_ALIGN=ON and -DKokkos_ENABLE_COMPLEX_ALIGN=OFF; the cuda.sparse_spgemm_double_int_size_t_TestExecSpace and cuda.sparse_block_spgemm_double_int_size_t_TestExecSpace tests fail in either case.
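
For anyone repeating the comparison, the two configurations could be set up roughly as follows. This is a dry-run sketch that only prints the configure lines, and it assumes a standalone Kokkos CMake build rather than the cm_generate_makefile.bash flow from the reproducer. `KOKKOS_PATH`, the build directory names, and `configure_lines` are placeholders introduced here.

```shell
# Dry run: print (do not execute) one Kokkos configure line per setting.
# KOKKOS_PATH and the build directory names are placeholders.
KOKKOS_PATH="${KOKKOS_PATH:-/path/to/kokkos}"
configure_lines=""
for align in ON OFF; do
  line="cmake -S ${KOKKOS_PATH} -B build-complex-align-${align} \
-DCMAKE_CXX_COMPILER=clang++ -DKokkos_ENABLE_CUDA=ON \
-DKokkos_ARCH_VOLTA70=ON -DKokkos_ENABLE_COMPLEX_ALIGN=${align}"
  echo "$line"
  # Collect the lines so they can be inspected or replayed later.
  configure_lines="${configure_lines}${line}
"
done
```

Keeping the two variants in separate build directories avoids stale cache entries when flipping the option.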

I rebuilt with -DKokkos_ENABLE_DEBUG=ON and -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON, but there was no additional useful diagnostic info beyond the output in the OP.

ndellingwood commented 2 years ago

Also seeing this warning with this build:

ptxas warning : Unresolved extern variable '_ZN6Kokkos12_GLOBAL__N_13ALLE' in whole program compilation, ignoring extern qualifier

Demangling _ZN6Kokkos12_GLOBAL__N_13ALLE:

[ndellin@kokkos-dev-2 Clang10Cuda101Sems-aligntest]$ c++filt -t _ZN6Kokkos12_GLOBAL__N_13ALLE
Kokkos::(anonymous namespace)::ALL

lucbv commented 2 years ago

Yeah, that happened a bunch with OpenMP Target too. I will ask about it on the Kokkos channel, though I'm not sure it's related. I also have a build going on kokkos-dev-2, so I should be able to assess this problem soon.

ndellingwood commented 2 years ago

@lucbv reproduced the failure, and it has been present since at least the 3.6.00 release; removing the blocker on the 3.7.00 promotion.