kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

Internal compiler error in KokkosBatched::Experimental::TeamGemm #349

Closed huttered40 closed 5 years ago

huttered40 commented 5 years ago

I am getting an internal compiler error when compiling code that calls KokkosBatched::Experimental::TeamGemm on the White machine (rhel7G queue). The GCC compiler version is 7.2.0; I tried 6.4.0 as well, and both show the same issue. The error does not occur on Bowman with GCC 4.9.3. Most of the stack trace is posted below:

../kokkos-kernels/src/batched/KokkosBatched_Gemm_Team_Internal.hpp:137:27: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1705
             const int i = (ij/nq)*mb;
               ~~~~~~~~~~~~^~~~~~
0x102ebfe3 maybe_undo_parenthesized_ref(tree_node*)
    ../.././gcc/cp/semantics.c:1704
0x1034eacf cp_fold
    ../.././gcc/cp/cp-gimplify.c:2141
0x1034f8b7 cp_fold_maybe_rvalue
    ../.././gcc/cp/cp-gimplify.c:2003
0x1034e5b7 cp_fold
    ../.././gcc/cp/cp-gimplify.c:2110
0x1034f8b7 cp_fold_maybe_rvalue
    ../.././gcc/cp/cp-gimplify.c:2003
0x1034e27f cp_fold_rvalue
    ../.././gcc/cp/cp-gimplify.c:2024
0x1034e27f cp_fold
    ../.././gcc/cp/cp-gimplify.c:2242
0x102ba7db cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, int)
    ../.././gcc/cp/typeck.c:5243
0x101b430f build_new_op_1
    ../.././gcc/cp/call.c:5982
0x101b4eff build_new_op(unsigned int, tree_code, int, tree_node*, tree_node*, tree_node*, tree_node**, int)
    ../.././gcc/cp/call.c:6026
0x102af247 build_x_binary_op(unsigned int, tree_code, tree_node*, tree_code, tree_node*, tree_code, tree_node**, int)
    ../.././gcc/cp/typeck.c:3928
0x10206f33 tsubst_copy_and_build(tree_node*, tree_node*, int, tree_node*, bool, bool)
    ../.././gcc/cp/pt.c:16937
0x101f5edf tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
    ../.././gcc/cp/pt.c:16550
0x101f79c7 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
    ../.././gcc/cp/pt.c:15786
0x101f79c7 tsubst_init
    ../.././gcc/cp/pt.c:14483
0x101f6edf tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
    ../.././gcc/cp/pt.c:15907
0x101f489b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
    ../.././gcc/cp/pt.c:15801
0x101f4b13 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
    ../.././gcc/cp/pt.c:16027
0x101f489b tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
    ../.././gcc/cp/pt.c:15801
0x101f4b13 tsubst_expr(tree_node*, tree_node*, int, tree_node*, bool)
    ../.././gcc/cp/pt.c:16027
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

mhoemmen commented 5 years ago

@huttered40 I edited your post just to avoid undesired Markdown formatting in the compiler output. In the future, please enclose verbatim text in triple backticks. Thanks!

ndellingwood commented 5 years ago

@huttered40 could you post more info about your configuration and build? I was able to build kokkos-kernels on the pascal queue (rhel7G) on White. Here is my setup:

Tested with VOTD develop branch of kokkos and kokkos-kernels:

kokkos SHA: kokkos/kokkos@b18689e41716b7cb8d3f30e637f3ac500756f4cc

kokkos-kernels SHA: b26f4461655bd64b374827c1858b0ee2d9aa7219

Modules: module load devpack/20180521/openmpi/2.1.2/gcc/7.2.0/cuda/9.2.88

Configuration: I have kokkos and kokkos-kernels cloned to my $HOME directory.

KOKKOS_PATH=${HOME}/kokkos #path to kokkos source
KOKKOSKERNELS_SCALARS=double #the scalar types to instantiate =double,float...
KOKKOSKERNELS_LAYOUTS=LayoutLeft,LayoutRight  #the layout types to instantiate.
KOKKOSKERNELS_ORDINALS=int #ordinal types to instantiate
KOKKOSKERNELS_OFFSETS=int #offset types to instantiate
KOKKOSKERNELS_PATH=../.. #path to kokkos-kernels top directory.
CXX=${KOKKOS_PATH}/bin/nvcc_wrapper #icpc #
KOKKOSKERNELS_OPTIONS=eti-only #options for kokkoskernels  
KOKKOS_DEVICES="Cuda,Serial"
KOKKOS_ARCHS="Power8,Pascal60"
CXXFLAGS="-pedantic -O3 -g -Wshadow -Wsign-compare -Wtype-limits -Wuninitialized"

../../scripts/generate_makefile.bash --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=${KOKKOSKERNELS_SCALARS} --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --kokkos-path=${KOKKOS_PATH} --with-devices=${KOKKOS_DEVICES} --arch=${KOKKOS_ARCHS} --compiler=${CXX} --with-options=${KOKKOSKERNELS_OPTIONS}  --cxxflags="${CXXFLAGS}"

Interactive node session: bsub -Is -n 1 -q rhel7G bash

Build library then tests:

make install-lib -j16
cd unit_test
make -j

huttered40 commented 5 years ago

Interactive node session: bsub -Is -q rhel7G -n 32 bash

Module: devpack/20180521/openmpi/3.1.0/gcc/7.2.0/cuda/9.2.88

Kokkos: branch develop, most recent commit hash b18689e

Kokkos-kernels: branch develop, most recent commit hash b26f446

Relevant part of Makefile:

CXXFLAGS = -O3 --expt-extended-lambda --expt-relaxed-constexpr# -std=c++14
KOKKOS_CXX_STANDARD=c++14               # Currently only works when using the develop branch of kokkos
LINK = ${CXX}
LDFLAGS =
EXE = test.cuda
KOKKOS_DEVICES = "Cuda"
KOKKOS_ARCH = "Power8,Pascal60" # For rhel-7G queue on White
KOKKOS_CUDA_OPTIONS += "enable_lambda"

My application is calling KokkosBatched::Experimental::TeamGemm<TransposeAType,TransposeBType,GemmAlgType>::invoke(...)

The error again is:

../kokkos-kernels/src/batched/KokkosBatched_Gemm_Team_Internal.hpp:136:27: internal compiler error: in maybe_undo_parenthesized_ref, at cp/semantics.c:1705
             const int i = ij/nq*mb, j = ij%nq*nb;

ndellingwood commented 5 years ago

@huttered40 if you modify the way to generate your makefile like below it should work (it worked for me) - use the --with-cuda-options argument to set enable_lambda (this takes care of --expt-extended-lambda) and set KOKKOS_CXXFLAGS="--expt-relaxed-constexpr"

KOKKOS_PATH=${HOME}/kokkos #path to kokkos source
KOKKOSKERNELS_SCALARS=double #the scalar types to instantiate =double,float...
KOKKOSKERNELS_LAYOUTS=LayoutLeft,LayoutRight  #the layout types to instantiate.
KOKKOSKERNELS_ORDINALS=int #ordinal types to instantiate
KOKKOSKERNELS_OFFSETS=int #offset types to instantiate
KOKKOSKERNELS_PATH=../.. #path to kokkos-kernels top directory.
CXX=${KOKKOS_PATH}/bin/nvcc_wrapper #icpc #
KOKKOS_CXX_STANDARD=c++14
KOKKOS_CXXFLAGS="--expt-relaxed-constexpr"
KOKKOSKERNELS_OPTIONS=eti-only #options for kokkoskernels  
KOKKOS_DEVICES="Cuda,Serial"
KOKKOS_ARCHS="Power8,Pascal60"
KOKKOS_CUDA_OPTION="enable_lambda" #"enable_lambda,force_uvm,rdc"
CXXFLAGS="-pedantic -O3 -g -Wshadow -Wsign-compare -Wtype-limits -Wuninitialized"

../../scripts/generate_makefile.bash --kokkoskernels-path=${KOKKOSKERNELS_PATH} --with-scalars=${KOKKOSKERNELS_SCALARS} --with-ordinals=${KOKKOSKERNELS_ORDINALS} --with-offsets=${KOKKOSKERNELS_OFFSETS} --kokkos-path=${KOKKOS_PATH} --with-devices=${KOKKOS_DEVICES} --arch=${KOKKOS_ARCHS} --compiler=${CXX} --with-options=${KOKKOSKERNELS_OPTIONS}  --cxxflags="${CXXFLAGS}" --with-cuda-options=${KOKKOS_CUDA_OPTION}

ndellingwood commented 5 years ago

@huttered40 oops, I hadn't properly set KOKKOS_CXX_STANDARD=c++14; when I did that I saw your failure. Cross-referencing your PR with the fix here: #350

kyungjoo-kim commented 5 years ago

I don't think that we can handle the compiler error. The code is header-only and is compiled within your code. It is a very unlucky case, but I don't think that we can give much help for this compiler error.

ndellingwood commented 5 years ago

Probably have to grind this down to a reproducer for Nvidia since c++14 should be supported...

kyungjoo-kim commented 5 years ago

@ndellingwood Does kokkos officially support c++14?

ndellingwood commented 5 years ago

@kyungjoo-kim good point, there isn't nightly testing with c++14 enabled so we shouldn't claim it is officially supported. I put in PR kokkos/kokkos#1913 so we can enable c++14 through generated makefiles and begin testing.

srajama1 commented 5 years ago

I am reopening this. We have multiple requests to support C++14. It doesn't have to be every version of every compiler with C++14 support as this is evolving. However, we do have to support gcc 7.2. Trilinos is moving the PR testing to gcc 7.2 very soon.

ndellingwood commented 5 years ago

Adding @crtrott he said he'd also help look into this.

crtrott commented 5 years ago

Apparently fixed in GCC 7.3: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=882855

At least the last 4 entries of the call stack are the same:

Debian Bug:

0x102ebfe3 maybe_undo_parenthesized_ref(tree_node*)
    ../.././gcc/cp/semantics.c:1704
0x1034eacf cp_fold
    ../.././gcc/cp/cp-gimplify.c:2141
0x1034f8b7 cp_fold_maybe_rvalue
    ../.././gcc/cp/cp-gimplify.c:2003
0x1034e5b7 cp_fold
    ../.././gcc/cp/cp-gimplify.c:2110
0x1022385b store_init_value(tree_node*, tree_node*, vec<tree_node*, va_gc, vl_embed>**, int)
    ../.././gcc/cp/typeck2.c:841

KokkosKernels:

0x102ebfe3 maybe_undo_parenthesized_ref(tree_node*)
    ../.././gcc/cp/semantics.c:1704
0x1034eacf cp_fold
    ../.././gcc/cp/cp-gimplify.c:2141
0x1034f8b7 cp_fold_maybe_rvalue
    ../.././gcc/cp/cp-gimplify.c:2003
0x1034e5b7 cp_fold
    ../.././gcc/cp/cp-gimplify.c:2110
0x102ba7db cp_build_binary_op(unsigned int, tree_code, tree_node*, tree_node*, int)
    ../.././gcc/cp/typeck.c:5243

nmhamster commented 5 years ago

What's the chance of a work-around for GCC 7.2?

crtrott commented 5 years ago

Looking into it. My guess is that the chances are pretty good that we can work around this. The compiler gets tripped up in something related to figuring out whether something is an rvalue or so. So adding some parentheses, explicit casts, using a temporary instead of computing the value inline, etc. may avoid the trigger.

crtrott commented 5 years ago

OK, found two options for this original code. The offending thing is capturing idx_j by reference in the innermost layer, where part of idx_j comes from the argument to another inlined lambda.

      Kokkos::parallel_for(Kokkos::TeamThreadRange(team,blockDim_j), [&] (const int j) {
        const int idx_j = offset_j+j;
        Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,blockDim_i), [&] (const int i) {
          const int idx_i = offset_i+i;
          A_scr(i,j) = idx_i<A.extent_int(0) && idx_j<A.extent_int(1) ? A(idx_i,idx_j) : ATV::zero();
        });
      });

Option 1: capture by value in the innermost lambda:

      Kokkos::parallel_for(Kokkos::TeamThreadRange(team,blockDim_j), [&] (const int j) {
        const int idx_j = offset_j+j;
        Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,blockDim_i), [=] (const int i) {
          const int idx_i = offset_i+i;
          A_scr(i,j) = idx_i<A.extent_int(0) && idx_j<A.extent_int(1) ? A(idx_i,idx_j) : ATV::zero();
        });
      });

Option 2: move the offset calculation into the innermost loop:

      Kokkos::parallel_for(Kokkos::TeamThreadRange(team,blockDim_j), [&] (const int j) {
        Kokkos::parallel_for(Kokkos::ThreadVectorRange(team,blockDim_i), [&] (const int i) {
          const int idx_j = offset_j+j;
          const int idx_i = offset_i+i;
          A_scr(i,j) = idx_i<A.extent_int(0) && idx_j<A.extent_int(1) ? A(idx_i,idx_j) : ATV::zero();
        });
      });

My guess is that the second option is better. In any case we can ifdef this with C++ standard and GCC version.
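
A minimal, Kokkos-free sketch of what such a guard could look like. Plain loops stand in for Kokkos::parallel_for, and the names DEMO_GCC_CXX14_WORKAROUND, for_each_index and fill_scratch are hypothetical; the version cutoff (GCC 7.2 and older, C++14 or newer) is an assumption for illustration, not a tested boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical guard: enable the workaround for GCC <= 7.2 in C++14 mode or newer.
// The cutoff is an assumption made for this sketch.
#if defined(__GNUC__) && !defined(__clang__) && (__cplusplus >= 201402L)
#if (__GNUC__ * 100 + __GNUC_MINOR__ * 10 + __GNUC_PATCHLEVEL__) <= 720
#define DEMO_GCC_CXX14_WORKAROUND
#endif
#endif

// Stand-in for a parallel_for over a range: calls f(0), ..., f(n-1) serially.
template <class F>
void for_each_index(int n, const F& f) {
  for (int k = 0; k < n; ++k) f(k);
}

// Copies a block_i x block_j tile of the row-major matrix A (rows x cols),
// starting at (offset_i, offset_j), into a scratch buffer; entries that fall
// outside A are set to zero.
std::vector<double> fill_scratch(const std::vector<double>& A, int rows, int cols,
                                 int offset_i, int offset_j,
                                 int block_i, int block_j) {
  assert(static_cast<int>(A.size()) == rows * cols);
  std::vector<double> scr(static_cast<std::size_t>(block_i) * block_j, 0.0);
  for_each_index(block_j, [&](const int j) {
#ifndef DEMO_GCC_CXX14_WORKAROUND
    // Original form: idx_j lives in the outer lambda and is captured
    // by reference in the inner one.
    const int idx_j = offset_j + j;
#endif
    for_each_index(block_i, [&](const int i) {
#ifdef DEMO_GCC_CXX14_WORKAROUND
      // Workaround form (option 2): compute idx_j inside the innermost lambda.
      const int idx_j = offset_j + j;
#endif
      const int idx_i = offset_i + i;
      scr[i * block_j + j] =
          (idx_i < rows && idx_j < cols) ? A[idx_i * cols + idx_j] : 0.0;
    });
  });
  return scr;
}
```

Both branches compute the same values, so the guard only changes where idx_j is declared, not the result.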

crtrott commented 5 years ago

Btw. this applies to all similar places in the code: 91, 118, 145, ...

kyungjoo-kim commented 5 years ago

@crtrott Is the original code still legal under the C++ standard (nesting two lambdas and capturing values by reference)? I have many places that follow this same pattern.

crtrott commented 5 years ago

This is legal C++ (depending on what you do it might not be legal Kokkos though: remember the code must be valid when capturing by value, but capturing by reference may get better performance).
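
To illustrate the "must be valid when capturing by value" rule, here is a small sketch without Kokkos. ToyView is a hypothetical stand-in for a view type whose copies are shallow (it only holds a pointer), and run is a hypothetical serial stand-in for a parallel dispatch; neither is a Kokkos API.

```cpp
// Hypothetical view-like handle: copying it copies the pointer, not the data,
// and element access is allowed through a const handle.
struct ToyView {
  double* ptr;
  int n;
  double& operator()(int i) const { return ptr[i]; }
};

// Hypothetical stand-in for a parallel dispatch: calls f(0), ..., f(n-1).
template <class F>
void run(int n, const F& f) {
  for (int i = 0; i < n; ++i) f(i);
}

// Valid with capture by value: the closure holds a (const) copy of the handle,
// and writes still reach the shared data through the pointer.
void scale_by_value(ToyView v, double alpha) {
  run(v.n, [=](const int i) { v(i) *= alpha; });
}

// The same body with capture by reference gives the same result here.
void scale_by_reference(ToyView v, double alpha) {
  run(v.n, [&](const int i) { v(i) *= alpha; });
}
```

By contrast, a body such as `++count;` on a local `int count` compiles with `[&]` but not with `[=]`, because a by-value capture is read-only in a non-mutable lambda. Code of that kind is legal C++ but does not satisfy the rule above.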

kyungjoo-kim commented 5 years ago

I also prefer the second option. Anyway, you are a magician. How do you know that the compiler problem is due to capturing values by reference?

crtrott commented 5 years ago

If you look at the call stack, the function names indicate that it tries to optimize away expressions (fold), then tries to figure out whether something is an rvalue, and then crashes when it tries to optimize some reference access inside parentheses. These are all just educated guesses, but it looks like I guessed right ;-).

crtrott commented 5 years ago

Ah I am working on the proper fix and will issue a pull request.

kyungjoo-kim commented 5 years ago

thanks.

crtrott commented 5 years ago

Found a couple more places which could be resolved by making temporaries non-const ... I didn't ifdef those but put a comment in.

crtrott commented 5 years ago

If somebody can run all the testing that would be great. Gotta get some other stuff done now.

ndellingwood commented 5 years ago

Cross-reference #357 PR by @crtrott

srajama1 commented 5 years ago

The fix is in develop now.

aprokop commented 4 years ago

Just wanted to let you know that I still got an internal compiler error with gcc-7.4.0 on Summit:

           A_scr(i,j) = idx_i<A.extent_int(0) && idx_j<A.extent_int(1) ? A(idx_i,idx_j) : ATV::zero();

The following fixed it for me:

diff --git a/packages/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_impl.hpp b/packages/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_impl.hpp
index e68d031..da8a6a6 100644
--- a/packages/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_impl.hpp
+++ b/packages/kokkos-kernels/src/blas/impl/KokkosBlas3_gemm_impl.hpp
@@ -48,7 +48,7 @@

 #ifdef KOKKOS_ENABLE_CXX14
 #ifdef KOKKOS_COMPILER_GNU
-#if KOKKOS_COMPILER_GNU<=720
+#if KOKKOS_COMPILER_GNU<=740
 #define KOKKOS_IMPL_BATCHED_GEMM_GCC_CXX14_WORKAROUND
 #endif
 #endif
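
For readers unfamiliar with the three-digit numbers in the guard: the sketch below assumes the version is encoded as major*100 + minor*10 + patch (which is my understanding of how KOKKOS_COMPILER_GNU is built from GCC's predefined macros, but treat it as an assumption). The function names are hypothetical.

```cpp
// Assumed encoding of a GCC version into one integer, e.g. 7.4.0 -> 740.
constexpr int encode_gcc_version(int major, int minor, int patch) {
  return major * 100 + minor * 10 + patch;
}

// True when a compiler with the given encoded version falls under a guard
// of the form "version <= last_affected".
constexpr bool needs_workaround(int version, int last_affected) {
  return version <= last_affected;
}

// With the old cutoff of 720, GCC 7.4.0 is not covered; with 740 it is.
static_assert(!needs_workaround(encode_gcc_version(7, 4, 0), 720),
              "7.4.0 is above the 720 cutoff");
static_assert(needs_workaround(encode_gcc_version(7, 4, 0), 740),
              "7.4.0 is covered by the 740 cutoff");
```

Under this encoding, raising the cutoff to 740 also covers every 7.3.x release, since those encode to values between 730 and 739.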

srajama1 commented 4 years ago

Strange ... Did someone actually report the error? @ndellingwood can we put a patch in?

ndellingwood commented 4 years ago

@srajama1 yeah, let me test this more carefully to confirm this also works with gcc/7.3; white has gcc/7.4 available but was not yet added to test_all_sandia which is how this slipped through, I'll make sure to get this coverage added as well.

ndellingwood commented 4 years ago

I wasn't able to reproduce the issue testing with gcc/7.4 in a serial build on White with c++14 support enabled. Here are my generated makefile options:

Generating Makefiles with options
CXX=/home/projects/ppc64le/gcc/7.4.0/bin/g++
KOKKOS_DEVICES=Serial
KOKKOS_ARCH=Power8
CXXFLAGS="-O3 -Werror -Wall -Wshadow -pedantic -Wsign-compare -Wtype-limits -Wignored-qualifiers -Wempty-body -Wclobbered -Wuninitialized "
KOKKOS_CXX_STANDARD="c++14"
LDFLAGS="-O3 "
GTEST_PATH=/ascldap/users/ndellin/kokkos/tpls/gtest
KOKKOSKERNELS_OPTIONS=eti-only,blas-mangle_
KOKKOS_PATH=/ascldap/users/ndellin/kokkos
KOKKOSKERNELS_PATH=/ascldap/users/ndellin/kokkos-kernels

I'll test with the changes suggested by @aprokop next.

ndellingwood commented 4 years ago

The same test config passed with the suggested change. I'll test more completely and then put in the PR with the change and updated scripts to make sure gcc/7.4 is also tested.

aprokop commented 4 years ago

Of note, it was part of the Trilinos config, and I used -DCMAKE_CXX_STANDARD=14 and did not specify any other cxx11-related flags (like -DTrilinos_CXX11_FLAGS).