kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

`KokkosBlas::Impl::MV_Reciprocal_Generic`: `g++-12` internal compiler failure with `-O3 -march=skylake-avx512` #2091

Open romintomasetti opened 10 months ago

romintomasetti commented 10 months ago

Description of the problem

This is an internal compiler error (ICE) that occurs only with g++-12 when compiling with -O3 -march=skylake-avx512. The same code compiles fine with g++-11 or clang++-17, and also with g++-12 when using -O2 instead of -O3.

I know it should be reported to GCC, but I need your help to make a reproducible example that does not depend on Kokkos.

For now, I copy-pasted the offending code (KokkosBlas::Impl::MV_Reciprocal_Generic and KokkosBlas::Impl::MV_ReciprocalSelf_Functor). This reproduces the error:

g++-12 -std=c++20 -fopenmp /.../test_bug.cpp -I/opt/Trilinos/GNU-OpenMP/include -L/opt/Trilinos/GNU-OpenMP/lib -lkokkoscore -O3 -march=skylake-avx512 -o test_bug_ko
during GIMPLE pass: slp
/.../test_bug.cpp: In function 'void MV_Reciprocal_Generic(const execution_space&, const RMV&) [with execution_space = Kokkos::OpenMP; RMV = Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace> >; size_type = long unsigned int]':
/.../test_bug.cpp:93:6: internal compiler error: in get_vectype_for_scalar_type, at tree-vect-stmts.cc:11609
   93 | void MV_Reciprocal_Generic(const execution_space& space, const RMV& R)
      |      ^~~~~~~~~~~~~~~~~~~~~
0x7fdb4596bd8f __libc_start_call_main
        ../sysdeps/nptl/libc_start_call_main.h:58
0x7fdb4596be3f __libc_start_main_impl
        ../csu/libc-start.c:392
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-12/README.Bugs> for instructions.
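The log above says the failure happens "during GIMPLE pass: slp", which is GCC's SLP (basic-block) vectorizer. One way to check whether that pass is really involved is to switch it off for just the offending function. The following is a minimal sketch of that idea, not something tested against this ICE: `NO_SLP` and `reciprocal_row` are hypothetical names, and `std::complex<double>` stands in for `Kokkos::complex<double>`. It relies on GCC's `optimize` function attribute, which GCC documents as intended for debugging only.

```cpp
#include <complex>
#include <cstddef>

// GCC only: request that the SLP vectorizer be disabled for one function.
// Other compilers (including clang) get an empty macro.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SLP __attribute__((optimize("no-tree-slp-vectorize")))
#else
#define NO_SLP
#endif

// Hypothetical stand-in for the functor body: in-place reciprocal of one row.
NO_SLP void reciprocal_row(std::complex<double>* row, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        row[j] = std::complex<double>{1.0, 0.0} / row[j];
    }
}
```

The command-line equivalent is to add -fno-tree-slp-vectorize to the failing g++-12 invocation. If the ICE disappears with that flag, that would be useful information to include in a GCC bug report.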

I isolated the operation that makes the compilation fail in the macro FAILING_OPERATION. If I keep the division as in the original code, compilation fails. Switching to a multiplication makes it compile again.

I tried to come up with a simpler functor that does not depend on Kokkos, but it always compiles.
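For discussion, here is a rough sketch of what a Kokkos-free candidate could look like. It is not known to trigger the ICE (the attempts described above always compiled). Everything in it is hypothetical: `Cplx` is a stand-in for `Kokkos::complex<double>`, the scaled division is only assumed to resemble what Kokkos does, and `reciprocal_self` mimics a column-major view with a dynamically scheduled OpenMP loop.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for Kokkos::complex<double>; not the Kokkos class.
struct Cplx {
    double re;
    double im;
};

// Division with scaling by |re| + |im| of the divisor to limit overflow.
// Assumption: this resembles, but is not copied from, the Kokkos operator.
inline Cplx operator/(const Cplx& x, const Cplx& y) {
    const double s = std::fabs(y.re) + std::fabs(y.im);
    if (s == 0.0) {
        return Cplx{x.re / s, x.im / s};  // division by zero yields inf/nan
    }
    const Cplx xs{x.re / s, x.im / s};   // scaled numerator
    const Cplx yc{y.re / s, -y.im / s};  // scaled conjugate of the divisor
    const double d = yc.re * yc.re + yc.im * yc.im;
    return Cplx{(xs.re * yc.re - xs.im * yc.im) / d,
                (xs.re * yc.im + xs.im * yc.re) / d};
}

// Column-major storage: element (i, j) lives at data[i + j * rows].
void reciprocal_self(Cplx* data, std::size_t rows, std::size_t cols) {
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            Cplx& x = data[i + j * rows];
            x = Cplx{1.0, 0.0} / x;
        }
    }
}
```

Compiling a file like this with the same flags as the failing command (-std=c++20 -fopenmp -O3 -march=skylake-avx512) would show whether the pattern alone is enough, or whether something else in the Kokkos call path is needed.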

The minimal reproducer code

This is the minimal reproducer I wrote.

#include <concepts>

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wsuggest-override"
#pragma GCC diagnostic ignored "-Wswitch-default"
#include "Kokkos_ArithTraits.hpp"
#include "Kokkos_Core.hpp"
#pragma GCC diagnostic pop

//! This is the operation that causes the internal compiler error when used within a @c Kokkos parallel region. When dividing, it does not compile. Multiplying is OK.
#ifdef MAKE_FAILING_OP_A_SUCCESS
    #define FAILING_OPERATION(__x__) __x__ = Kokkos::complex<double>{1., 0.} * __x__;
#else
    #define FAILING_OPERATION(__x__) __x__ = Kokkos::complex<double>{1., 0.} / __x__;
#endif

/**
 * @brief Functor from https://github.com/kokkos/kokkos-kernels that causes the internal compiler error.
 *
 * See also:
 *  - https://github.com/kokkos/kokkos-kernels/blob/f429f6ecbd73b977c37573f00004228075754129/blas/impl/KokkosBlas1_reciprocal_impl.hpp#L70-L95
 */
template <class RMV, class SizeType = typename RMV::size_type>
struct MV_ReciprocalSelf_Functor {
  typedef SizeType size_type;
  typedef Kokkos::ArithTraits<typename RMV::non_const_value_type> ATS;

  const size_type numCols;
  RMV R_;

  MV_ReciprocalSelf_Functor(const RMV& R) : numCols(R.extent(1)), R_(R) {
    static_assert(Kokkos::is_view<RMV>::value,
                  "KokkosBlas::Impl::"
                  "MV_Reciprocal_Functor: RMV is not a Kokkos::View.");
    static_assert(RMV::rank == 2,
                  "KokkosBlas::Impl::"
                  "MV_Reciprocal_Functor: RMV is not rank 2");
  }

  KOKKOS_INLINE_FUNCTION
  void operator()(const size_type& i) const {
#ifdef KOKKOS_ENABLE_PRAGMA_IVDEP
#pragma ivdep
#endif
    for (size_type j = 0; j < numCols; ++j) {
        FAILING_OPERATION(R_(i, j))
    }
  }
};

/**
 * @brief Simple function that "mimics" an @c OpenMP parallel for loop like @c Kokkos would end up doing (but not exactly as
 *        @c Kokkos would do since this one always works).
 *
 * It does not trigger the internal compiler error.
 *
 * See also:
 *  - https://github.com/kokkos/kokkos/blob/34973c77309cb8af20fae065dea7e44146b22aed/core/src/OpenMP/Kokkos_OpenMP_Parallel_For.hpp#L93-L107
*/
template <typename op_t>
struct SimpleOp
{
    op_t op;

    typename op_t::size_type numCols;

    SimpleOp(op_t op_) : op(std::move(op_)), numCols(op.R_.extent(1)) {}

    void apply() const
    {
        #pragma omp parallel for schedule(static) num_threads(omp_get_num_threads())
        for(typename op_t::size_type i = 0; i < op.R_.extent(0); ++i)
        {
            this->operator()(i);
        }
    }

    KOKKOS_INLINE_FUNCTION
    void operator()(const typename op_t::size_type i) const
    {
        for(typename op_t::size_type j = 0; j < numCols; ++j)
        {
            FAILING_OPERATION(op.R_(i, j))
        }
    }
};

/**
 * @test This test reproduces a compiler error (GCC 12.3.0) when compiling the ETI
 *       for @c KokkosBlas::Impl::MV_Reciprocal_Generic with @c Kokkos::complex<double> and @c Kokkos::OpenMP
 *       only if the @c view_t is @c Kokkos::LayoutLeft.
 *
 * See also:
 *  - https://github.com/kokkos/kokkos-kernels/blob/f429f6ecbd73b977c37573f00004228075754129/blas/impl/KokkosBlas1_reciprocal_impl.hpp#L151-L176
 */
template <class execution_space, class RMV, typename size_type = typename RMV::size_type>
void MV_Reciprocal_Generic(const execution_space& space, const RMV& R)
{
    Kokkos::RangePolicy<execution_space, size_type, Kokkos::Schedule<Kokkos::Dynamic>> policy(space, 0, R.extent(0));

    using op_t = MV_ReciprocalSelf_Functor<RMV, size_type>;
    static_assert(std::is_same_v<typename op_t::ATS, Kokkos::ArithTraits<Kokkos::complex<double>>>);

    op_t op(R);
    Kokkos::parallel_for("KokkosBlas::Reciprocal::S0", policy, op); // This might fail to compile.
}

//! When using @ref MV_Reciprocal_Generic, compilation fails if layout is @c Kokkos::LayoutLeft.
template <typename layout_t, typename execution_space>
void this_might_not_compile(const execution_space& space)
{
    using view_t = Kokkos::View<Kokkos::complex<double>**, layout_t, Kokkos::Device<execution_space, Kokkos::HostSpace>>;

    view_t X(Kokkos::view_alloc(space, "failure"), 2, 2);
    Kokkos::deep_copy(space, X, 1.);

    MV_Reciprocal_Generic(space, X);
}

//! When using @ref SimpleOp, compilation always works.
template <typename layout_t, typename execution_space>
void this_never_fails(const execution_space& space)
{
    using view_t = Kokkos::View<Kokkos::complex<double>**, layout_t, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>>;

    view_t X(Kokkos::view_alloc(space, "failure"), 2, 2);
    Kokkos::deep_copy(space, X, 1.);

    using op_t = MV_ReciprocalSelf_Functor<view_t, typename view_t::size_type>;

    SimpleOp(op_t(X)).apply();
}

void run_test()
{
    //! @c Kokkos::LayoutLeft leads to a compiler error, while @c Kokkos::LayoutRight doesn't.
    using view_LL_t = Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft , Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>>;
    using view_LR_t = Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>>;

    //! Map type of the view.
    using map_LL_t = Kokkos::Impl::ViewMapping<typename view_LL_t::traits, typename view_LL_t::traits::specialize>;
    using map_LR_t = Kokkos::Impl::ViewMapping<typename view_LR_t::traits, typename view_LR_t::traits::specialize>;
    using handle_LL_t = typename map_LL_t::handle_type;
    using handle_LR_t = typename map_LR_t::handle_type;

    static_assert(std::same_as<handle_LL_t, Kokkos::complex<double>*>);
    static_assert(std::same_as<handle_LR_t, Kokkos::complex<double>*>);

    this_never_fails<Kokkos::LayoutLeft >(Kokkos::OpenMP{});
    this_never_fails<Kokkos::LayoutRight>(Kokkos::OpenMP{});

    this_might_not_compile<Kokkos::LayoutLeft >(Kokkos::OpenMP{});
    this_might_not_compile<Kokkos::LayoutRight>(Kokkos::OpenMP{});
}

int main(int argc, char* argv[])
{
    Kokkos::initialize(argc, argv);

    run_test();

    Kokkos::finalize();

    return EXIT_SUCCESS;
}
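The reproducer only fails for Kokkos::LayoutLeft. As a reminder of what differs between the two layouts: LayoutLeft is column-major (the leftmost index has stride 1) and LayoutRight is row-major, so the inner loop over j in the functor steps through memory with a stride of extent(0) under LayoutLeft but contiguously under LayoutRight. The sketch below spells this out with two hypothetical helper functions, assuming an unpadded allocation. Whether this stride difference is what sends the vectorizer down the failing path is only a guess.

```cpp
#include <cstddef>

// Hypothetical helpers, not part of Kokkos: flat offset of element (i, j)
// in an n0 x n1 array without padding.
constexpr std::size_t offset_left(std::size_t i, std::size_t j,
                                  std::size_t n0, std::size_t /*n1*/) {
    return i + j * n0;  // column-major: i is the fast index
}

constexpr std::size_t offset_right(std::size_t i, std::size_t j,
                                   std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;  // row-major: j is the fast index
}

// For the 2 x 2 views of the reproducer, stepping j by one moves by
// 2 elements in the column-major case and by 1 in the row-major case.
static_assert(offset_left(0, 1, 2, 2) - offset_left(0, 0, 2, 2) == 2,
              "column-major: stride of j equals n0");
static_assert(offset_right(0, 1, 2, 2) - offset_right(0, 0, 2, 2) == 1,
              "row-major: stride of j equals 1");
```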

Script for compiling

This is the bash script used to compile the code.

set -ex

KOKKOS_INCLUDE_DIR=/opt/Trilinos/GNU-OpenMP/include
KOKKOS_LIBS_DIR=/opt/Trilinos/GNU-OpenMP/lib

DIR=$PWD/tests/utils

# Works with gcc-12, either skylake-avx512 without -O3 or other instruction sets with -O3.
g++-12 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O1 -march=skylake-avx512 -o test_bug_ok

g++-12 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O2 -march=skylake-avx512 -o test_bug_ok

g++-12 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O3 -march=znver3         -o test_bug_ok

g++-12 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O3 -march=haswell        -o test_bug_ok

g++-12 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O3 -march=skylake        -o test_bug_ok

# Works with gcc-11, skylake-avx512 and -O3.
g++-11 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O3 -march=skylake-avx512 -o test_bug_ok

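# Internal compiler error only for the division operator: the first command (multiplication, via -DMAKE_FAILING_OP_A_SUCCESS) compiles, the second (division) fails.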
g++-12 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O3 -march=skylake-avx512 -o test_bug_ko -DMAKE_FAILING_OP_A_SUCCESS
g++-12 -std=c++20 -fopenmp $DIR/test_bug.cpp -I$KOKKOS_INCLUDE_DIR -L$KOKKOS_LIBS_DIR -lkokkoscore -O3 -march=skylake-avx512 -o test_bug_ko

Mentioning @kliegeois because you worked on the same machine in the past :wink: (and @maartenarnst of course)

romintomasetti commented 10 months ago

Please note that this might be a Kokkos issue (@crtrott @dalg24 @masterleinad). I would have guessed that such bugs would be detected by the Kokkos CI/CD...

lucbv commented 10 months ago

Hmm, CI/CD cannot test every combination of compiler, flags and optimization level, so this sort of thing can always happen. In my opinion this is more of a Kokkos Core issue, since the usage (taking a division) is a supported feature; see Kokkos_Complex.hpp, line 919 or so.

maartenarnst commented 10 months ago

Just to follow up, I've just tested this with a few other flags

In each case, we hit the same gcc bug as with -march=skylake-avx512.

The fact that it happens for the newest architecture (sapphire rapids) could be a justification to try to forward this as an issue to the gcc developers.

Should we proceed with @romintomasetti's reproducer from the description? Or do you have a suggestion for how we could eliminate the dependency on Kokkos?