kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels
Other
311 stars 98 forks source link

Compact batched BLAS performance tests need update #216

Closed srajama1 closed 6 years ago

srajama1 commented 6 years ago

The Intel API for compact batched BLAS has changed. It is better to update these tests before the next release. The issue shows up only when used with Intel MKL TPL.

srajama1 commented 6 years ago

@kyungjoo-kim @ndellingwood This should have blocked promotion. I am not sure how we let it slide.

ndellingwood commented 6 years ago

I'm working on this

kyungjoo-kim commented 6 years ago

@ndellingwood If you want me to do, I have a little time today to do this. Does this block the promotion ? or this update will be merged after the promotion of kokkos ? I just don't want to do redundant work if you already start it.

ndellingwood commented 6 years ago

Hi @kyungjoo-kim I started on this a couple days ago but worked on some kokkos promotion stuff yesterday and did not complete. I'll summarize a couple things I did so far and the issues that I ran into when I paused - if it looks like I am on the right track I can work more on this today also, but if you think the changes I have made are more work than necessary then it would be better for you to take over since you know the MKL interface, changes and the test codes better. Next message will have the info.

ndellingwood commented 6 years ago

@kyungjoo-kim These files were pretty straightforward to update:

 M src/batched/KokkosBatched_Gemm_Serial_Impl.hpp
 M src/batched/KokkosBatched_Gemv_Serial_Impl.hpp
 M src/batched/KokkosBatched_LU_Serial_Impl.hpp
 M src/batched/KokkosBatched_Trsm_Serial_Impl.hpp
 M src/batched/KokkosBatched_Trsv_Serial_Impl.hpp
 M src/batched/KokkosBatched_Util.hpp

The changes necessary were:

  1. convert cblas_*_compact calls to mkl_*_compact

  2. convert Cblas* types to proper MKL types: CblasRowMajor -> MKL_ROW_MAJOR, CblasColMajor -> MKL_COL_MAJOR CblasLeft -> MKL_LEFT, CblasRight -> MKL_RIGHT CblasUpper ->MKL_UPPER, CblasLower ->MKL_LOWER CblasTrans -> MKL_TRANS, CblasNoTrans -> MKL_NOTRANS CblasUnit -> MKL_UNIT, CblasNonUnit -> MKL_NONUNIT

  3. Remove/comment out unused value_type types (error when -Werror used)

  4. Include mkl.h in the Util file when this condition is true:

    #if __INTEL_MKL__ >= 18
    #define __KOKKOSBATCHED_INTEL_MKL_BATCHED__ 1
    #define __KOKKOSBATCHED_INTEL_MKL_COMPACT_BATCHED__ 1

These changes were straightforward and I can commit and put in PR with these changes to work off of. The perf_test's are where I'm having more issues, will post on that next.

ndellingwood commented 6 years ago

Several compilation issues with these perf_tests:

 M perf_test/batched/KokkosBatched_Test_Gemm_Host.hpp
 M perf_test/batched/KokkosBatched_Test_LU_Host.hpp
 M perf_test/batched/KokkosBatched_Test_Trsm_Host.hpp

Issues:

  1. There are if-else statements based on 'value_type` to select double vs complex cblas calls, e.g.
    if (std::is_same<value_type,double>::value) {
                    cblas_dgemm(...);
    } else if  (std::is_same<value_type,Kokkos::complex<double> >::value) {
                    cblas_zgemm(...);
    }

    This causes compilation issues because, even though the runtime logic of the if-else is correct, both code paths must be compiled and the different function signatures + types results in errors.

I started working on this by creating static member functions within structs specialized on the value_type, for example (function signatures a bit messed up I need to finish fixing...):


        template < typename ScalarType >
         struct CompileTimeTypeHelper;

         template <>
         struct CompileTimeTypeHelper< double >
         {
           template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13 >
           static void call_cblas_gemm( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13 ) {
             cblas_dgemm( arg0, arg1, arg2, arg3, arg4, arg5, arg6, (double*)arg7.data(), arg8.stride_0(), (double*)arg9.data(), arg10.stride_0(), arg11, (double*)arg12.data(), arg13.stride_0() );
           }

           template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13, typename Arg14, typename Arg15 >
           static void call_cblas_gemm_batch( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13, Arg14 arg14, Arg15 arg15 ) {
             cblas_dgemm_batch( arg0, arg1, arg2, arg3, arg4, arg5, arg6, (const double**)arg7, arg8, (const double**)arg9, arg10, arg11, arg12, arg13, arg14, arg15 );
           }

           template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6 >
           static void call_cblas_gemm_compute_batch( Arg0 &arg0, Arg1 &arg1, Arg2 &arg2, Arg3 &arg3, Arg4 &arg4, Arg5 &arg5, Arg6 &arg6 ) {

             cblas_dgemm_compute_batch( arg0, arg1, arg2, &arg3, &arg4, arg5, &arg6 );
           }
         };

         template <>
         struct CompileTimeTypeHelper< Kokkos::complex<double> >
         {
           template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13 >
           static void call_cblas_gemm( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13 ) {
             cblas_zgemm( arg0, arg1, arg2, arg3, arg4, arg5, (void*)&arg6, (void*)arg7.data(), arg8.stride_0(), (void*)arg9.data(), arg10.stride_0(), (void*)&arg11, (void*)arg12.data(), arg13.stride_0() );
           }

           template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13, typename Arg14, typename Arg15 >
           static void call_cblas_gemm_batch( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13, Arg14 arg14, Arg15 arg15 ) {
             cblas_zgemm_batch( arg0, arg1, arg2, arg3, arg4, arg5, arg6, (const void**)arg7, arg8, (const void**)arg9, arg10, arg11, (void**)arg12, arg13, arg14, arg15 );
           }

           template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6 >
           static void call_cblas_gemm_compute_batch( Arg0 &arg0, Arg1 &arg1, Arg2 &arg2, Arg3 &arg3, Arg4 &arg4, Arg5 &arg5, Arg6 &arg6 ) {

             cblas_zgemm_compute_batch( arg0, arg1, arg2, &arg3, &arg4, arg5, &arg6 );
           }
         };

and then replacing the calls in the if-else statements with a call to the static member function, e.g. replace cblas_dgemm(...); or cblas_zgemm(...); with CompileTimeTypeHelper<value_type>::call_cblas_gemm(...); etc.

  1. The N value in the following array declarations was considered variable, so I made changes: value_type *aa[N*VectorLength] to value_type **aa = new value_type*[N*VectorLength]; etc.

  2. The types for variables assigned to a view's stride caused signed/unsigned comparison issues, so I added static_casts, e.g. MKL_INT lda[1] = { static_cast<MKL_INT>(a.stride_1()) };

  3. There was no type compact_t defined and I could not find one in KokkosKernels or the 2018 MKL headers, so I started working creating a type to get past compilation errors, but no idea if this is the right thing to do...

          struct compact_t {
            CBLAS_LAYOUT layout;
            MKL_INT* rows;
            MKL_INT* cols;
            MKL_INT* stride;
            MKL_INT group_count;
            MKL_INT* size_per_group;
            MKL_INT format;
            double* mat;
          };
ndellingwood commented 6 years ago

The errors I left off at may have been self-inflicted, for example -

/ascldap/users/ndellin/kokkos-kernels/perf_test/batched/KokkosBatched_Test_LU_Host.hpp:215:43: error: ‘LAPACKE_dgetrf_compute_batch’ was not declared in this scope
               LAPACKE_dgetrf_compute_batch(&A_p);

This may be due to my attempt to create a compact_t type that is not recognized by the routine, unless the API for that routine changed.

Also, similar errors with cblas_dgemm_compute_batch for example, but again likely due to my needing to finish fixing the signatures in the static member functions of the struct.

kyungjoo-kim commented 6 years ago

@ndellingwood I think that you are going well to a right direction.

  1. regarding to the if statement based on the value type should be fine. when you call mkl blas, convert value_type pointer into what ever MKL prefer (MKL_Complex16*).

  2. MKL changes function signature and it may not need the compact_t anymore. Please see the following article.

https://software.intel.com/en-us/articles/intelr-math-kernel-library-introducing-vectorized-compact-routines

ndellingwood commented 6 years ago

@kyungjoo-kim thanks for the feedback! @kyungjoo-kim @srajama1 I am having compiler issues in the above perf_tests with calls of the type *_compute_batch(...);, was this an experimental API name and if so do you know what it should be replaced by? I cannot find any types of *_compute_batch(...) functions in 2017 or 2018 MKL, and google search brings up PR #130 when added to KokkosKernels, e.g. cblas_dgemm_compute_batch.

kyungjoo-kim commented 6 years ago

@ndellingwood if you let me know which branch I should work on, I can change and test that in an hour.

ndellingwood commented 6 years ago

@kyungjoo-kim I put in PR #243 with the src file changes. This was pushed to the kokkos-kernels repo so it should be open to modifications directly. Would you like me to push the changes I worked on for the perf_test or will that be messy clutter?

kyungjoo-kim commented 6 years ago

Oh... I see that. I got the right PR but I first looked at the perf_test code instead of the source directory.

ndellingwood commented 6 years ago

@kyungjoo-kim you're probably pretty far along already, but I did a bit more looking in the mkl headers and saw the following in case its relevant -

It looks like in the *_compute_batch calls in the perf tests, the suffix can be changed to _batch and the compact_t type removed and instead pass each individual component from that type to the functions. The LAPACKE_dgetrf_compact call in KokkosBatched_LU_Serial_Impl.hpp may need to be replaced by mkl_dgetrfnp_compact, but that will require the vector length argument be replaced by the proper MKL_COMPACT_PACK type, which depends on the type of vector intrinsics for the arch compiled for which may require a bit more if-guards.

Edit: the MKL_COMPACT_PACK comment above applies to all the _compact MKL routines used in the KokkosKernels interface.

ndellingwood commented 6 years ago

@kyungjoo-kim any progress? I worked on some additional fixes for the src files but did not resolve all the perf_test issues. I can push if you haven't had time to work on this, but did not want to cause any conflicts. @srajama1 all the issues haven't been resolved yet, will add this fix as a patch to Trilinos and KokkosKernels since it does not require or impact the full integration tests.

kyungjoo-kim commented 6 years ago

@ndellingwood You can push. As you describe, the mkl interface changed quite a lot. I am also working but not a lot.

srajama1 commented 6 years ago

@ndellingwood @kyungjoo-kim Let us not hold 2.7 for this, but let us fix this asap after 2.7 goes in. MKL is needed for other things (SpMV and SpGEMM) and now the code cannot be compiled with MKL now.

ndellingwood commented 6 years ago

@srajama1 sounds good, we can patch Trilinos and since this is only for MKL changes it does not require the full integration testing.

crtrott commented 6 years ago

Siva the 2.7 release is already out ...

srajama1 commented 6 years ago

@crtrott : Saw that after typing that comment. Good you came to the same conclusion, this is a performance test so it is ok you released with it. However, using MKL is an important use case, let us get it fixed.

srajama1 commented 6 years ago

PR #243 fixed this.