Closed srajama1 closed 6 years ago
@kyungjoo-kim @ndellingwood This should have blocked promotion. I am not sure how we let it slide.
I'm working on this
@ndellingwood If you want me to do, I have a little time today to do this. Does this block the promotion ? or this update will be merged after the promotion of kokkos ? I just don't want to do redundant work if you already start it.
Hi @kyungjoo-kim I started on this a couple days ago but worked on some kokkos promotion stuff yesterday and did not complete. I'll summarize a couple things I did so far and the issues that I ran into when I paused - if it looks like I am on the right track I can work more on this today also, but if you think the changes I have made are more work than necessary then it would be better for you to take over since you know the MKL interface, changes and the test codes better. Next message will have the info.
@kyungjoo-kim These files were pretty straightforward to update:
M src/batched/KokkosBatched_Gemm_Serial_Impl.hpp
M src/batched/KokkosBatched_Gemv_Serial_Impl.hpp
M src/batched/KokkosBatched_LU_Serial_Impl.hpp
M src/batched/KokkosBatched_Trsm_Serial_Impl.hpp
M src/batched/KokkosBatched_Trsv_Serial_Impl.hpp
M src/batched/KokkosBatched_Util.hpp
The changes necessary were:
convert cblas_*_compact
calls to mkl_*_compact
convert Cblas*
types to proper MKL types:
CblasRowMajor -> MKL_ROW_MAJOR, CblasColMajor -> MKL_COL_MAJOR
CblasLeft -> MKL_LEFT, CblasRight -> MKL_RIGHT
CblasUpper ->MKL_UPPER, CblasLower ->MKL_LOWER
CblasTrans -> MKL_TRANS, CblasNoTrans -> MKL_NOTRANS
CblasUnit -> MKL_UNIT, CblasNonUnit -> MKL_NONUNIT
Remove/comment out unused value_type
types (error when -Werror used)
Include mkl.h in the Util file when this condition is true:
#if __INTEL_MKL__ >= 18
#define __KOKKOSBATCHED_INTEL_MKL_BATCHED__ 1
#define __KOKKOSBATCHED_INTEL_MKL_COMPACT_BATCHED__ 1
These changes were straightforward and I can commit and put in PR with these changes to work off of. The perf_test's are where I'm having more issues, will post on that next.
Several compilation issues with these perf_tests:
M perf_test/batched/KokkosBatched_Test_Gemm_Host.hpp
M perf_test/batched/KokkosBatched_Test_LU_Host.hpp
M perf_test/batched/KokkosBatched_Test_Trsm_Host.hpp
Issues:
if (std::is_same<value_type,double>::value) {
cblas_dgemm(...);
} else if (std::is_same<value_type,Kokkos::complex<double> >::value) {
cblas_zgemm(...);
}
This causes compilation issues because, even though the runtime logic of the if-else is correct, both code paths must be compiled and the different function signatures + types results in errors.
I started working on this by creating static member functions within struct
s specialized on the value_type
, for example (function signatures a bit messed up I need to finish fixing...):
template < typename ScalarType >
struct CompileTimeTypeHelper;
template <>
struct CompileTimeTypeHelper< double >
{
template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13 >
static void call_cblas_gemm( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13 ) {
cblas_dgemm( arg0, arg1, arg2, arg3, arg4, arg5, arg6, (double*)arg7.data(), arg8.stride_0(), (double*)arg9.data(), arg10.stride_0(), arg11, (double*)arg12.data(), arg13.stride_0() );
}
template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13, typename Arg14, typename Arg15 >
static void call_cblas_gemm_batch( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13, Arg14 arg14, Arg15 arg15 ) {
cblas_dgemm_batch( arg0, arg1, arg2, arg3, arg4, arg5, arg6, (const double**)arg7, arg8, (const double**)arg9, arg10, arg11, arg12, arg13, arg14, arg15 );
}
template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6 >
static void call_cblas_gemm_compute_batch( Arg0 &arg0, Arg1 &arg1, Arg2 &arg2, Arg3 &arg3, Arg4 &arg4, Arg5 &arg5, Arg6 &arg6 ) {
cblas_dgemm_compute_batch( arg0, arg1, arg2, &arg3, &arg4, arg5, &arg6 );
}
};
template <>
struct CompileTimeTypeHelper< Kokkos::complex<double> >
{
template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13 >
static void call_cblas_gemm( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13 ) {
cblas_zgemm( arg0, arg1, arg2, arg3, arg4, arg5, (void*)&arg6, (void*)arg7.data(), arg8.stride_0(), (void*)arg9.data(), arg10.stride_0(), (void*)&arg11, (void*)arg12.data(), arg13.stride_0() );
}
template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6, typename Arg7, typename Arg8, typename Arg9, typename Arg10, typename Arg11, typename Arg12, typename Arg13, typename Arg14, typename Arg15 >
static void call_cblas_gemm_batch( Arg0 arg0, Arg1 arg1, Arg2 arg2, Arg3 arg3, Arg4 arg4, Arg5 arg5, Arg6 arg6, Arg7 arg7, Arg8 arg8, Arg9 arg9, Arg10 arg10, Arg11 arg11, Arg12 arg12, Arg13 arg13, Arg14 arg14, Arg15 arg15 ) {
cblas_zgemm_batch( arg0, arg1, arg2, arg3, arg4, arg5, arg6, (const void**)arg7, arg8, (const void**)arg9, arg10, arg11, (void**)arg12, arg13, arg14, arg15 );
}
template < typename Arg0, typename Arg1, typename Arg2, typename Arg3, typename Arg4, typename Arg5, typename Arg6 >
static void call_cblas_gemm_compute_batch( Arg0 &arg0, Arg1 &arg1, Arg2 &arg2, Arg3 &arg3, Arg4 &arg4, Arg5 &arg5, Arg6 &arg6 ) {
cblas_zgemm_compute_batch( arg0, arg1, arg2, &arg3, &arg4, arg5, &arg6 );
}
};
and then replacing the calls in the if-else statements with a call to the static member function, e.g.
replace cblas_dgemm(...);
or cblas_zgemm(...);
with
CompileTimeTypeHelper<value_type>::call_cblas_gemm(...);
etc.
The N
value in the following array declarations was considered variable, so I made changes:
value_type *aa[N*VectorLength]
to value_type **aa = new value_type*[N*VectorLength];
etc.
The types for variables assigned to a view's stride caused signed/unsigned comparison issues, so I added static_casts, e.g.
MKL_INT lda[1] = { static_cast<MKL_INT>(a.stride_1()) };
There was no type compact_t
defined and I could not find one in KokkosKernels or the 2018 MKL headers, so I started working creating a type to get past compilation errors, but no idea if this is the right thing to do...
struct compact_t {
CBLAS_LAYOUT layout;
MKL_INT* rows;
MKL_INT* cols;
MKL_INT* stride;
MKL_INT group_count;
MKL_INT* size_per_group;
MKL_INT format;
double* mat;
};
The errors I left off at may have been self-inflicted, for example -
/ascldap/users/ndellin/kokkos-kernels/perf_test/batched/KokkosBatched_Test_LU_Host.hpp:215:43: error: ‘LAPACKE_dgetrf_compute_batch’ was not declared in this scope
LAPACKE_dgetrf_compute_batch(&A_p);
This may be due to my attempt to create a compact_t type that is not recognized by the routine, unless the API for that routine changed.
Also, similar errors with cblas_dgemm_compute_batch
for example, but again likely due to my needing to finish fixing the signatures in the static member functions of the struct.
@ndellingwood I think that you are going well to a right direction.
regarding to the if statement based on the value type should be fine. when you call mkl blas, convert value_type pointer into what ever MKL prefer (MKL_Complex16*).
MKL changes function signature and it may not need the compact_t anymore. Please see the following article.
@kyungjoo-kim thanks for the feedback!
@kyungjoo-kim @srajama1
I am having compiler issues in the above perf_tests with calls of the type *_compute_batch(...);
, was this an experimental API name and if so do you know what it should be replaced by? I cannot find any types of *_compute_batch(...)
functions in 2017 or 2018 MKL, and google search brings up PR #130 when added to KokkosKernels, e.g. cblas_dgemm_compute_batch
.
@ndellingwood if you let me know which branch I should work on, I can change and test that in an hour.
@kyungjoo-kim I put in PR #243 with the src file changes. This was pushed to the kokkos-kernels repo so it should be open to modifications directly. Would you like me to push the changes I worked on for the perf_test or will that be messy clutter?
Oh... I see that. I got the right PR but I first looked at the perf_test code instead of the source directory.
@kyungjoo-kim you're probably pretty far along already, but I did a bit more looking in the mkl headers and saw the following in case its relevant -
It looks like in the *_compute_batch
calls in the perf tests, the suffix can be changed to _batch
and the compact_t
type removed and instead pass each individual component from that type to the functions.
The LAPACKE_dgetrf_compact
call in KokkosBatched_LU_Serial_Impl.hpp
may need to be replaced by mkl_dgetrfnp_compact
, but that will require the vector length argument be replaced by the proper MKL_COMPACT_PACK
type, which depends on the type of vector intrinsics for the arch compiled for which may require a bit more if-guards.
Edit: the MKL_COMPACT_PACK
comment above applies to all the _compact
MKL routines used in the KokkosKernels interface.
@kyungjoo-kim any progress? I worked on some additional fixes for the src files but did not resolve all the perf_test issues. I can push if you haven't had time to work on this, but did not want to cause any conflicts. @srajama1 all the issues haven't been resolved yet, will add this fix as a patch to Trilinos and KokkosKernels since it does not require or impact the full integration tests.
@ndellingwood You can push. As you describe, the mkl interface changed quite a lot. I am also working but not a lot.
@ndellingwood @kyungjoo-kim Let us not hold 2.7 for this, but let us fix this asap after 2.7 goes in. MKL is needed for other things (SpMV and SpGEMM) and now the code cannot be compiled with MKL now.
@srajama1 sounds good, we can patch Trilinos and since this is only for MKL changes it does not require the full integration testing.
Siva the 2.7 release is already out ...
@crtrott : Saw that after typing that comment. Good you came to the same conclusion, this is a performance test so it is ok you released with it. However, using MKL is an important use case, let us get it fixed.
PR #243 fixed this.
The Intel API for compact batched BLAS has changed. It is better to update these tests before the next release. The issue shows up only when used with Intel MKL TPL.