Building tests in Kokkos Kernels exhibits low parallelism when the CUDA and OpenMP spaces are enabled.
Consider two different ETI configurations:
"small"
    scalar: float, double
    layout: LayoutLeft, LayoutRight
    offset: size_t
    ordinal: int
    execution space: OpenMP, CUDA
"large"
    scalar: float, double, complex<float>
    layout: LayoutLeft, LayoutRight
    offset: size_t, int
    ordinal: int
    execution space: OpenMP, CUDA
The small configuration is meant to be somewhat representative of what Kokkos Kernels users build, while the large one may be more common during development and testing.
In the following images, the X axis is elapsed seconds and the Y axis is the achieved parallelism of compiling translation units (i.e., .cpp files).
Neither linking nor other build operations are included.
The test system is Power9 + V100 ("vortex") with parallelism limited to 52 jobs (make -j52).
The first spike is building Kokkos.
The second plateau is building the Kokkos Kernels library.
The third spike is building the Kokkos Kernels tests.
Substantial time in the Kokkos Kernels tests is spent building fewer than 3 files simultaneously (the "long tail").
Those files are:
Test_Cuda_Batched_Dense.cpp
Test_OpenMP_Batched_Dense.cpp
Test_Cuda_Sparse.cpp
Test_Cuda_Blas.cpp
Kokkos_Blas3_perf_test.cpp
If possible, these files could be split up into multiple translation units to facilitate faster builds.
The data can be produced and analyzed with the tools at cwpearson/make-tracing.
One script replaces the shell Make uses to run commands. It wraps each command and records when that command starts and stops.
A Python script then ingests the resulting trace file to generate a summary of the results.
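The two steps can be sketched roughly as follows. This is only an illustration of the idea, not the code in cwpearson/make-tracing: the function names, the tab-separated log format, and the threshold are all assumptions made for the example. GNU Make runs each recipe line as `$(SHELL) -c '<command>'` by default, so a wrapper standing in for the shell sees one command per invocation and can time it.

```python
import subprocess
import time


def run_traced(command, log_path):
    """Run one shell command and append 'start<TAB>stop<TAB>command' to log_path.

    Hypothetical stand-in for the shell wrapper; the log format is made up.
    """
    start = time.time()
    status = subprocess.call(["/bin/sh", "-c", command])
    stop = time.time()
    with open(log_path, "a") as log:
        one_line = command.replace("\n", " ")
        log.write(f"{start:.6f}\t{stop:.6f}\t{one_line}\n")
    return status


def parallelism_profile(intervals):
    """Turn (start, stop) pairs into a step function [(time, jobs_running)]."""
    events = []
    for start, stop in intervals:
        events.append((start, 1))   # a job begins
        events.append((stop, -1))   # a job ends
    events.sort()
    profile = []
    running = 0
    for when, delta in events:
        running += delta
        if profile and profile[-1][0] == when:
            # Several events at the same instant collapse into one step.
            profile[-1] = (when, running)
        else:
            profile.append((when, running))
    return profile


def time_below(profile, threshold):
    """Total time with at least one but fewer than `threshold` jobs running."""
    total = 0.0
    for (t0, running), (t1, _) in zip(profile, profile[1:]):
        if 0 < running < threshold:
            total += t1 - t0
    return total
```

With intervals restricted to compile commands, `parallelism_profile` gives the kind of curve plotted above, and `time_below(profile, 3)` gives one possible measure of the long tail, i.e. how long the build spent compiling fewer than 3 files at once.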