kokkos / kokkos-kernels

Kokkos C++ Performance Portability Programming Ecosystem: Math Kernels - Provides BLAS, Sparse BLAS and Graph Kernels

Poor parallelism when building tests #1305

Open cwpearson opened 2 years ago

cwpearson commented 2 years ago

Building tests in Kokkos Kernels exhibits low parallelism when the CUDA and OpenMP spaces are enabled.

Consider two different ETI configurations:

"small"

float, double
layoutLeft, layoutRight
offset size_t
ordinal int
openmp, cuda

"large"

float, double, complex<float>
layoutLeft, layoutRight
offset size_t, int
ordinal int
openmp, cuda

The "small" configuration is meant to be roughly representative of what Kokkos Kernels users build, while the "large" one may be more common during development and testing.

In the following images, the X axis is elapsed seconds and the Y axis is the achieved parallelism of compiling translation units (i.e. .cpp files). Neither linking nor other build operations are included. The test system is Power9 + V100 ("vortex") with parallelism limited to 52 jobs (make -j52).
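For readers who want to reproduce the Y axis, here is a minimal sketch of how achieved parallelism could be derived from per-command start and stop times. The function names and the `(start, stop)` tuple format are assumptions made for illustration, not the format used by the actual tooling.

```python
from typing import Iterable, List, Tuple


def parallelism_profile(
    intervals: Iterable[Tuple[float, float]]
) -> List[Tuple[float, int]]:
    """Turn (start, stop) compile intervals into (time, jobs_running) steps."""
    events = []
    for start, stop in intervals:
        events.append((start, 1))   # a compile job begins
        events.append((stop, -1))   # a compile job ends
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a job that ends at t is retired before one that starts at t.
    events.sort()

    profile: List[Tuple[float, int]] = []
    running = 0
    for t, delta in events:
        running += delta
        if profile and profile[-1][0] == t:
            profile[-1] = (t, running)  # merge events at the same instant
        else:
            profile.append((t, running))
    return profile


def seconds_below(profile: List[Tuple[float, int]], threshold: int) -> float:
    """Total time during which at least one but fewer than `threshold` jobs ran."""
    total = 0.0
    for (t0, n), (t1, _) in zip(profile, profile[1:]):
        if 0 < n < threshold:
            total += t1 - t0
    return total
```

With three hypothetical jobs `[(0, 4), (1, 3), (2, 10)]`, the profile peaks at 3 concurrent jobs between t=2 and t=3, and `seconds_below(profile, 3)` reports 9 seconds spent with fewer than 3 jobs running.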

[Two images: achieved compile parallelism vs. elapsed seconds]

The first spike is building Kokkos. The second plateau is building the Kokkos Kernels library. The third spike is building the Kokkos Kernels tests.

A substantial portion of the Kokkos Kernels test build is spent compiling fewer than 3 files simultaneously (the "long tail"). Those files are

If possible, these files could be split up into multiple translation units to facilitate faster builds.


The data can be produced and analyzed with the tools at cwpearson/make-tracing.

One script replaces the shell Make uses to run commands. It wraps each command and records when that command starts and stops.
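As a rough illustration of the idea (not the actual script from cwpearson/make-tracing), a wrapper of this kind might look like the sketch below. It relies on GNU Make invoking `$(SHELL) -c '<recipe line>'` by default, so the wrapper receives `-c` and the command as arguments. The log path, the `MAKE_TRACE_LOG` environment variable, and the tab-separated log format are all assumptions made up for this example.

```python
import os
import subprocess
import sys
import time


def run_traced(shell_args, log_path, real_shell="/bin/sh"):
    """Run `real_shell` with the arguments Make passed, and log start/stop times.

    `shell_args` is expected to look like ["-c", "<recipe line>"].
    """
    start = time.time()
    returncode = subprocess.call([real_shell] + list(shell_args))
    stop = time.time()

    # Keep each record on one line: flatten tabs/newlines inside the command.
    command = " ".join(shell_args[1:]).replace("\t", " ").replace("\n", " ")
    with open(log_path, "a") as log:
        # Hypothetical format: start <TAB> stop <TAB> command
        log.write(f"{start:.6f}\t{stop:.6f}\t{command}\n")
    return returncode


# Only act as a shell replacement when Make actually passed arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    # MAKE_TRACE_LOG is a made-up variable name for this sketch.
    log_file = os.environ.get("MAKE_TRACE_LOG", "make-trace.tsv")
    sys.exit(run_traced(sys.argv[1:], log_file))
```

If this were saved as an executable script, it could be tried with something like `make SHELL=/path/to/wrapper -j52`. The wrapper returns the child's exit status so that Make still sees failures.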

The Python script ingests the resulting log file and generates a summary of the results.
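A minimal sketch of what such a summary step could look like is shown here. It assumes a hypothetical log with one tab-separated `start`, `stop`, `command` record per line, which is not necessarily the format the real tools use. It keeps only commands that mention a `.cpp` file, which excludes linking and other build steps, and lists the slowest ones.

```python
from typing import List, Tuple

Record = Tuple[float, float, str]


def parse_trace(text: str) -> List[Record]:
    """Parse 'start<TAB>stop<TAB>command' lines; skip lines with too few fields."""
    records = []
    for line in text.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue
        start, stop, command = parts
        records.append((float(start), float(stop), command))
    return records


def slowest_compiles(
    records: List[Record], suffix: str = ".cpp", top: int = 5
) -> List[Tuple[float, str]]:
    """Return up to `top` (duration, command) pairs for commands naming a source file."""
    compiles = [
        (stop - start, command)
        for start, stop, command in records
        # A token ending in the suffix marks a translation unit being compiled;
        # object files such as foo.cpp.o do not match.
        if any(token.endswith(suffix) for token in command.split())
    ]
    compiles.sort(key=lambda item: item[0], reverse=True)
    return compiles[:top]
```

Sorting by duration is one simple way to surface candidates for the "long tail". Combining it with each command's stop time would show which files are still compiling when little else is running.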

cwpearson commented 2 years ago

1341