This PR splits the single submitter that launches 3 kernels into 3 separate submitters, one per kernel. As a result, 7 kernels are compiled (4 leaf, 2 global, 1 copy-back) instead of 24 (8 leaf, 8 global, 8 copy-back, since every kernel was instantiated for each of the 4 _LeafSortKernel option sets times the 2 _IndexT options).
It implements the following TODO (see #1735):
// TODO: split the submitter into multiple ones to avoid extra compilation of kernels
// - _LeafSortKernel does not need _IndexT
// - _GlobalSortKernel does not need _LeafDPWI and _LeafWGS
// - _CopyBackKernel does not need either of them
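The effect on the number of instantiations can be illustrated with a minimal sketch. This is not the code from the PR: `LeafConfig`, `CombinedSubmitter`, `LeafSubmitter`, `GlobalSubmitter`, `CopyBackSubmitter`, and the two counting helpers are hypothetical names, and the concrete option values (`4`, `256`, `std::uint32_t`, `std::uint64_t`) are assumptions chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for one (_LeafDPWI, _LeafWGS) choice.
template <std::size_t DataPerWorkItem, std::size_t WorkGroupSize>
struct LeafConfig {};

// Before: one submitter carries every parameter, so each of its three
// kernels is a distinct type for every (IndexT, LeafCfg) pair.
template <typename IndexT, typename LeafCfg>
struct CombinedSubmitter
{
    struct LeafKernel {};
    struct GlobalKernel {};
    struct CopyBackKernel {};
};

// After: each submitter depends only on the parameters its kernel uses.
template <typename LeafCfg>
struct LeafSubmitter { struct Kernel {}; };

template <typename IndexT>
struct GlobalSubmitter { struct Kernel {}; };

struct CopyBackSubmitter { struct Kernel {}; };

// Kernels compiled with one combined submitter: 3 kernels per parameter pair.
constexpr int combined_kernel_count(int leaf_options, int index_options)
{
    return 3 * leaf_options * index_options;
}

// Kernels compiled with split submitters: each kernel varies only with
// its own parameter, and the copy-back kernel does not vary at all.
constexpr int split_kernel_count(int leaf_options, int index_options)
{
    return leaf_options + index_options + 1;
}

using CfgA = LeafConfig<4, 256>; // assumed values, for illustration only

// The combined copy-back kernel is duplicated per index type even though
// it does not use the index type.
static_assert(!std::is_same<CombinedSubmitter<std::uint32_t, CfgA>::CopyBackKernel,
                            CombinedSubmitter<std::uint64_t, CfgA>::CopyBackKernel>::value,
              "combined submitter yields one copy-back kernel per parameter pair");

static_assert(combined_kernel_count(4, 2) == 24, "8 leaf + 8 global + 8 copy-back");
static_assert(split_kernel_count(4, 2) == 7, "4 leaf + 2 global + 1 copy-back");
```

With 4 leaf option sets and 2 index types, the combined form yields 3 x 4 x 2 = 24 kernel types, while the split form yields 4 + 2 + 1 = 7, matching the counts stated above.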