merge-sort: reduce the number of kernels to compile (#1740)

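This PR splits the single submitter, which launched all three kernels, into three separate submitters, one per kernel. As a result, 7 kernels are compiled (4 leaf, 2 global, 1 copy) instead of 24 (8 leaf, 8 global, 8 copy). Previously each of the three kernels was instantiated for every combination of the 4 _LeafSortKernel options and the 2 _IndexT options.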

It implements the following TODO (see #1735):

  // TODO: split the submitter into multiple ones to avoid extra compilation of kernels
  // - _LeafSortKernel does not need _IndexT
  // - _GlobalSortKernel does not need _LeafDPWI and _LeafWGS
  // - _CopyBackKernel does not need either of them
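
The kernel counts can be reproduced with a minimal standalone sketch. It is not the oneDPL implementation: `LeafConfig`, the `*KernelName` tags, the `submit_*` functions, and the concrete leaf configurations are hypothetical stand-ins, and `std::uint32_t`/`std::uint64_t` are only assumed here as the two `_IndexT` options. Each distinct kernel-name specialization is treated as one kernel to compile.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <typeindex>
#include <typeinfo>

// Hypothetical stand-in for one leaf-sort configuration (data per work-item, work-group size).
template <int DataPerWorkItem, int WorkGroupSize>
struct LeafConfig {};

// Kernel-name tags: every distinct specialization models one compiled kernel.
template <typename... Params> struct LeafKernelName {};
template <typename... Params> struct GlobalKernelName {};
template <typename... Params> struct CopyKernelName {};

using KernelSet = std::set<std::type_index>;

// Before: one submitter, so all three kernel names depend on both parameters.
template <typename Leaf, typename IndexT>
void submit_combined(KernelSet& kernels)
{
    kernels.insert(std::type_index(typeid(LeafKernelName<Leaf, IndexT>)));
    kernels.insert(std::type_index(typeid(GlobalKernelName<Leaf, IndexT>)));
    kernels.insert(std::type_index(typeid(CopyKernelName<Leaf, IndexT>)));
}

// After: each submitter depends only on the parameters its kernel needs.
template <typename Leaf>
void submit_leaf(KernelSet& kernels)
{
    kernels.insert(std::type_index(typeid(LeafKernelName<Leaf>)));
}

template <typename IndexT>
void submit_global(KernelSet& kernels)
{
    kernels.insert(std::type_index(typeid(GlobalKernelName<IndexT>)));
}

inline void submit_copy(KernelSet& kernels)
{
    kernels.insert(std::type_index(typeid(CopyKernelName<>)));
}

// Instantiate one leaf configuration with both (assumed) index types.
template <typename Leaf>
void instantiate(KernelSet& combined, KernelSet& split)
{
    submit_combined<Leaf, std::uint32_t>(combined);
    submit_combined<Leaf, std::uint64_t>(combined);

    submit_leaf<Leaf>(split);
    submit_global<std::uint32_t>(split);
    submit_global<std::uint64_t>(split);
    submit_copy(split);
}

// Fill both sets using four hypothetical leaf configurations.
inline void instantiate_all(KernelSet& combined, KernelSet& split)
{
    instantiate<LeafConfig<1, 64>>(combined, split);
    instantiate<LeafConfig<2, 64>>(combined, split);
    instantiate<LeafConfig<4, 128>>(combined, split);
    instantiate<LeafConfig<8, 256>>(combined, split);
}

inline std::size_t count_combined()
{
    KernelSet combined, split;
    instantiate_all(combined, split);
    return combined.size();
}

inline std::size_t count_split()
{
    KernelSet combined, split;
    instantiate_all(combined, split);
    return split.size();
}

// Checks the counts quoted in the description: 24 before, 7 after.
inline void verify_counts()
{
    assert(count_combined() == 24);
    assert(count_split() == 7);
}
```

Calling `verify_counts()` from a `main` checks both numbers. The reduction comes purely from removing unused template parameters from each kernel's name, so the leaf kernel is instantiated once per leaf configuration, the global kernel once per index type, and the copy-back kernel once.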
