ENH Use a 2D grid of work items where applicable

[x] Where applicable, use a 2D grid of work items rather than a 1D grid plus divisions
- note: found this piece of information: https://stackoverflow.com/a/15044884 suggesting that:
  - 2D grids really are better for performance, because doing // and % ops in the kernel is expensive
  - work items are indexed in "row major order", meaning that items in the same row belong to the same sub-group (/warp)
- that information is about CUDA, but I assume it also holds for numba-dpex / SYCL
[x] Also use 2D grids for the kmeans kernels
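
To illustrate the two indexing schemes compared above, here is a minimal plain-Python sketch. It only emulates the index arithmetic and does not call numba_dpex or SYCL. The shape `N_ROWS` x `N_COLS`, the width `SUB_GROUP_SIZE` and all function names are hypothetical, and the row-major linearization is the assumption taken from the linked answer, not something verified on numba-dpex.

```python
# Plain-Python emulation of work-item indexing; no numba_dpex or SYCL call is used.
N_ROWS, N_COLS = 4, 8   # hypothetical problem shape
SUB_GROUP_SIZE = 8      # hypothetical sub-group (warp) width


def coords_from_1d(flat_id, n_cols):
    """1D grid: each work item recovers (row, col) with // and %."""
    return flat_id // n_cols, flat_id % n_cols


def coords_from_2d(row_id, col_id):
    """2D grid: both indices are given directly, no arithmetic in the kernel."""
    return row_id, col_id


def sub_group_of(row, col, n_cols=N_COLS, size=SUB_GROUP_SIZE):
    """Sub-group index, ASSUMING work items are linearized in row-major order."""
    return (row * n_cols + col) // size


# Both launch shapes cover the same (row, col) cells, in the same order.
cells_1d = [coords_from_1d(i, N_COLS) for i in range(N_ROWS * N_COLS)]
cells_2d = [coords_from_2d(r, c) for r in range(N_ROWS) for c in range(N_COLS)]
```

Under these assumptions the two schemes address the same cells; the 2D version just avoids the per-item division and modulo.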

It's WIP because the benchmarks show that those changes induce a 30% performance overhead; there seems to be an issue with 2D grids of work items in numba_dpex or SYCL.

Opened an issue: https://github.com/IntelPython/numba-dpex/issues/941. TODO: find minimal reproducers.

This PR remains open as WIP in case those issues are eventually fixed.

Before considering merging this PR, please rebase the branch and check whether new kernels on the main branch could also benefit from the change.

soda-inria/sklearn-numba-dpex #98