ENH small cleaning and optimizations accross the repo

fcharras commented 1 year ago

[x] Replace signature exposing sizes with a signature exposing one argument shape
[x] Where applicable, use 2D grid of work items rather than 1D + divisions
- note: found this piece of information: https://stackoverflow.com/a/15044884 suggesting that:
  - 2D grid really are better for performance because doing // and % ops in the kernel is expensive
  - it indexes work items in a "row major order" meaning that items in the same row belong to the same sub group (/warp)
- even if the previous information applies to cuda. I assume it's also true for numba-dpex/ sycl
[x] some kernels are now written only for a given shape and reused for other dimension with reshaping tricks.
[x] also use 2D grids for kmeans kernels
[x] document better the use of centroids_private_copies_max_cache_occupancy and improve the heuristic using device.max_compute_units
[x] a couple more nitpicks

~~It's WIP because using 2D groups for the sum over axis 0 kernel seems to trigger weirg bugs like in https://github.com/IntelPython/numba-dpex/issues/892 in some of the tests.~~

Out of WIP, remaining issues were unrelated, PR is green.

edit: confirmed affected by https://github.com/IntelPython/numba-dpex/issues/906

fcharras commented 1 year ago

Out of WIP. See the top level post for the list of changes.

fcharras commented 1 year ago

Something is wrong with sum(axis=1) on CPU when work_group_size is >=8. Maybe another bug in the JIT or something we don't understand about group sizes on CPU (or a bit of both). In both cases the investigation looks complicated and CPU is not the priority target so the latest commit propose forcing work_group_size == 8 with work_group_size was set to max.

I'll also run this branch on the dev cloud with the flex170 to see if it shows any performance improvement, and play with the group size to see if it affects performance more than it seems to do on local iGPU.

fcharras commented 1 year ago

I've reduced the last failure on the pipeline to what appears to be a JIT issue. The merge will be blocked until it's fixed. It's likely the same JIT issue we've seen before and it gave a much nicer reproducer. See https://github.com/IntelPython/numba-dpex/issues/906 .

fcharras commented 1 year ago

TY for the careful review @ogrisel @jjerphan the last commit should address your suggestions and I've answered some of your comments.

jjerphan commented 1 year ago

Perfect. :+1:

I leave @ogrisel merge.

ogrisel commented 1 year ago

The tests are still red. Ok for merge once fixed.

fcharras commented 1 year ago

The tests fail because of https://github.com/IntelPython/numba-dpex/issues/906 we can merge as soon as it's fixed and we can bump

If there's a conflict with other branches before that we can merge except the diff on the 2d sum kernel.

fcharras commented 1 year ago

Closed via https://github.com/soda-inria/sklearn-numba-dpex/pull/98 and https://github.com/soda-inria/sklearn-numba-dpex/pull/96

soda-inria / sklearn-numba-dpex

ENH small cleaning and optimizations accross the repo #88