soda-inria / sklearn-numba-dpex

Experimental plugin allowing scikit-learn to run (some estimators) on Intel GPUs via numba-dpex.
BSD 3-Clause "New" or "Revised" License

Fix work around #906 #96

Closed fcharras closed 1 year ago

fcharras commented 1 year ago

We were hitting more and more occurrences of https://github.com/IntelPython/numba-dpex/issues/906 , to the point that it felt counter-productive to continue developing with numba_dpex until it is fixed.

After torturing the reproducers, it seems I've found a way to consistently work around the bug without having to change the logic of the kernels.

All kernels that contain barriers seem to be affected. The workaround basically consists in moving the instructions in those kernels that contain array.__setitem__ calls (or atomic functions) into dpex.func device functions.
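As a minimal illustration of the pattern (not the actual kernels in this PR; a plain-Python stand-in is used for the `dpex.func` decorator so the snippet stays self-contained, and the barrier is omitted):

```python
# Hedged sketch of the workaround pattern. `device_func` is a plain-Python
# stand-in for `@dpex.func`; real kernels are decorated with `@dpex.kernel`,
# contain `dpex.barrier()` calls, and run on the device.

def device_func(fn):
    # Stand-in for @dpex.func, which in numba_dpex compiles a device function.
    return fn

# Before the workaround: the kernel body performs the write directly,
# the shape that triggered IntelPython/numba-dpex#906 in barrier kernels.
def kernel_before(data, out, i):
    out[i] = data[i] * 2            # array.__setitem__ inside the kernel body

# After the workaround: the __setitem__ is factored out into a
# dpex.func-style device function; the kernel logic is unchanged.
@device_func
def _setitem(array, idx, value):
    array[idx] = value              # the only place __setitem__ happens

def kernel_after(data, out, i):
    _setitem(out, i, data[i] * 2)   # same computation, write moved out

data = [1.0, 2.0, 3.0]
out_before = [0.0] * 3
out_after = [0.0] * 3
for i in range(3):                  # stands in for the nd-range dispatch
    kernel_before(data, out_before, i)
    kernel_after(data, out_after, i)
assert out_before == out_after == [2.0, 4.0, 6.0]
```

The point is that the transformation is purely mechanical: each offending assignment is wrapped in a device function call, so kernel semantics are untouched.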

The unit test added in this PR demonstrates how it works on a small example. I will also expand on it in https://github.com/IntelPython/numba-dpex/issues/906 .

The minimal reproducer shows that not all such instructions need to be moved to device functions: replacing a select few seems to be enough to make the bug disappear. But since the trigger condition for this bug remains unclear, I played it safe and replaced them all.

The downside is of course readability, but I don't think we should try to improve it now (having to incorporate those unnatural dpex.func factorizations makes it very challenging anyway). Let's rather move on and hope the bug is fixed upstream in the coming months so we can revert all the hacks.

Three more unrelated fixes have infiltrated the PR:

fcharras commented 1 year ago

TY for the reviews @jjerphan @ogrisel. I've applied your suggestions and answered the question.

I've added an additional commit that fuses strict convergence checking with the main Lloyd kernel.
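For intuition, here is a hedged sketch of what fusing a strict convergence check into the Lloyd assignment step can look like. This is plain Python, not the actual dpex kernel, and the function name is made up for illustration:

```python
# Illustrative only: in the actual PR this logic lives in a numba_dpex
# kernel; here it is plain Python so the sketch is self-contained.

def lloyd_assignment_with_convergence(X, centroids, prev_labels):
    """Assign each sample to its nearest centroid and, in the same pass,
    check strict convergence (no label changed since last iteration)."""
    labels = []
    for x in X:
        # squared Euclidean distance from sample x to each centroid
        dists = [
            sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centroids
        ]
        labels.append(min(range(len(centroids)), key=dists.__getitem__))
    # Fused strict convergence check: converged iff assignments are unchanged.
    converged = labels == list(prev_labels)
    return labels, converged

X = [[0.0], [0.1], [5.0], [5.1]]
centroids = [[0.0], [5.0]]
labels, converged = lloyd_assignment_with_convergence(X, centroids, [0, 0, 1, 1])
assert labels == [0, 0, 1, 1] and converged
```

Doing the check inside the assignment pass avoids a separate kernel launch and an extra read of the labels array, which is presumably why fusing it is attractive here.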

I'll go ahead and merge if it's green after launch.

fcharras commented 1 year ago

So I noticed that using a 2D work group size harms performance (kernels 30% slower). The last commit reverts the 2D work group size diff from https://github.com/soda-inria/sklearn-numba-dpex/pull/88 . I will afterward open a revert PR for this revert commit; if things are fixed in numba_dpex regarding 2D group sizes, maybe we can merge it again later. (Theoretically, it should be better.)