TY for the reviews @jjerphan @ogrisel, I've applied your suggestions and answered the question.
I've added an additional commit that fuses strict convergence checking with the main lloyd kernel.
I'll go ahead and merge if it's green after launch.
So I noticed that using 2D work group sizes harms performance (kernels 30% slower). The last commit reverts the 2D work group size diff from https://github.com/soda-inria/sklearn-numba-dpex/pull/88 . I will open a revert PR for this revert commit afterward; maybe once things are fixed in `numba_dpex` regarding 2D group sizes we can merge it again later (theoretically, it should be better).
We were having more and more occurrences of https://github.com/IntelPython/numba-dpex/issues/906 , to the point that it felt counter-productive to continue developing with `numba_dpex` until it's fixed. After torturing the reproducers, it seems I've found a way to consistently work around the bug without having to change the logic of the kernels.
All kernels that contain barriers seem to be affected. The workaround basically consists of moving the instructions that contain `array.__setitem__` calls (or atomic functions) in those kernels to `dpex.func` device functions. The unit test added in this PR demonstrates how it works on a small example; I will also expand on it in https://github.com/IntelPython/numba-dpex/issues/906 .

The minimal reproducer shows that not all such instructions seem to require being moved to device functions, and replacing only a select few seems to be enough to make the bugs disappear, but since the trigger condition for this bug remains unclear I went the safe way and replaced them all.
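For illustration, here is a minimal sketch of what the factorization looks like, assuming the `numba_dpex` kernel API used in this repo; `example_kernel` and `_setitem` are hypothetical names for this example, not kernels from the PR:

```python
import numba_dpex as dpex


@dpex.func
def _setitem(idx, value, array):
    # The write that used to be inlined in the kernel body now lives in a
    # device function, which seems to avoid triggering numba-dpex issue #906.
    array[idx] = value


@dpex.kernel
def example_kernel(data, result):
    i = dpex.get_global_id(0)

    # ...some per-work-item computation...
    value = data[i] * 2

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    # Instead of writing `result[i] = value` directly in the kernel,
    # delegate the __setitem__ call to the device function.
    _setitem(i, value, result)
```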
The downside is of course readability, but I don't think we should try to improve it now (the constraint of incorporating those unnatural `dpex.func` factorizations makes it very challenging anyway); let's rather move on and hope the bug is fixed upstream in the coming months so we can revert all the hacks.

Three more unrelated fixes have infiltrated the PR:
- a fix involving `n_samples` (...), not sure how we let this slip through
- the scikit-learn test `test_kmeans_verbose[0-lloyd]` suddenly decided that from now on it should fail half of the time. The reason is a mechanism in scikit-learn, called "strict convergence", that I'm skeptical about: it is met by comparing the labels computed by two successive iterations of lloyd. I had to implement the exact same behavior in our `lloyd` to consistently pass the test, but I think it should only be done when `tol == 0` (which is enough to pass the tests); see the arguments in an inline comment in the PR.
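For reference, a minimal sketch of the convergence check I have in mind, in plain NumPy rather than our actual `lloyd` kernels; `check_convergence`, `centers_shift` and the other names are placeholders for this example only:

```python
import numpy as np


def check_convergence(labels, previous_labels, centers_shift, tol):
    """Sketch of the stopping criterion discussed above.

    "Strict convergence" means the labels are identical between two
    successive Lloyd iterations. The suggestion in the PR is to only pay
    for this extra comparison when tol == 0, since the tolerance-based
    criterion already covers the other cases.
    """
    if tol == 0:
        # Strict convergence check: compare the label assignments of two
        # successive iterations; if nothing changed, stop iterating.
        return np.array_equal(labels, previous_labels)
    # Otherwise, rely on the tolerance-based check on the centroid movement.
    return centers_shift <= tol
```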