TY for the reviews @jjerphan @ogrisel, I've applied your suggestions and answered the question.
I've added an additional commit that fuses strict convergence checking with the main lloyd kernel.
I'll go ahead and merge if it's green after launch.
So I noticed that using 2D work group sizes harms performance (kernels 30% slower). The last commit reverts the 2D work group size diff from https://github.com/soda-inria/sklearn-numba-dpex/pull/88 . I will open a revert PR for this revert commit afterward; maybe once things are fixed in `numba_dpex` regarding 2D group sizes we can merge it again later (theoretically, it should be better).
We were having more and more occurrences of https://github.com/IntelPython/numba-dpex/issues/906 , to the point that it felt counter-productive to continue developing with `numba_dpex` until it's fixed. After torturing the reproducers, it seems I've found a way to consistently work around the bug without having to change the logic of the kernels.
All kernels that contain barriers seem to be affected. The workaround basically consists of moving the instructions that contain `array.__setitem__` calls (or atomic functions) in those kernels to `dpex.func` device functions. The unit test added in this PR demonstrates how it works on a small example; I will also expand on it in https://github.com/IntelPython/numba-dpex/issues/906 .

The minimal reproducer shows that not all such instructions seem to require being moved to device functions, and replacing only a select few seems to be enough to make the bugs disappear, but since the trigger condition for this bug remains unclear I went the safe way and replaced them all.
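For illustration, here is a minimal sketch of what the factorization looks like, assuming the `numba_dpex` kernel API used in this repo; `example_kernel` and `_setitem` are hypothetical names for this example, not kernels from the PR:

```python
import numba_dpex as dpex


@dpex.func
def _setitem(idx, value, array):
    # The write that used to be inlined in the kernel body now lives in a
    # device function, which seems to avoid triggering numba-dpex issue #906.
    array[idx] = value


@dpex.kernel
def example_kernel(data, result):
    i = dpex.get_global_id(0)

    # ...some per-work-item computation...
    value = data[i] * 2

    dpex.barrier(dpex.LOCAL_MEM_FENCE)

    # Instead of writing `result[i] = value` directly in the kernel,
    # delegate the __setitem__ call to the device function.
    _setitem(i, value, result)
```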
The downside is of course readability, but I don't think we should try to improve it now (the constraint of incorporating those unnatural `dpex.func` factorizations makes it very challenging anyway); let's rather move on and hope the bug is fixed upstream in the coming months so we can revert all the hacks.

Three more unrelated fixes have infiltrated the PR:
- a fix involving `n_samples` (...), not sure how we let this slip through
- the scikit-learn test `test_kmeans_verbose[0-lloyd]` suddenly decided that from now on it should fail half of the time. The reason is a mechanism in scikit-learn, called "strict convergence", that I'm skeptical about: it is met by comparing the labels computed by two successive iterations of lloyd. I had to implement the exact same behavior in our `lloyd` to consistently pass the test, but I think it should only be done when `tol == 0` (which is enough to pass the tests); see the arguments in an inline comment in the PR.
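For reference, a minimal sketch of the convergence check I have in mind, in plain NumPy rather than our actual `lloyd` kernels; `check_convergence`, `centers_shift` and the other names are placeholders for this example only:

```python
import numpy as np


def check_convergence(labels, previous_labels, centers_shift, tol):
    """Sketch of the stopping criterion discussed above.

    "Strict convergence" means the labels are identical between two
    successive Lloyd iterations. The suggestion in the PR is to only pay
    for this extra comparison when tol == 0, since the tolerance-based
    criterion already covers the other cases.
    """
    if tol == 0:
        # Strict convergence check: compare the label assignments of two
        # successive iterations; if nothing changed, stop iterating.
        return np.array_equal(labels, previous_labels)
    # Otherwise, rely on the tolerance-based check on the centroid movement.
    return centers_shift <= tol
```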