soda-inria / sklearn-numba-dpex

Experimental plugin allowing scikit-learn to run (some estimators) on Intel GPUs via numba-dpex.
BSD 3-Clause "New" or "Revised" License

Add kmeans_dpcpp to benchmark #75

Closed: fcharras closed this issue 1 year ago

fcharras commented 1 year ago

Benchmark results:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 1.3 s

Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
/sklearn-numba-dpex/benchmark/./kmeans.py:108: ConvergenceWarning: Number of distinct clusters (32) found smaller than n_clusters (127). Possibly due to duplicate points in X.
  KMeans(**est_kwargs).set_params(max_iter=1).fit(
/sklearn-numba-dpex/benchmark/./kmeans.py:116: ConvergenceWarning: Number of distinct clusters (28) found smaller than n_clusters (127). Possibly due to duplicate points in X.
  estimator.fit(X, sample_weight=sample_weight)
Running kmeans_dpcpp lloyd GPU ... done in 1.2 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 0.7 s
/sklearn-numba-dpex/benchmark/./kmeans.py:108: ConvergenceWarning: Number of distinct clusters (32) found smaller than n_clusters (127). Possibly due to duplicate points in X.

(this warning usually hints that the output is random)
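For context, that ConvergenceWarning is emitted by scikit-learn itself whenever fewer distinct clusters than n_clusters survive the fit; a minimal, self-contained reproduction (illustrative data, unrelated to the actual benchmark script):

```python
import warnings

import numpy as np
from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning

# X holds only 3 distinct points, each duplicated 100 times.
X = np.tile(np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]), (100, 1))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Asking for 10 clusters when only 3 distinct points exist triggers the
    # "Number of distinct clusters ... smaller than n_clusters" warning.
    KMeans(n_clusters=10, init="random", n_init=1, max_iter=5).fit(X)
```

With duplicated points, several centroids collapse onto the same location, so fewer than n_clusters labels remain distinct and scikit-learn flags the fit as suspect.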

Unrelated: apparently, the scikit-learn-intelex bump to 2023 broke the sklearnex benchmarks. There are also issues with the docker image that was automatically built last night, and since then our test pipeline doesn't pass on main. I'm going to look into those issues. In the meantime I think the docker image at jjerphan/numba_dpex_dev is broken; the conda install suggested in the README should be preferred.

edit: fixed all the unrelated things and rebased; the benchmark can now be tested in stable conditions :pray:

fcharras commented 1 year ago

About the unrelated problems:

Please take note of this if you run into issues when running the benchmark.

fcharras commented 1 year ago

Also fixed the scikit-learn-intelex==2023.0.0 issue in https://github.com/soda-inria/sklearn-numba-dpex/pull/78 and rebased this branch, which can now safely be used again.

However, in the docker image, dpctl doesn't work with CPU devices; the issues are reported in https://github.com/IntelPython/dpctl/issues/1023

In the meantime, I'd advise following our conda install guide, which pins dpcpp_runtime<2023.0.0.

fcharras commented 1 year ago

@oleksandr-pavlyk

Update on the benchmark after the latest fixes:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 1.6 s

Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 0.7 s

Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running kmeans_dpcpp lloyd GPU ... done in 0.4 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 0.7 s

Pretty good for kmeans_dpcpp!

(edit: it's more telling with 20x more data:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 30.5 s

Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 16.2 s

Running kmeans_dpcpp lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running kmeans_dpcpp lloyd CPU ... done in 47.6 s

Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running kmeans_dpcpp lloyd GPU ... done in 7.1 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 7.6 s

the difference between DPC++ and numba_dpex seems to be mostly Python overhead, which would mostly disappear with lots of data.)
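That claim can be illustrated with a back-of-the-envelope model: total time = fixed overhead + per-sample cost × n_samples, so the overhead's share of the total shrinks as data grows. Both constants below are hypothetical, not measured values:

```python
# Hypothetical fixed Python/dispatch overhead and per-sample compute cost.
overhead_s = 0.3
per_sample_s = 1.4e-6

fractions = {}
for n in (263_256, 5_265_120):  # the two data sizes benchmarked above
    total = overhead_s + per_sample_s * n
    fractions[n] = overhead_s / total
    print(f"n={n}: {total:.1f} s total, overhead is {fractions[n]:.0%}")
```

Under these assumed constants the fixed overhead dominates at the small size but becomes a small fraction at 20x the data, matching the observed convergence of the two implementations' timings.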

(for information, results with kmeans++ for other implementations:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 3.4 s

Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 1.0 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 1.5 s

)

fcharras commented 1 year ago

Reporting in to let you know that we also managed (after having to work around https://github.com/IntelPython/dpctl/issues/1036 and https://github.com/IntelPython/numba-dpex/issues/870) to run the benchmarks on a flex 170 on the intel edge devcloud, with a very nice (5-10)x speedup.

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 2.0 s

And with even more (10x more) data:

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(52651200, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 11.1 s

fcharras commented 1 year ago

(pinging @oleksandr-pavlyk @diptorupd for notice)