soda-inria / sklearn-numba-dpex

Experimental plugin allowing scikit-learn to run (some estimators) on Intel GPUs via numba-dpex.
BSD 3-Clause "New" or "Revised" License

Add kmeans_dpcpp to benchmark #75

Closed: fcharras closed this issue 1 year ago

fcharras commented 1 year ago

Benchmark results:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 1.3 s

Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
/sklearn-numba-dpex/benchmark/./kmeans.py:108: ConvergenceWarning: Number of distinct clusters (32) found smaller than n_clusters (127). Possibly due to duplicate points in X.
  KMeans(**est_kwargs).set_params(max_iter=1).fit(
/sklearn-numba-dpex/benchmark/./kmeans.py:116: ConvergenceWarning: Number of distinct clusters (28) found smaller than n_clusters (127). Possibly due to duplicate points in X.
  estimator.fit(X, sample_weight=sample_weight)
Running kmeans_dpcpp lloyd GPU ... done in 1.2 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 0.7 s
/sklearn-numba-dpex/benchmark/./kmeans.py:108: ConvergenceWarning: Number of distinct clusters (32) found smaller than n_clusters (127). Possibly due to duplicate points in X.

(this warning usually hints that the output is random)
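For context, that ConvergenceWarning is emitted by scikit-learn itself whenever fewer distinct clusters than n_clusters survive the fit; a minimal, self-contained reproduction (illustrative data, unrelated to the actual benchmark script):

```python
import warnings

import numpy as np
from sklearn.cluster import KMeans
from sklearn.exceptions import ConvergenceWarning

# X holds only 3 distinct points, each duplicated 100 times.
X = np.tile(np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]), (100, 1))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Asking for 10 clusters when only 3 distinct points exist triggers the
    # "Number of distinct clusters ... smaller than n_clusters" warning.
    KMeans(n_clusters=10, init="random", n_init=1, max_iter=5).fit(X)
```

With duplicated points, several centroids collapse onto the same location, so fewer than n_clusters labels remain distinct and scikit-learn flags the fit as suspect.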

Unrelated: apparently, the scikit-learn-intelex bump to 2023 broke the sklearnex benchmarks. There are also issues with the docker image that was automatically built last night, and since then our test pipeline doesn't pass on main. I'm going to look into those issues. In the meantime I think the docker image at jjerphan/numba_dpex_dev is broken; the conda install suggested in the README should be preferred.

edit: fixed all the unrelated things and rebased; the benchmark can now be tested in stable conditions :pray:

fcharras commented 1 year ago

About the unrelated problems:

Please take note of this if you run into issues when running the benchmark.

fcharras commented 1 year ago

Also fixed the scikit-learn-intelex==2023.0.0 issue in https://github.com/soda-inria/sklearn-numba-dpex/pull/78 and rebased this branch, which can now safely be used again.

However, in the docker image, dpctl doesn't work with CPU devices; the issues are reported in https://github.com/IntelPython/dpctl/issues/1023

In the meantime, I'd advise following our conda install guide, which pins dpcpp_runtime<2023.0.0.

fcharras commented 1 year ago

@oleksandr-pavlyk

Update on the benchmark after the latest fixes:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 1.6 s

Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 0.7 s

Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running kmeans_dpcpp lloyd GPU ... done in 0.4 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 0.7 s

Pretty good for kmeans_dpcpp!

(edit: it's more telling with 20x more data:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 30.5 s

Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 16.2 s

Running kmeans_dpcpp lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running kmeans_dpcpp lloyd CPU ... done in 47.6 s

Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running kmeans_dpcpp lloyd GPU ... done in 7.1 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 7.6 s

the difference between DPC++ and numba_dpex seems to be mostly Python overhead, which would mostly disappear with lots of data.)
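That claim can be illustrated with a back-of-the-envelope model: total time = fixed overhead + per-sample cost × n_samples, so the overhead's share of the total shrinks as data grows. Both constants below are hypothetical, not measured values:

```python
# Hypothetical fixed Python/dispatch overhead and per-sample compute cost.
overhead_s = 0.3
per_sample_s = 1.4e-6

fractions = {}
for n in (263_256, 5_265_120):  # the two data sizes benchmarked above
    total = overhead_s + per_sample_s * n
    fractions[n] = overhead_s / total
    print(f"n={n}: {total:.1f} s total, overhead is {fractions[n]:.0%}")
```

Under these assumed constants the fixed overhead dominates at the small size but becomes a small fraction at 20x the data, matching the observed convergence of the two implementations' timings.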

(for information, results with kmeans++ for other implementations:

Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 3.4 s

Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 1.0 s

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 1.5 s

)

fcharras commented 1 year ago

Reporting in to let you know that we also managed (after having to work around https://github.com/IntelPython/dpctl/issues/1036 and https://github.com/IntelPython/numba-dpex/issues/870) to run the benchmarks on a flex 170 on the intel edge devcloud, with a very nice (5-10)x speedup.

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 2.0 s

And with even more (10x more) data:

Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(52651200, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 11.1 s

fcharras commented 1 year ago

(pinging @oleksandr-pavlyk @diptorupd for notice)