Closed fcharras closed 1 year ago
About the unrelated problems:
- conda install instructions temporarily fixed in https://github.com/soda-inria/sklearn-numba-dpex/pull/76, pending https://github.com/IntelPython/dpctl/issues/1022
- the docker image is being rebuilt and will be available in ~1hr from now (see https://github.com/soda-inria/sklearn-numba-dpex/pull/77)
Please take note of this if you run into issues when running the benchmark.
Also fixed the `scikit-learn-intelex==2023.0.0` issue in https://github.com/soda-inria/sklearn-numba-dpex/pull/78 and rebased this branch, which can be used safely again.
However, in the docker image, `dpctl` doesn't work with CPU devices; issue reported in https://github.com/IntelPython/dpctl/issues/1023. In the meantime, I'd advise following our conda install guide, which pins `dpcpp_runtime<2023.0.0`.
@oleksandr-pavlyk
Update on the benchmark after the latest fixes:
```
Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 1.6 s
Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 0.7 s
Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running kmeans_dpcpp lloyd GPU ... done in 0.4 s
Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 0.7 s
```
Pretty good for `kmeans_dpcpp`!
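For context, all of the timings here benchmark Lloyd's algorithm: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal NumPy sketch of that iteration, for reference only (the `lloyd` helper below is hypothetical, not one of the benchmarked implementations):

```python
import numpy as np

def lloyd(X, centroids, max_iter=100, tol=1e-4):
    """Minimal Lloyd's k-means: assign points to their nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(max_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(len(centroids))
        ])
        shift = float(np.linalg.norm(new - centroids))
        centroids = new
        if shift < tol:  # converged: centroids barely moved
            break
    return centroids, labels

# Two well-separated blobs; seed one centroid in each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
centroids, labels = lloyd(X, X[[0, -1]].copy())
```

The GPU implementations parallelize exactly these two steps (assignment and update) as device kernels; the benchmark numbers measure how well each backend does that.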
(edit: it's more telling with 20x more data:
```
Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 30.5 s
Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 16.2 s
Running kmeans_dpcpp lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running kmeans_dpcpp lloyd CPU ... done in 47.6 s
Running kmeans_dpcpp lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running kmeans_dpcpp lloyd GPU ... done in 7.1 s
Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 7.6 s
```
The difference between DPC++ and `numba_dpex` seems to be mostly Python overhead, which mostly disappears with lots of data.)
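Assuming a simple linear cost model t(n) = overhead + k*n, the two `numba_dpex` GPU timings above (0.7 s at 263256 samples, 7.6 s at 20x that) give a back-of-the-envelope estimate of the fixed per-call overhead:

```python
# Cost model: t(n) = c + k * n, fitted to the two numba_dpex GPU timings above.
n1, t1 = 263_256, 0.7      # small run
n2, t2 = 20 * n1, 7.6      # run with 20x more data
k = (t2 - t1) / (n2 - n1)  # per-sample cost
c = t1 - k * n1            # fixed (Python/dispatch) overhead per call
print(f"overhead ~ {c:.2f} s, per-sample ~ {k:.2e} s")
```

Roughly a third of a second of fixed cost per call, which is about half of the small-run timing but negligible at the larger size, consistent with the overhead interpretation.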
(For information, results with kmeans++ init for the other implementations:
```
Running Sklearn vanilla lloyd with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Sklearn vanilla lloyd ... done in 3.4 s
Running daal4py lloyd CPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running daal4py lloyd CPU ... done in 1.0 s
Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(263256, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 1.5 s
```
)
Reporting in to let you know that we also managed (after working around https://github.com/IntelPython/dpctl/issues/1036 and https://github.com/IntelPython/numba-dpex/issues/870) to run the benchmarks on a Flex 170 on the Intel Edge DevCloud, with a very nice 5-10x speedup.
```
Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(5265120, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 2.0 s
```
And with even more (10x more) data:
```
Running Kmeans numba_dpex lloyd GPU with parameters sample_weight=unary n_clusters=127 data_shape=(52651200, 14) max_iter=100...
Running Kmeans numba_dpex lloyd GPU ... done in 11.1 s
```
(pinging @oleksandr-pavlyk @diptorupd for notice)
Benchmark results:
`kmeans_dpcpp` runs but does not return good results (the warning usually hints that the output is random). The `numba_dpex` implementation is faster (NB: the benchmark ignores JIT overhead), but maybe `work_group_size` and `centroids_window_height` must be tinkered with for `kmeans_dpcpp`. Also, I think `kmeans_dpcpp` implements a version that we had before https://github.com/soda-inria/sklearn-numba-dpex/pull/51 and https://github.com/soda-inria/sklearn-numba-dpex/pull/49, which changed the heuristic used to choose the window over centroids and improved performance noticeably while pruning the code overall. I'd recommend copying this heuristic from `main`. The `KMeans` code on `main` has stabilized, and we don't plan to change it as much as when it was being rewritten at the time (sorry for that).

Unrelated: apparently, the `scikit-learn-intelex` bump to `2023` broke the `sklearnex` benchmarks. There are also issues with the docker image that was automatically built last night; since then, our test pipeline doesn't pass on main. I'm going to look into those issues. In the meantime, I think the docker image at `jjerphan/numba_dpex_dev` is broken; the conda install suggested in the README should be preferred.

edit: fixed all unrelated things and rebased, the benchmark can now be tested in stable conditions :pray:
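If `work_group_size` and `centroids_window_height` do need tinkering, a small timing sweep over candidate values is usually enough to find a good setting. A generic sketch, where `run_kmeans` is a hypothetical stand-in for the real kernel launcher:

```python
import time

def sweep(run_kmeans, candidates, repeats=3):
    """Time run_kmeans(work_group_size=...) for each candidate value and
    return the fastest setting together with all best-of-N timings."""
    timings = {}
    for wgs in candidates:
        # Best of `repeats` runs, to reduce measurement noise.
        timings[wgs] = min(_timed(run_kmeans, wgs) for _ in range(repeats))
    best = min(timings, key=timings.get)
    return best, timings

def _timed(fn, wgs):
    start = time.perf_counter()
    fn(work_group_size=wgs)
    return time.perf_counter() - start

# Dummy workload standing in for the real kernel, with a sweet spot at 128.
def dummy(work_group_size):
    time.sleep(abs(work_group_size - 128) / 10_000)

best, timings = sweep(dummy, [32, 64, 128, 256])
```

In practice the candidate values would be powers of two up to the device's maximum work-group size, queried from the SYCL device rather than hard-coded.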