ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
34.03k stars, 5.78k forks

Distributed Scikit-learn / Joblib didn't work for logistic/linear regression #47774

Open FFFFFFFHHHHHHH opened 1 month ago

FFFFFFFHHHHHHH commented 1 month ago

Description

I used joblib to speed up sklearn's logistic regression, but didn't actually get a performance improvement.

When I run the documented example (https://docs.ray.io/en/latest/ray-more-libs/joblib.html#run-on-a-cluster), I do get a significant boost: 5m -> 20s.

image

Use case

No response

wingkitlee0 commented 1 month ago

The linked example runs a hyperparameter search (RandomizedSearchCV). Joblib runs the fit for each set of parameters in parallel.

Your example is a single model fit (one set of hyperparameters). Joblib cannot parallelize that directly: the work is in the underlying linear algebra, which is parallelized by OpenMP when OMP_NUM_THREADS is set higher than 1.
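As a rough sketch of the thread-level knob mentioned above: the snippet below uses threadpoolctl (installed as a scikit-learn dependency) to list the native thread pools in the process and to limit them around a single fit. The synthetic dataset and the limit of 1 thread are assumptions chosen for illustration, not the data or settings from the issue. Whether a given solver actually speeds up with more threads depends on the solver and on how NumPy/SciPy were built, so treat this as a way to experiment rather than a guaranteed fix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from threadpoolctl import threadpool_info, threadpool_limits

# Hypothetical stand-in for the data used in the issue.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# List the native thread pools (e.g. BLAS, OpenMP) loaded in this process
# and how many threads each one is currently allowed to use.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"])

# Limit every native thread pool to one thread for the duration of the fit.
# Change the limit and time the fit to see whether threads matter here.
with threadpool_limits(limits=1):
    model = LogisticRegression(max_iter=1000).fit(X, y)
```

If you prefer the OMP_NUM_THREADS environment variable, it generally needs to be set before the numerical libraries are loaded, for example in the shell that launches the script.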

You can try changing the backend to see if the timing matches your expectation.
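To make the difference concrete, here is a minimal sketch that runs both cases under an explicit joblib backend. It uses the built-in "threading" backend so it runs without a cluster. With Ray, the documented pattern is to call `ray.util.joblib.register_ray()` and then use `joblib.parallel_backend("ray")`. The dataset and the parameter grid are assumptions made up for illustration.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical stand-in for the data used in the issue.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Case 1: hyperparameter search. Every (candidate, fold) pair is an
# independent fit, so joblib has many tasks to hand out to workers.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    cv=3,
    n_jobs=-1,
    random_state=0,
)
with joblib.parallel_backend("threading"):
    search.fit(X, y)

# Number of independent fits the search dispatched (refit excluded).
n_search_fits = search.n_splits_ * len(search.cv_results_["params"])
print("fits in the search:", n_search_fits)

# Case 2: a single fit. There is only one task, so the choice of joblib
# backend gives it nothing to distribute.
with joblib.parallel_backend("threading"):
    single = LogisticRegression(max_iter=1000).fit(X, y)
```

Timing each `with` block under different backends should show the search scaling with the number of workers while the single fit stays roughly constant.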