scikit-learn-contrib / imbalanced-learn

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
https://imbalanced-learn.org
MIT License
6.85k stars 1.29k forks

scikit-learn-intelex integration #991

Closed napetrov closed 1 year ago

napetrov commented 1 year ago

There were recent experiments on accelerating the imbalanced-learn library with scikit-learn-intelex estimators, and the results are quite promising - https://medium.com/intel-analytics-software/why-pay-more-for-machine-learning-893683bd78e4

So we are talking about pretty noticeable speedups, up to 140 times, that would benefit imbalanced-learn users. What are your thoughts on providing tighter integration?

There are multiple options that can be used here:

Very open to discussing integration options, and would be happy to address questions, concerns, or suggestions here.

glemaitre commented 1 year ago

@napetrov I am currently making the release compatible with the latest change in scikit-learn 1.3.

My vision here would be to allow the use of scikit-learn-intelex. If one explicitly activates the package, then imbalanced-learn can use it internally.

From what I recall, if a user just patches scikit-learn:

from sklearnex import patch_sklearn
patch_sklearn()

Then all subsequent scikit-learn imports will use the Intel versions.

So in the end, I don't think that we need to change anything in the codebase, do we? What we would need is some documentation in the installation guide and potentially a CI run to make sure that the tests pass with the latest scikit-learn-intelex.
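
For readers following along, here is a minimal sketch of the global-patching route described above. It assumes that scikit-learn-intelex and imbalanced-learn are both installed; the function name resample_with_patch is hypothetical and only used for illustration. The imports are deferred into the function body so that patch_sklearn() runs before imbalanced-learn pulls in scikit-learn.

```python
def resample_with_patch(X, y):
    """Undersample (X, y) after globally patching scikit-learn.

    Assumes scikit-learn-intelex and imbalanced-learn are installed.
    Hypothetical helper, not part of either package.
    """
    # Patch first: scikit-learn imports made after this call resolve to
    # the accelerated estimators where sklearnex provides them.
    from sklearnex import patch_sklearn

    patch_sklearn()

    # Import imbalanced-learn only after patching, so that its internal
    # scikit-learn imports see the patched classes.
    from imblearn.under_sampling import EditedNearestNeighbours

    return EditedNearestNeighbours().fit_resample(X, y)
```

Whether a given estimator is actually accelerated depends on what the installed scikit-learn-intelex version supports, so this should be read as an illustration of import ordering rather than a performance guarantee.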

napetrov commented 1 year ago

@glemaitre - yes, correct. Documentation and CI would be good first steps. And patch_sklearn() is the most basic option - another alternative would be to pass algorithm objects into imblearn.

from imblearn.under_sampling import EditedNearestNeighbours
from sklearnex.neighbors import NearestNeighbors
...
nn = NearestNeighbors(n_neighbors=4, n_jobs=-1)
X_resampled, y_resampled = EditedNearestNeighbours(n_neighbors=nn).fit_resample(X, y)

It is worth mentioning both in the documentation and explaining the difference - with the patch_sklearn() call you apply this to all scikit-learn calls in the script, while with direct imports you can do this for imblearn only.

I can start with an initial documentation contribution if this would help:

  1. Adding examples
  2. Dependency in getting started - either in the current list marked as optional, or in a new block for optional deps
  3. Probably a performance section in the User Guide? I don't see other good places in the current structure.

Other recommendations/suggestions are welcome.

As for code changes - this is an option for more granular control within imbalanced-learn itself. For example, we have this with PyCaret and AutoGluon - the frameworks themselves are aware of the scikit-learn-intelex package: they use from sklearnex imports instead of from sklearn imports when they detect the sklearnex package in the environment, but the dependency is not enforced in the base install, only in the full optional deps. So this gives the ability to use the relevant pieces more deliberately.
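
A rough sketch of the detect-and-fall-back pattern described above. This is not the actual PyCaret or AutoGluon code; the helper names pick_neighbors_module and load_nearest_neighbors are hypothetical, and the only assumption is that sklearnex.neighbors and sklearn.neighbors both expose NearestNeighbors.

```python
import importlib
import importlib.util


def pick_neighbors_module() -> str:
    """Return the module name to import NearestNeighbors from.

    Prefers sklearnex when it is installed, otherwise falls back to
    scikit-learn. Hypothetical helper for illustration only.
    """
    # find_spec only looks the package up; it does not import it.
    if importlib.util.find_spec("sklearnex") is not None:
        return "sklearnex.neighbors"
    return "sklearn.neighbors"


def load_nearest_neighbors():
    """Import and return the NearestNeighbors class from the chosen backend."""
    module = importlib.import_module(pick_neighbors_module())
    return module.NearestNeighbors
```

With this pattern the backend is chosen silently based on what is installed, which is exactly the trade-off under discussion: convenient for users, but the choice is not visible in their code.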

glemaitre commented 1 year ago

frameworks themselves are aware of the scikit-learn-intelex package: they use from sklearnex imports instead of from sklearn imports when they detect the sklearnex package in the environment, but the dependency is not enforced in the base install, only in the full optional deps. So this gives the ability to use the relevant pieces more deliberately.

I prefer to have an explicit way of indicating that you want to use sklearnex. Internally, I don't think that there is a huge drawback to not having granular control. I am more worried about making a magical choice for the user. While working on the scikit-learn issue tracker, we have already seen a couple of bug reports where the user, even after explicitly patching, could not identify the source of the bug.

Since I am already struggling to maintain this package, I would not go down the road of an automatic backend switch.
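
One way to read the explicit opt-in preference, as a sketch only: the backend is never switched unless the caller asks for it, and a missing package is reported loudly instead of being silently ignored. The helper enable_sklearnex and its use_sklearnex flag are hypothetical and not part of imbalanced-learn.

```python
import importlib.util


def enable_sklearnex(use_sklearnex: bool = False) -> bool:
    """Patch scikit-learn only when the caller explicitly asks for it.

    Returns True if the patch was applied. Hypothetical helper, not part
    of imbalanced-learn or scikit-learn-intelex.
    """
    if not use_sklearnex:
        # Default: never switch backends behind the user's back.
        return False
    if importlib.util.find_spec("sklearnex") is None:
        raise ImportError(
            "use_sklearnex=True was requested but scikit-learn-intelex "
            "is not installed"
        )
    from sklearnex import patch_sklearn

    patch_sklearn()
    return True
```

In practice, calling patch_sklearn() directly at the top of a script achieves the same thing; the wrapper only makes the "off by default" behaviour explicit.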

glemaitre commented 1 year ago

I added a section to the installation guide in the documentation.