intel / scikit-learn-intelex

Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
https://intel.github.io/scikit-learn-intelex/
Apache License 2.0

Variability in results using sklearnex with ExtraTrees and RandomForest classifiers #1916

Open YoochanMyung opened 3 weeks ago

YoochanMyung commented 3 weeks ago

**Describe the bug**
I get different results when turning sklearnex on and off with the ExtraTrees and RandomForest classifiers. The issue occurs starting with version 2024.1. I found it with my own dataset, and it is also reproducible with the breast_cancer dataset, but not with the iris dataset.

**To Reproduce**

  1. Install 'scikit-learn==1.5.1' (any version from 1.2.1 onward)
  2. Install 'scikit-learn-intelex==2024.1' (any version from 2024.1 onward)
  3. Run the following test code:
    
    import pandas as pd

    from sklearnex import patch_sklearn
    patch_sklearn()

    from xgboost import XGBClassifier
    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
    from sklearn.metrics import multilabel_confusion_matrix, confusion_matrix

    from sklearn.model_selection import cross_val_predict, train_test_split
    from sklearn.preprocessing import LabelEncoder, StandardScaler, label_binarize
    from sklearn.metrics import matthews_corrcoef, confusion_matrix

    N_CORES = 16

    # Toy Data
    from sklearn.datasets import load_iris, load_breast_cancer
    data = load_breast_cancer()
    X = data['data']
    y = data['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

    # ExtraTrees
    classifier_cv = ExtraTreesClassifier(n_estimators=300, random_state=1, n_jobs=N_CORES)
    classifier_test = ExtraTreesClassifier(n_estimators=300, random_state=1, n_jobs=N_CORES)

    cv_results = cross_val_predict(classifier_cv, X_train, y_train, cv=10)
    classifier_test.fit(X_train, y_train)

    test_results = classifier_test.predict(X_test)
    print("###CV###")
    print(matthews_corrcoef(y_train, cv_results))
    print(confusion_matrix(y_train, cv_results).ravel())

    print("###TEST###")
    print(matthews_corrcoef(y_test, test_results))
    print(confusion_matrix(y_test, test_results).ravel())


**Expected behavior**
Same results between using sklearnex and original sklearn.

**Output/Screenshots**

Before patching sklearnex with ExtraTrees

    ###CV###
    0.935861738490973
    [144 5 7 242]
    ###TEST###
    0.9247930594534806
    [ 58 5 1 107]


After patching sklearnex with ExtraTrees

    Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
    ###CV###
    0.9409328452526324
    [143 6 5 244]
    ###TEST###
    0.8992907835033845
    [ 57 6 2 106]
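To put the gap in perspective, here is a small standard-library sketch that recomputes the Matthews correlation coefficient from the four confusion matrices reported above. It assumes the binary `ravel()` order `tn, fp, fn, tp`. The `mcc` and `net_shift` helpers and the `"stock"`/`"patched"` labels are names made up for this illustration, not part of scikit-learn or sklearnex.

```python
from math import sqrt


def mcc(tn, fp, fn, tp):
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def net_shift(a, b):
    """Net number of samples that ended up in a different cell between two runs."""
    return sum(abs(x - y) for x, y in zip(a, b)) // 2


# Counts copied from the output above, in ravel() order: tn, fp, fn, tp.
reported = {
    ("stock", "CV"): (144, 5, 7, 242),
    ("stock", "TEST"): (58, 5, 1, 107),
    ("patched", "CV"): (143, 6, 5, 244),
    ("patched", "TEST"): (57, 6, 2, 106),
}

scores = {key: mcc(*counts) for key, counts in reported.items()}
for (variant, split), score in scores.items():
    print(f"{variant:8s} {split:5s} MCC = {score:.6f}")

for split in ("CV", "TEST"):
    moved = net_shift(reported[("stock", split)], reported[("patched", split)])
    total = sum(reported[("stock", split)])
    print(f"{split}: net {moved} of {total} predictions differ")
```

The recomputed scores agree with the printed ones, so the difference comes from a handful of individual predictions changing rather than from how the metric is calculated. Note that `net_shift` only counts net movement between cells, so it is a lower bound on how many predictions actually changed.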



**Environment:**
- OS: Ubuntu 22.04.04 LTS
- Scikit-learn==1.5.1, but I also tested 1.2.1, 1.3.x, 1.4.x, etc.

YoochanMyung commented 3 weeks ago

Not sure whether it's related, but when I use Intelex, I get a warning: `UserWarning: X does not have valid feature names, but ExtraTreesClassifier was fitted with feature names`. Maybe there is a glitch in how Intelex handles the feature names or their order?