intel / scikit-learn-intelex

Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
https://intel.github.io/scikit-learn-intelex/
Apache License 2.0

False error "Missing values are not supported in daal4py Gradient Boosting Trees" #960

Closed dragarok closed 1 year ago

dragarok commented 2 years ago

Can somebody tell me how I can resolve this issue? I have tried looking through my data for nulls, but I can't resolve it. This is just a help post; any guidance toward the correct approach is appreciated.

dragarok commented 2 years ago

I found a related issue here on GitHub. I checked my whole DataFrame for null values using df.isnull().values.any(), but it returns False.
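For reference, a minimal sketch of that check, plus a per-column breakdown that helps locate the offending entries whenever the overall check does come back True (the data here is illustrative, not from this thread):

```python
import numpy as np
import pandas as pd

# Illustrative DataFrame standing in for the real feature set.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [0.5, 4.0, 1.5]})

print(df.isnull().values.any())      # overall check, as in the comment above
print(df.isnull().sum())             # NaN count per column, to locate offenders
print(df[df.isnull().any(axis=1)])   # the rows that contain a NaN
```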

Alexsandruss commented 2 years ago

Can you provide your XGBoost training parameters or reproducer code? Problems with missing values are known; one possible reason is the 'gpu_hist' training method, which outputs an XGBoost booster with active missing-value indicators even if the training data has none.

dragarok commented 2 years ago

> Can you provide your XGBoost training parameters or reproducer code? Problems with missing values are known; one possible reason is the 'gpu_hist' training method, which outputs an XGBoost booster with active missing-value indicators even if the training data has none.

Oh yes, I am using 'gpu_hist' as the tree method. So is there no way to train on GPU and be sure the resulting model works with daal?

Alexsandruss commented 2 years ago

For now, XGBoost models trained with the gpu_hist method do not work with daal4py. This behavior is not expected and is probably a bug, but its source (daal4py or XGBoost) is not obvious, since hist works. I will investigate it.

dklein0 commented 2 years ago

@Alexsandruss - Is there any workaround?

Alexsandruss commented 2 years ago

It looks like the problem was solved on the XGBoost side: boosters created by the gpu_hist tree method now translate cleanly to DAAL models, as long as no missing values are present (DAAL inference does not support them).

SW used to check this case: Python 3.9.13, XGBoost 1.6.1 (pip package), daal4py/oneDAL 2021.5.0 (conda-forge packages), driver version 515.65.01, CUDA version 11.7.

HW: Tesla T4 GPU for gpu_hist training.

Testing script:

```python
import xgboost as xgb
import daal4py as d4p
import numpy as np
from sklearn.datasets import make_classification

x, y = make_classification(n_samples=10000, n_features=16, n_classes=2)
xgb_clsf = xgb.XGBClassifier(tree_method='gpu_hist')
xgb_clsf.fit(x, y)

booster = xgb_clsf.get_booster()

xgb_prediction = xgb_clsf.predict(x)
xgb_errors_count = np.count_nonzero(xgb_prediction - y)

daal_model = d4p.get_gbt_model_from_xgboost(booster)

daal_predict_algo = d4p.gbt_classification_prediction(
    nClasses=2,
    resultsToEvaluate="computeClassLabels",
    fptype='float'
)
daal_prediction = daal_predict_algo.compute(x, daal_model).prediction.astype('int').ravel()
daal_errors_count = np.count_nonzero(daal_prediction - y)

assert np.absolute(xgb_errors_count - daal_errors_count) == 0
```
Alexsandruss commented 2 years ago

Renamed the issue to be less confusing.

dklein0 commented 2 years ago

I am also running XGBoost 1.6.1, and the problem persists. I am creating a regressor, not a classifier. Probably more to the point: it worked fine on a data set of about 75,000 records; my full data set is over 5,000,000 records, and only then did the problem appear.

I did find a workaround: after tuning my hyperparameters with 'gpu_hist', I retrain the model one last time with the best hyperparameters and 'hist', and then create the daal model. In any case, I believe the bug still exists.

Alexsandruss commented 2 years ago

I ran a regressor on synthetic data of shape 7.5M x 64 and got no error. @dklein0, can you share the origin of your data? Does it have NaN/Inf values?

dklein0 commented 2 years ago

My data is financial data that I pre-processed with a C# program to create my feature set.

The feature set has NaN values, but I use a pandas DataFrame, df, and call df.dropna(inplace=True) before using it. I also checked that numpy.isinf(df).values.sum() returns zero.
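A compact version of those two checks, with illustrative data (the column names are made up):

```python
import numpy as np
import pandas as pd

# Illustrative feature set; the real data comes from the C# pre-processing step.
df = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [2.0, 0.5, 1.0]})

df.dropna(inplace=True)                # drop rows containing NaN
inf_count = np.isinf(df).values.sum()  # count remaining infinite entries

print(len(df), inf_count)
```

Note that dropna only removes NaN rows; infinite values survive it, which is why the separate isinf check is worth keeping.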


napetrov commented 1 year ago

This will be fixed by adding missing-value support; the PR on the oneDAL side: https://github.com/oneapi-src/oneDAL/pull/2345