scikit-optimize / scikit-optimize

Sequential model-based optimization with a `scipy.optimize` interface
https://scikit-optimize.github.io
BSD 3-Clause "New" or "Revised" License

How are the test scores in cv_results_ and best_score_ calculated? #1013

Open JSAnandEOS opened 3 years ago

JSAnandEOS commented 3 years ago

I'm using BayesSearchCV to optimise an XGBoost model to fit some data I have. While the model fits fine, I am puzzled by the scores provided in the diagnostic information and am unable to replicate them.

Here's an example script using the Boston house prices dataset to illustrate my point:

# Note: load_boston has since been removed from scikit-learn (1.2+);
# this script assumes a version where it is still available.
from sklearn.datasets import load_boston

import numpy as np
import pandas as pd

from xgboost.sklearn import XGBRegressor

from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import KFold, train_test_split 

boston = load_boston()

# Dataset info:
print(boston.keys())
print(boston.data.shape)
print(boston.feature_names)
print(boston.DESCR)

# Put data into dataframe and label column headers:

data = pd.DataFrame(boston.data)
data.columns = boston.feature_names

# Add target variable to dataframe

data['PRICE'] = boston.target

# Split into X and y

X, y = data.iloc[:, :-1], data.iloc[:, -1]

# Split into training and validation datasets 

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# For cross-validation, split training data into 5 folds

# (random_state has no effect on KFold unless shuffle=True, and newer
# versions of scikit-learn raise an error if it is set anyway, so it
# is omitted here)
xgb_kfold = KFold(n_splits=5)

# Run fit

xgb_params = {'n_estimators': Integer(10, 3000, 'uniform'),
               'max_depth': Integer(2, 100, 'uniform'),
               'subsample': Real(0.25, 1.0, 'uniform'),
               'learning_rate': Real(0.0001, 0.5, 'uniform'),
               'gamma': Real(0.0001, 1.0, 'uniform'),
               'colsample_bytree': Real(0.0001, 1.0, 'uniform'),
               'colsample_bylevel': Real(0.0001, 1.0, 'uniform'),
               'colsample_bynode': Real(0.0001, 1.0, 'uniform'),
               'min_child_weight': Real(1, 6, 'uniform')}

xgb_fit_params = {'early_stopping_rounds': 15, 'eval_metric': 'mae', 'eval_set': [[X_val, y_val]]}

xgb_pipe = XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=10)

xgb_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold, n_iter=5, n_jobs=1,
                       random_state=42, verbose=4, scoring=None,
                       fit_params=xgb_fit_params)

xgb_cv.fit(X_train, y_train)

After running this, xgb_cv.best_score_ is 0.816, and xgb_cv.best_index_ is 3. Looking at xgb_cv.cv_results_, I want to find the best scores for each fold:

# Print the per-fold test scores for the best candidate:
for i in range(5):
    print(xgb_cv.cv_results_[f'split{i}_test_score'][xgb_cv.best_index_])

Which gives:

0.8023562337946979
0.8337404778903412
0.861370681263761
0.8749312273014963
0.7058815015739375

I'm not sure what's being calculated here, because scoring is set to None in my code. XGBoost's documentation isn't much help, but according to the docstring of xgb_cv.best_estimator_.score, it's supposed to be the R² of the predicted values.
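As far as I understand, with scoring=None the search falls back to the estimator's own score method, which for a regressor is R². A quick sanity check on the held-out set, reusing the fitted xgb_cv from above:

from sklearn.metrics import r2_score

# A regressor's default .score() is R^2, so these two lines
# should print the same value:
print(r2_score(y_val, xgb_cv.best_estimator_.predict(X_val)))
print(xgb_cv.best_estimator_.score(X_val, y_val))

Anyway, I'm unable to obtain these values when I manually try calculating the score for each fold of the data used in the fit: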

# First, need to get the actual indices of the data from each fold:

kfold_indexes = {}
for kfold_cnt, (train_index, test_index) in enumerate(xgb_kfold.split(X_train)):
    kfold_indexes[kfold_cnt] = {'train': train_index, 'test': test_index}

# Next, calculate the score for each fold
# (note: this evaluates best_estimator_ on each fold's *training* indices)
for p in range(5):
    print(xgb_cv.best_estimator_.score(X_train.iloc[kfold_indexes[p]['train']],
                                       y_train.iloc[kfold_indexes[p]['train']]))

Which gives me the following:

0.9953058342929344
0.9954795401877629
0.995066439221176
0.9950907019868662
0.995765389875457

How is BayesSearchCV calculating the scores for each fold, and why can't I replicate them using the score function? I would be most grateful for any assistance with this issue.

(Also, manually calculating the mean of these scores gives 0.8156560..., while xgb_cv.best_score_ gives 0.8159277... I'm not sure where this small difference comes from.)
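A minimal way to check this without rounding error is to take the values straight from cv_results_, again reusing the fitted xgb_cv:

import numpy as np

# Mean of the per-split test scores for the best candidate, compared
# against the stored mean and best_score_ -- I'd expect all three to match:
split_scores = [xgb_cv.cv_results_[f'split{i}_test_score'][xgb_cv.best_index_]
                for i in range(5)]
print(np.mean(split_scores))
print(xgb_cv.cv_results_['mean_test_score'][xgb_cv.best_index_])
print(xgb_cv.best_score_)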

JSAnandEOS commented 3 years ago

I tried fitting a new XGBoost model for each fold of data using the same parameters as xgb_cv.best_estimator_:

xgb_best_params = xgb_cv.best_estimator_.get_xgb_params()

for p in range(5):
    xgb_temp = XGBRegressor(**xgb_best_params)
    xgb_temp.fit(X_train.iloc[kfold_indexes[p]['train']].values,
                 y_train.iloc[kfold_indexes[p]['train']].values)
    print(xgb_temp.score(X_train.iloc[kfold_indexes[p]['test']].values,
                         y_train.iloc[kfold_indexes[p]['test']].values))

This gives me values that are much closer to the original CV scores, but they are still not exactly the same:

0.81142551
0.82771065
0.87877982
0.88612015
0.70819454

Why would there still be a discrepancy if the exact same folds and parameters are used? Is BayesSearchCV doing something else with XGBoost inside the fitting process that I'm not aware of, or is this an issue with precision?
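One difference I can see between this refit and what happens inside the search: the loop above doesn't pass the fit_params I gave to BayesSearchCV, so early stopping never kicks in. A sketch of the same loop with those forwarded (assuming the same variables as above, and an xgboost version whose fit() still accepts these arguments):

for p in range(5):
    xgb_temp = XGBRegressor(**xgb_best_params)
    # Forward the same fit_params used during the search, so that
    # early stopping behaves the same way in each fold:
    xgb_temp.fit(X_train.iloc[kfold_indexes[p]['train']].values,
                 y_train.iloc[kfold_indexes[p]['train']].values,
                 early_stopping_rounds=15, eval_metric='mae',
                 eval_set=[[X_val, y_val]])
    print(xgb_temp.score(X_train.iloc[kfold_indexes[p]['test']].values,
                         y_train.iloc[kfold_indexes[p]['test']].values))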

JSAnandEOS commented 3 years ago

I tried replacing BayesSearchCV with sklearn's RandomizedSearchCV, and my problem disappears when fitting with 5 CV folds. So it would seem that BayesSearchCV is the issue here.

However, when I tried using 2 CV folds, I found that the 2nd fold's score was different, as before. Does that mean there is something wrong with XGBoost instead? Why would it produce two different scores when given the exact same parameters, data, and random seeds each time?
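For reference, the swap looked something like this (the scipy distributions here are illustrative stand-ins for the skopt spaces; note that uniform(loc, scale) samples from [loc, loc + scale]):

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

rs_params = {'n_estimators': randint(10, 3000),
             'max_depth': randint(2, 100),
             'subsample': uniform(0.25, 0.75),
             'learning_rate': uniform(0.0001, 0.4999),
             'gamma': uniform(0.0001, 0.9999),
             'colsample_bytree': uniform(0.0001, 0.9999),
             'colsample_bylevel': uniform(0.0001, 0.9999),
             'colsample_bynode': uniform(0.0001, 0.9999),
             'min_child_weight': uniform(1, 5)}

rs_cv = RandomizedSearchCV(xgb_pipe, rs_params, cv=xgb_kfold, n_iter=5,
                           n_jobs=1, random_state=42, verbose=4)
rs_cv.fit(X_train, y_train, **xgb_fit_params)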

grudloff commented 3 years ago

Hello @JSAnandEOS! I am having a similar issue myself, and I too believe there is some kind of issue with the results of best_score_. In my use case I aim for scores on the order of 1e-12, but best_score_ comes out on the order of 1e-6, which would suggest the model is doing awfully, even though the test and train scores are both around 1e-12. In my case, I pass a scorer function. The only explanation I can think of is that the scoring values are being modified somewhere internally, by something close to a sqrt() maybe?
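For what it's worth, the sqrt() suspicion is at least numerically consistent with the magnitudes involved:

import numpy as np

# A true score around 1e-12 would surface as ~1e-6 if some internal
# step took its square root:
print(np.sqrt(1e-12))  # 1e-06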

kernc commented 3 years ago

BayesSearchCV has recently been revamped. Make sure you're testing against git master rather than the currently released version.
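For example, installing straight from the repository:

pip install git+https://github.com/scikit-optimize/scikit-optimize.git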

grudloff commented 3 years ago

@kernc I am using the git master version.