rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.
https://rasbt.github.io/mlxtend/

SequentialFeatureSelector generates different CV average values when StratifiedKFold with fixed random_state is used. #908

Closed: utkubzk closed this 2 years ago

utkubzk commented 2 years ago

When you call the fit function with, say, 30 features using forward=True, floating=False, it generates the metric dictionary. If you then feed the first 10 selected features (preserving their original order) back into SFS, you would expect the same CV averages, since the random state is fixed for both the CV splitter and the model.

However, very small deviations appear in the CV average scores. Even though these deviations are tiny, they sometimes lead to a different feature being selected. I am not sure whether this problem lies in mlxtend or in StratifiedKFold, but it seems problematic.

rasbt commented 2 years ago

Hm, can you post an example so I can try this? Also, it might be helpful to look at the sources of randomness. E.g., have you fixed the seed in StratifiedKFold?

The example doesn't have to use the original dataset if you can't share it; it could be something like

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=30, n_redundant=2, flip_y=0.3,
    n_clusters_per_class=1, random_state=123)
utkubzk commented 2 years ago

Below is a replication example. I used a loop to find a random seed that triggers the issue; rand=1 and rand=9 work for me.

import numpy as np  # needed for np.arange below
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from xgboost.sklearn import XGBClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

found = False
rand = 1
while not found:
    print('rand:', rand)
    X, y = make_classification(n_samples=5000, n_features=30, n_redundant=5,
                               n_informative=7, n_classes=2, random_state=rand)
    trainx = pd.DataFrame(X)
    trainy = pd.DataFrame(y)

    model_xgb = XGBClassifier(max_depth=2, scale_pos_weight=200, n_jobs=-1,
                              n_estimators=100, random_state=rand)

    skf = StratifiedKFold(n_splits=3, random_state=rand, shuffle=True)

    # Fit with all 30 features
    ml_SFS = SFS(model_xgb, k_features='parsimonious',
                 forward=True,
                 floating=False,
                 verbose=2,
                 scoring='roc_auc',
                 n_jobs=1,
                 cv=skf)
    ml_SFS = ml_SFS.fit(trainx, trainy.squeeze())
    dic_1 = ml_SFS.get_metric_dict()

    # Select the top 10 features, preserving their original column order
    features = [i for i in trainx.columns if i in dic_1[10]['feature_names']]

    # Fit again with only the selected 10 features
    ml_SFS_2 = SFS(model_xgb, k_features='parsimonious',
                   forward=True,
                   floating=False,
                   verbose=2,
                   scoring='roc_auc',
                   n_jobs=1,
                   cv=skf)
    ml_SFS_2 = ml_SFS_2.fit(trainx.loc[:, features], trainy.squeeze())
    dic_2 = ml_SFS_2.get_metric_dict()

    rand += 1
    avg_1 = [dic_1[i]['avg_score'] for i in np.arange(1, 11)]
    avg_2 = [dic_2[i]['avg_score'] for i in np.arange(1, 11)]
    found = not (avg_1 == avg_2)
rasbt commented 2 years ago

Thanks for providing the code. When I run this on my computer, it never exits the while loop, though, so I wonder if this is an XGBoost problem. XGBoost uses a lot of approximations internally, and I wonder if that's related. Have you tried this with any scikit-learn classifier, e.g., the HistGradientBoostingClassifier?

utkubzk commented 2 years ago

Thanks for trying it. I realized that this only happens with xgboost version 1.0.0 and above. You are also right that it does not occur with a scikit-learn classifier such as HistGradientBoostingClassifier, so it is probably not related to the mlxtend package. Still, I really wonder why it happens with XGBClassifier from xgboost 1.0.0 and above; that is interesting.

rasbt commented 2 years ago

Thanks for checking, glad it's not a potentially hard-to-debug mlxtend issue. I think XGBoost uses a lot of tricks to make things fast, including float32 casting, which could lead to rounding errors. It should be deterministic when run on a CPU, but in practice it may behave differently.
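As a generic illustration of the float32 rounding effect mentioned above (plain NumPy, not XGBoost internals): addition in float32 is not associative, so merely reordering the same operations can change the result.

```python
import numpy as np

# 1.0 is below the rounding precision (ulp) of 1e8 in float32,
# so it vanishes whenever it is added to a value near 1e8.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

left = (a + b) + c   # 0.0 + 1.0  -> 1.0
right = a + (b + c)  # b + c rounds back to -1e8, so the sum is 0.0

print(left, right)  # prints 1.0 0.0
```

If an internal computation sums contributions in a different order between two otherwise identical runs (e.g., due to parallelism or histogram approximations), the scores can differ in the last bits, which is consistent with the tiny CV deviations seen here.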