Closed by onacrame 1 year ago
Hi @onacrame,
Great point -- our default validation sampling does stratify across the label, but unfortunately does not customize beyond that. Adding support for custom validation sets (which are only used for early stopping) is on our backlog, but has not been implemented yet.
An iterator is an interesting idea. We were also thinking about supplementing the fit call to take in a user defined validation_set = (X_val, y_val)
as another option (which we would then sample from for each bag of data). Would be interested to hear your thoughts on different options for defining this!
-InterpretML Team
Defining the validation set would be a great option, as one could use whatever sklearn-style iterators one wants while keeping the Interpret-ML API simpler. So the default behavior would stay as it is now, but with the ability to pass in a user-defined validation set.
@interpret-ml
In CatBoost (https://catboost.ai/docs/concepts/python-reference_catboostclassifier_fit.html#python-reference_catboostclassifier_fit), XGBoost (https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) and LightGBM (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) the fit() method has an eval_set parameter that you can use to provide "A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed".
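For illustration, here is a minimal stdlib sketch of the kind of (X_val, y_val) pair a caller might build themselves and hand to such a parameter. The helper name and the toy data are invented for the example; it simply shows a stratified holdout, which is what EBM's default sampling does automatically:

```python
import random
from collections import Counter

def stratified_holdout(X, y, val_fraction=0.2, seed=0):
    """Split (X, y) into train/validation index lists, preserving the
    label distribution in the validation set. A sketch of an eval_set
    a user could construct and pass to fit() themselves."""
    rng = random.Random(seed)
    by_label = {}
    for i, label in enumerate(y):
        by_label.setdefault(label, []).append(i)
    train_idx, val_idx = [], []
    for label, idx in by_label.items():
        rng.shuffle(idx)
        n_val = max(1, int(round(len(idx) * val_fraction)))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy imbalanced dataset: 90 negatives, 10 positives.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
train_idx, val_idx = stratified_holdout(X, y)
print(Counter(y[i] for i in val_idx))  # -> Counter({0: 18, 1: 2})
```

The point of the custom-validation-set feature request is precisely that users may want to replace this stratified logic with a grouped or time-based split instead.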
Another ancillary point is that, typically, after a model building process is finished, it's customary to train the final model on all the data, using whatever early stopping thresholds were found during cross validation or while running against a validation set. The EBM framework doesn't really allow for this, given that there's always a holdout set and no way to "refit" the model without a validation set, so some portion of the data can never be used in the final model.
Just an observation.
Another problem is that if, for example, you oversampled a class in the training set, you should not have an oversampled validation set (the validation set distribution should be similar to the test set distribution and to the live data distribution). If you split the validation set from the training set, you inherit the oversampled training set distribution. This is also true if you perform data augmentation on the training set. Splitting the validation set from the training set is often a bad idea.
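A quick stdlib sketch of the effect described above (all numbers invented): if you carve the validation set out of an oversampled training pool, the early-stopping metric sees a class balance very different from live data.

```python
import random
from collections import Counter

rng = random.Random(0)

# Original data: 5% positive class (50 positives, 950 negatives).
y = [1] * 50 + [0] * 950

# Oversample the minority class 10x in the training pool.
oversampled = y + [1] * 450
rng.shuffle(oversampled)

# Naive split: carve the validation set out of the oversampled pool.
val = oversampled[:200]
print(Counter(val))  # positives are now roughly a third of the
                     # validation set, versus 5% in the live data
```

This is why the validation set should be defined before any resampling or augmentation, not split off afterwards.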
Any timeline for when this feature will be incorporated? This is extremely crucial, especially for problems where you can't randomly split the data.
This can now be accomplished with the bags parameter. Details in our docs: https://interpret.ml/docs/ebm.html#interpret.glassbox.ExplainableBoostingClassifier.fit
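As a sketch of what the bags parameter enables, here is one way to build a group-aware bag definition. This assumes the convention described in the linked docs, where each row of the array covers one outer bag, with +1 marking a training sample, -1 a validation sample, and 0 an excluded sample; the group layout and seed are made up for the example, so check the docs for the exact contract:

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 12
# Hypothetical panel data: each ID (group) spans three rows.
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
outer_bags = 4

# One row per outer bag: +1 = training sample, -1 = validation sample.
# Each bag holds out one whole group for validation, so rows belonging
# to the same ID never straddle the train/validation split.
bags = np.full((outer_bags, n_samples), 1, dtype=np.int8)
for bag, held_out in enumerate(rng.permutation(np.unique(groups))[:outer_bags]):
    bags[bag, groups == held_out] = -1

# ebm = ExplainableBoostingClassifier()
# ebm.fit(X, y, bags=bags)
```

The same pattern covers the stratification case: build each row with any sklearn splitter (e.g. StratifiedKFold or GroupKFold indices) and set -1 at the validation positions.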
Correct me if I'm wrong, but the native early stopping mechanism within EBM will just take a random slice of the data. In the case of (i) grouped observations (panel data, where one ID might relate to multiple rows of data) or (ii) imbalanced data, where one might want to ensure stratification, a random cut may not be optimal. Is there any way to use an iterator to predefine which slice of the data is used for early stopping?