This is more of a question than an issue. It seems that the default settings for my dataset of 10k rows and 45 features result in an overfitted model. Decreasing the number of max_rounds seems to help. What are the recommended ways of avoiding overfitting to the data?
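(For context, reducing model capacity is the usual first lever here; below is a minimal sketch of a more conservative configuration, assuming interpret's ExplainableBoostingClassifier. Parameter names can differ between interpret releases, and the values are illustrative placeholders, not recommendations.)

```python
# Minimal sketch of a more conservative EBM configuration; parameter names
# follow interpret's ExplainableBoostingClassifier and may differ by version,
# and the values are illustrative placeholders.
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(
    max_rounds=1000,      # fewer boosting rounds than the default
    learning_rate=0.005,  # smaller step per round
    max_bins=64,          # coarser feature binning
    max_leaves=2,         # simpler per-feature shape functions
    interactions=0,       # no pairwise interaction terms
    random_state=42,
)
ebm.fit(X_train, y_train)  # X_train / y_train are assumed to already exist
```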
Hi @hakeemo - good question.
Hmm, we'd need some more detail: is your train/test split random or time-based, do any features stand out in the feature importances, and how severe is the overfitting? The ways to mitigate overfitting will somewhat depend on the issue you're facing, so let us know!
Thanks for the response.
It is a time-based holdout. From what I gather from roughly reading the EBM code, stratified sampling is used to construct the internal validation set. If that is true, it would cause data leakage, and I suppose that set cannot be used for early stopping.
There are no features that particularly stand out in the feature importances.
The overfitting is catastrophic (e.g. 0.9 vs. 0.6) if I use the default settings.
If you're using a time-based holdout, then there's a good chance you're working in a non-stationary environment (that is, the data and its relationships are likely changing over time).
Do you see similar overfitting when you run other learners such as random forest / gradient boosting (defaults, without custom validation), or is it specific to EBM?
You are correct that we use a stratified holdout behind the scenes. This shouldn't be an issue if the train and test sets are drawn from the same parent distribution, but it sounds like that may not be the case here.
It isn't specific to EBM. For my problem, I would always need a time-based validation set (typically TimeSeriesSplit in sklearn).
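(For reference, this is the kind of time-based evaluation being described; a minimal sketch assuming the standard scikit-learn API, which EBM follows. X and y are placeholders for the time-ordered data.)

```python
# Minimal sketch: time-based cross-validation of an EBM with scikit-learn's
# TimeSeriesSplit instead of a random/stratified split. X and y are assumed
# to be ordered by time.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)
ebm = ExplainableBoostingClassifier(random_state=42)
scores = cross_val_score(ebm, X, y, cv=tscv, scoring="roc_auc")
print(scores.mean(), scores.std())
```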
That makes sense; we'll make an attempt to get external validation sets into the next release (should be this or next week) and will let you know.
If you're going down the path of domain/shift adaptation (covariate or otherwise) to train the learner for a later time period, I'm guessing you'd want learner-handled sample weights as well. We're working on it, but there's no concrete ETA.
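(To illustrate the sample-weighting idea only: since learner-handled weights were still in progress for EBM at the time, the sketch below uses a scikit-learn learner that already accepts sample_weight, and the time-decay half-life is an arbitrary placeholder.)

```python
# Illustration of time-decay sample weights for a non-stationary problem:
# newer rows get weight close to 1, older rows are down-weighted.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

n = len(X_train)            # X_train is assumed to be ordered by time
half_life = 2000            # rows after which the weight halves (placeholder)
age = np.arange(n)[::-1]    # 0 for the newest row, n - 1 for the oldest
weights = 0.5 ** (age / half_life)

clf = GradientBoostingClassifier()
clf.fit(X_train, y_train, sample_weight=weights)
```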
@interpret-ml Those features would definitely help! Thanks for your quick response and for this great package.
@interpret-ml It has been two years since the last update on this, and I still can't find an external validation set feature?