This is more of a question than an issue. It seems that the default settings for my dataset of 10k rows and 45 features result in an overfitted model. Decreasing the number of max_rounds seems to help. What are the recommended ways of avoiding overfitting to the data?
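(For context, reducing model capacity is the usual first lever here; below is a minimal sketch of a more conservative configuration, assuming interpret's ExplainableBoostingClassifier. Parameter names can differ between interpret releases, and the values are illustrative placeholders, not recommendations.)

```python
# Minimal sketch of a more conservative EBM configuration; parameter names
# follow interpret's ExplainableBoostingClassifier and may differ by version,
# and the values are illustrative placeholders.
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(
    max_rounds=1000,      # fewer boosting rounds than the default
    learning_rate=0.005,  # smaller step per round
    max_bins=64,          # coarser feature binning
    max_leaves=2,         # simpler per-feature shape functions
    interactions=0,       # no pairwise interaction terms
    random_state=42,
)
ebm.fit(X_train, y_train)  # X_train / y_train are assumed to already exist
```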
Hi @hakeemo - good question.
Hmm, we'd need some more detail: is your train/test split random or time-based, do any features stand out in the feature importances, and how severe is the overfitting? The ways to mitigate overfitting will somewhat depend on the issue you're facing, so let us know!
Thanks for the response.
It is a time-based holdout. From what I gather from roughly reading the EBM code, stratified sampling is used to construct the internal validation set. If that is true, it would cause data leakage, and I suppose that set cannot be used for early stopping.
There are no features that particularly stand out in the feature importances.
The overfitting is catastrophic (e.g. 0.9 vs. 0.6) if I use the default settings.
If you're using a time-based holdout, then there's a good chance you're working in a non-stationary environment (that is, the data and its relationships are likely changing over time).
Do you see similar overfitting when you run other learners such as random forest / gradient boosting (defaults, without custom validation), or is it specific to EBM?
You are correct that we use a stratified holdout behind the scenes. This shouldn't be an issue if the train and test sets are drawn from the same parent distribution, but it sounds like that may not be the case here.
It isn't specific to EBM. For my problem, I would always need a time-based validation set (typically TimeSeriesSplit in sklearn).
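(For reference, this is the kind of time-based evaluation being described; a minimal sketch assuming the standard scikit-learn API, which EBM follows. X and y are placeholders for the time-ordered data.)

```python
# Minimal sketch: time-based cross-validation of an EBM with scikit-learn's
# TimeSeriesSplit instead of a random/stratified split. X and y are assumed
# to be ordered by time.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

tscv = TimeSeriesSplit(n_splits=5)
ebm = ExplainableBoostingClassifier(random_state=42)
scores = cross_val_score(ebm, X, y, cv=tscv, scoring="roc_auc")
print(scores.mean(), scores.std())
```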
That makes sense; we'll make an attempt to get external validation sets into the next release (should be this or next week) and will let you know.
If you're going down the path of domain/shift adaptation (covariate or otherwise) to train the learner for a later time period, I'm guessing you'd want learner-handled sample weights as well. We're working on it, but there's no concrete ETA.
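(To illustrate the sample-weighting idea only: since learner-handled weights were still in progress for EBM at the time, the sketch below uses a scikit-learn learner that already accepts sample_weight, and the time-decay half-life is an arbitrary placeholder.)

```python
# Illustration of time-decay sample weights for a non-stationary problem:
# newer rows get weight close to 1, older rows are down-weighted.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

n = len(X_train)            # X_train is assumed to be ordered by time
half_life = 2000            # rows after which the weight halves (placeholder)
age = np.arange(n)[::-1]    # 0 for the newest row, n - 1 for the oldest
weights = 0.5 ** (age / half_life)

clf = GradientBoostingClassifier()
clf.fit(X_train, y_train, sample_weight=weights)
```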
@interpret-ml Those features would definitely help! Thanks for your quick response and for this great package.
@interpret-ml It has been two years since the last update on this, and I still can't find an external validation set feature?