scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

K-fold target encoding to mitigate overfitting risks #133

Open fingoldo opened 6 years ago

fingoldo commented 6 years ago

Dear maintainers, would you please consider adding to the TargetEncoder module the ability to compute target means by category in an out-of-fold fashion, using a custom folds generator?

That way, at the fitting stage we would not just compute a single smoothing value for a given category of a given column using all available rows. Instead, we would iterate over the y array fold by fold, computing the mean/smoothing on each train fold and applying it to the rows of the corresponding test fold. This could really reduce overfitting by introducing variability while still conforming to the peculiarities of the underlying data.

Coding-wise, I think it could be implemented by modifying just one procedure, namely target_encode in target_encoder.py: tmp[val]['smoothing'] would become a list instead of a scalar value, and we would need to add two loops of this kind:

for nfold, (train_idx, valid_idx) in enumerate(folding.split(X, y)):
    # treat X[train_idx] and y[train_idx] as we treat X and y now,
    # computing the smoothing for each category of each column;
    # store the results with tmp[val]['smoothing'].append(cust_smoothing),
    # since it will be a list, not a scalar
    ...

and

for nfold, (train_idx, valid_idx) in enumerate(folding.split(X)):
    # apply the learned per-fold values at the transform step
    rows = np.intersect1d(np.flatnonzero(X[column] == val), valid_idx)
    transformed_column.iloc[rows] = switch.get('mapping')[val]['smoothing'][nfold]

folding=KFold(1) would then become an additional parameter in the TargetEncoder signature, with the default behaving exactly as it does now (i.e. using the whole set at once, as one big fold containing all the data). And for those who need it, a more advanced KFold would be available should they specify more folds, or even a custom generator )
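
For illustration, here is a minimal self-contained sketch of the out-of-fold scheme described above (the helper name oof_target_encode and its parameters are illustrative, and the shrinkage shown is a simple additive one; TargetEncoder's own smoothing formula differs in detail):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(col, y, n_splits=5, smoothing=1.0, seed=0):
    # Out-of-fold target means: rows in each fold are encoded with
    # statistics computed on the remaining folds only.
    y = pd.Series(np.asarray(y), index=col.index)
    prior = y.mean()
    encoded = pd.Series(prior, index=col.index, dtype=float)
    folding = KFold(n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folding.split(col):
        stats = y.iloc[train_idx].groupby(col.iloc[train_idx]).agg(['mean', 'count'])
        # shrink the per-category means toward the global prior
        smooth = (stats['count'] * stats['mean'] + smoothing * prior) / (stats['count'] + smoothing)
        # categories unseen in the train folds fall back to the prior
        encoded.iloc[valid_idx] = col.iloc[valid_idx].map(smooth).fillna(prior).values
    return encoded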

Do you think this is feasible to implement?

janmotl commented 6 years ago

Write a pull request and it will be added.

Is my understanding of the problem correct that the described behaviour could also be obtained by wrapping TargetEncoder in a KFold cross-validation, but that the proposed solution has the benefit of potentially much better performance because the statistics would be calculated just once?
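
For concreteness, the wrapping alternative could look roughly like this (a sketch assuming pandas inputs with a unique index; note that a fresh encoder, and hence the statistics, is computed once per fold):

import pandas as pd
from sklearn.model_selection import KFold
from category_encoders import TargetEncoder

def kfold_wrapped_encode(X, y, n_splits=5):
    parts = []
    for train_idx, valid_idx in KFold(n_splits, shuffle=True, random_state=0).split(X):
        # refit on the train folds, encode only the held-out fold
        enc = TargetEncoder().fit(X.iloc[train_idx], y.iloc[train_idx])
        parts.append(enc.transform(X.iloc[valid_idx]))
    return pd.concat(parts).loc[X.index]  # restore the original row order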

fingoldo commented 6 years ago

I think the biggest advantage is having everything in one package and the ability to use it in a pipeline ) Thank you, we will prepare a PR in the coming month and get back to you then )

JohnnyC08 commented 6 years ago

I know another popular approach is to calculate the mean using the leave-one-out approach. Maybe we can have a KfoldsEncoder, and the TargetEncoder can use either KfoldsEncoder or LeaveOneOutEncoder for its mean calculation.
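
For reference, the leave-one-out mean has a closed form that needs no folds at all. A minimal illustrative sketch (not the library's LeaveOneOutEncoder implementation):

import pandas as pd

def loo_target_mean(cats, y):
    # Each row gets the mean target of its category computed over all
    # *other* rows: (category_sum - y_i) / (category_count - 1).
    df = pd.DataFrame({'cat': cats, 'y': y})
    grp = df.groupby('cat')['y']
    return (grp.transform('sum') - df['y']) / (grp.transform('count') - 1)  # NaN for singleton categories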

Thoughts?

lumliolum commented 5 years ago

Do we have this functionality in category_encoders now?

janmotl commented 5 years ago

It's not implemented. PRs are welcomed.

DenisVorotyntsev commented 5 years ago

Hello. I ran a benchmark of categorical encoders on several datasets and found that double validation (i.e. KFold within the train data) is a must for target-based encoders. The results are in my repo.

Also, you might add Double Validation as an additional class, as I did here (DoubleValidationEncoderNumerical). It works for all target-based encoders.
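
A minimal sketch of that wrapper pattern (the class and parameter names here are illustrative, not the ones from the linked repo; it assumes pandas X and y with a unique index):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.model_selection import KFold

class DoubleValidationWrapper(BaseEstimator, TransformerMixin):
    # Works with any target-based encoder exposing fit/transform
    # (TargetEncoder, CatBoostEncoder, ...).
    def __init__(self, base_encoder, n_splits=5, random_state=0):
        self.base_encoder = base_encoder
        self.n_splits = n_splits
        self.random_state = random_state

    def fit_transform(self, X, y):
        kf = KFold(self.n_splits, shuffle=True, random_state=self.random_state)
        self.encoders_, parts = [], []
        for train_idx, valid_idx in kf.split(X):
            enc = clone(self.base_encoder).fit(X.iloc[train_idx], y.iloc[train_idx])
            parts.append(enc.transform(X.iloc[valid_idx]))  # out-of-fold encoding
            self.encoders_.append(enc)
        return pd.concat(parts).loc[X.index]

    def transform(self, X):
        # test time: average the encodings produced by the fold encoders
        return sum(enc.transform(X) for enc in self.encoders_) / len(self.encoders_)

In a Pipeline this gives exactly the asymmetry we want: fit_transform (out-of-fold) is used during training and transform (averaged) at prediction time.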

janmotl commented 5 years ago

That looks interesting.

Just a few quick notes:

  1. Beware that CatBoostEncoder is sensitive to the order of samples -> consider shuffling the rows in the data preparation step.
  2. Why are some measurements missing (e.g. in Table 1.1)? Just state the reason, so we know it is not an error in the reporting.
  3. The report should mention the measures used (AUC).

I will have a better look at it later.

DenisVorotyntsev commented 5 years ago

Thank you for the feedback. The repo is still somewhat raw and I will add more info about versions and experiment settings later, but the results can already be used.

About the notes:

  1. Good point.
  2. It's not an error in the report: I just don't have enough memory to run such experiments (sparse categorical representation + big datasets). I will add a note about it.
  3. Agreed, will add it.

janmotl commented 5 years ago

It would also be interesting to explain why a single internal cross-validation improves the AUC of BackwardDifferenceEncoder (surprising) while it does not improve the AUC of HelmertEncoder and SumEncoder (not surprising). It must be because of the different formulation of the contrasts.

Also, I do not understand why the internal cross-validation improves the AUC of OrdinalEncoder. Is it because it places rare values into a "single basket" called "new value"?

It would also be interesting to use the explanations from the previous two paragraphs to explain why the internal cross-validation helps a lot on some datasets (e.g. poverty_B) while it does not help on others (e.g. poverty_A). Of course, some of the differences can be explained by chance. But then some measure of accuracy should be provided (like percentiles in a boxplot...).

DenisVorotyntsev commented 5 years ago

> It would also be interesting to explain why a single internal cross-validation improves the AUC of BackwardDifferenceEncoder (surprising) while it does not improve the AUC of HelmertEncoder and SumEncoder (not surprising). It must be because of the different formulation of the contrasts.

It might be a result of: (1) better hyperparameter optimization, i.e. during Single Validation the validation folds are encoded the same way as the test dataset, so they become more similar and the number of trees becomes optimal (at least more optimal than during None Validation); I ran experiments on two datasets and the number of trees differed between Single and None Validation; (2) bigger diversity across folds.

I think the same logic might apply to all kinds of encoders (even TF-IDF).

> Also, I do not understand why the internal cross-validation improves the AUC of OrdinalEncoder. Is it because it places rare values into a "single basket" called "new value"?

I think it's because of "new" categories. During Single or None Validation, the model does not know how to deal with new categories in the test set. So Double Validation is a sort of augmentation/regularization of the train data.

I summarized my findings in the article.

janmotl commented 5 years ago

Notes on the referenced article:

> Category representation — Sum Encoding ()

Empty braces.

Instead of y+ there is n in the denominator. What is n? Say so in the article, and include a reference.

Questions:

  1. It looks like "NewCategory" gets the same encoding as "D" in Sum Encoding and Helmert encoding. Do you have a proposal for how to do it better?
  2. Shouldn't the frequencies in FrequencyEncoder be normalized by the dataset size?
  3. Do you have any other suggestions?

DenisVorotyntsev commented 5 years ago

> Empty braces.

Corrected.

> What is n? Say so in the article, and include a reference.

It was introduced at the beginning of the section.

> Shouldn't the frequencies in FrequencyEncoder be normalized by the dataset size?

It doesn't matter for tree-based models. For other cases, the frequencies should be normalized, or log-transformed and then normalized.
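
A quick illustration of the tree-based point: raw counts and normalized frequencies differ only by a constant positive factor, so they admit the same threshold splits.

import pandas as pd

cats = pd.Series(['a', 'a', 'b', 'c', 'c', 'c'])
counts = cats.map(cats.value_counts())  # per-row counts: 2, 2, 1, 3, 3, 3
freqs = counts / len(cats)              # the same values divided by 6
# any split a tree can make on `counts` has an equivalent split on
# `freqs`, so tree-based models are unaffected by the normalization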

HiIamJeff commented 4 years ago

Hi contributors, I am wondering what the timeline is for this specific implementation. Would love to include this in the existing pipeline. Thanks!