fingoldo opened this issue 6 years ago
Write a pull request and it will be added.
Is my understanding of the problem correct that the described behaviour could also be obtained by wrapping TargetEncoder into a KFold cross-validation, but the proposed solution has a benefit of potentially much better performance because statistics would be calculated just once?
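For illustration, a minimal sketch of that wrapping (assuming only the existing TargetEncoder fit/transform API; the helper name and column handling are made up):

```python
import pandas as pd
from sklearn.model_selection import KFold
from category_encoders import TargetEncoder

def kfold_wrapped_target_encode(X, y, cols, n_splits=5, random_state=0):
    """Out-of-fold encoding by wrapping TargetEncoder in KFold: each fold is
    transformed by an encoder fitted on the remaining folds, so the category
    statistics are indeed recomputed once per fold."""
    encoded = pd.DataFrame(index=X.index, columns=cols, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, val_idx in kf.split(X):
        enc = TargetEncoder(cols=cols)
        enc.fit(X.iloc[train_idx], y.iloc[train_idx])
        encoded.iloc[val_idx] = enc.transform(X.iloc[val_idx])[cols].values
    return encoded
```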
I think the biggest advantage is having everything in one package and the ability to use it in a pipeline ) Thank you, we will prepare a PR in the coming month and get back to you then )
I know another popular approach is to calculate the mean using the leave-one-out approach. Maybe we can have an encoder be a KfoldsEncoder, and the TargetEncoder can use either KfoldsEncoder or LeaveOneOutEncoder for its mean calculation. Thoughts?
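For reference, a hedged sketch of the leave-one-out mean for a single column (each row's encoding excludes its own target value); the function name is made up and no smoothing is applied:

```python
import pandas as pd

def leave_one_out_mean(categories: pd.Series, y: pd.Series) -> pd.Series:
    """For each row, the target mean over all *other* rows of the same category:
    (category_sum - y_i) / (category_count - 1)."""
    grouped = y.groupby(categories)
    cat_sum = grouped.transform("sum")
    cat_count = grouped.transform("count")
    denom = (cat_count - 1).where(cat_count > 1)  # NaN for singleton categories
    return (cat_sum - y) / denom
```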
Do we have the functionality now in category encoders?
It's not implemented yet. PRs are welcome.
Hello. I benchmarked categorical encoders on several datasets and found that double validation (i.e. KFold within the train data) is a must for target-based encoders. I have provided the results in my repo.
Also, you might add Double Validation as an additional class, as I did here (DoubleValidationEncoderNumerical). It works for all target-based encoders.
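For illustration only (this is not the referenced DoubleValidationEncoderNumerical, just a rough sketch of the idea), a wrapper that assumes any encoder with the usual fit/transform interface and a fully numeric output with the same columns as the input:

```python
import copy
import pandas as pd
from sklearn.model_selection import KFold

class DoubleValidationWrapper:
    """KFold within the train data: train rows are encoded out-of-fold,
    test rows by averaging the per-fold encoders."""

    def __init__(self, encoder, n_splits=5, random_state=0):
        self.encoder = encoder
        self.n_splits = n_splits
        self.random_state = random_state
        self.fold_encoders_ = []

    def fit_transform(self, X, y):
        out = pd.DataFrame(index=X.index, columns=X.columns, dtype=float)
        kf = KFold(n_splits=self.n_splits, shuffle=True, random_state=self.random_state)
        for train_idx, val_idx in kf.split(X):
            enc = copy.deepcopy(self.encoder)
            enc.fit(X.iloc[train_idx], y.iloc[train_idx])
            out.iloc[val_idx] = enc.transform(X.iloc[val_idx]).values
            self.fold_encoders_.append(enc)
        return out

    def transform(self, X):
        # Test-time encoding: average the transforms of the fold encoders.
        parts = [enc.transform(X) for enc in self.fold_encoders_]
        return sum(parts) / len(parts)
```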
That looks interesting.
Just a few quick notes: 1) Beware that CatBoostEncoder is sensitive to the order of the samples -> consider shuffling the rows in the data preparation step. 2) Why are some measurements missing (e.g. in Table 1.1)? Just state the reason so we know it is not an error in the reporting. 3) The report should mention the measures used (AUC).
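On note 1, a tiny example of the suggested shuffle (toy data; the random_state is arbitrary):

```python
import pandas as pd
from sklearn.utils import shuffle
from category_encoders import CatBoostEncoder

# CatBoostEncoder uses an ordered (expanding-mean) scheme, so the encoding
# depends on row order; shuffle once in the data preparation step.
X = pd.DataFrame({"cat": ["a", "a", "b", "b", "a", "b"]})
y = pd.Series([1, 0, 1, 1, 0, 0])
X, y = shuffle(X, y, random_state=42)
X_encoded = CatBoostEncoder(cols=["cat"]).fit_transform(X, y)
```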
I will have a better look at it later.
Thank you for the feedback. The repo is still somewhat raw and I will add more info about versions and experiment settings later, but the results can already be used.
About the notes:
It would also be interesting to explain why a single internal cross-validation improves AUC of BackwardDifferenceEncoder (surprising) when it does not improve AUC of HelmertEncoder and SumEncoder (not surprising). It must be because of the different formulation of the contrasts.
Also, I do not understand why the internal cross-validation improves AUC of OrdinalEncoder. Is it because it places rare values into a "single basket" called "new value"?
It would also be interesting to use the explanation from the previous two paragraphs to explain why the internal cross-validation helps a lot on some datasets (e.g.: poverty_B) while it does not help on other datasets (e.g.: poverty_A). Of course, some of the differences can be explained by chance. But then some measure of accuracy should be provided (like percentiles in a boxplot...).
> It would also be interesting to explain why a single internal cross-validation improves AUC of BackwardDifferenceEncoder (surprising) when it does not improve AUC of HelmertEncoder and SumEncoder (not surprising). It must be because of the different formulation of the contrasts.
It might be a result of: (1) better hyperparameter optimization, i.e. during Single Validation the validation folds are encoded the same way as the test dataset, so they become more similar and the number of trees becomes optimal (at least more optimal than during None Validation); I ran experiments for two datasets and the number of trees under Single and None Validation was different; (2) bigger diversity across folds.
I think the same logic might apply to all kinds of encoders (even TF-IDF).
> Also, I do not understand why the internal cross-validation improves AUC of OrdinalEncoder. Is it because it places rare values into a "single basket" called "new value"?
I think it's because of "new" categories. During Single or None Validation, the model does not know how to deal with new categories in the test set. So Double Validation is a sort of augmentation/regularization of the train data.
I combined my findings in the article.
Notes to the referenced article:
> Category representation — Sum Encoding ()
Empty braces.
Instead of y+ there is n in the denominator. What is n? Say it in the article and include a reference.
Questions:
> Empty braces.
Corrected.
> What is n? Say it in the article and include a reference.
It was introduced at the beginning of the section.
Shouldn't the frequencies in FrequencyEncoder be normalized by the data set size?
It doesn't matter for tree-based models. For other cases, it should be normalized, or log-transformed and then normalized.
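A tiny example of the two variants; dividing by the dataset size is a monotone rescaling, so tree splits are unchanged, while non-tree models do see a different scale:

```python
import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "c"])
counts = s.map(s.value_counts())  # raw counts per category: 3, 3, 3, 2, 2, 1
freqs = counts / len(s)           # normalized by the dataset size (here 6)
# Any split threshold on `counts` has an equivalent threshold on `freqs`.
```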
Hi contributors, I wonder what the timeline is for this specific implementation. I would love to include it in an existing pipeline. Thanks!
Dear maintainers, would you please consider adding to the TargetEncoder module the ability to compute target means by category in an out-of-fold fashion, using a custom folds generator?
That way, at the fitting stage we would not just compute a single smoothing value for a given category of a given column using all available rows. Instead, we would iterate over the y array fold by fold, computing the mean/smoothing on each training fold and using it for the rows of the corresponding test fold. This could really help reduce overfitting by introducing variability while still conforming to the peculiarities of the underlying data.
Coding-wise, I think it could be implemented by modifying just one procedure, namely target_encode in target_encoder.py: tmp[val]['smoothing'] would become a list instead of a scalar value, and we would need to add two loops along these lines, one computing the smoothing over each training fold and one applying it to the rows of the corresponding test fold (a rough sketch follows below).
folding=KFold(1) would then become an additional parameter in the TargetEncoder object's signature, with the default behaving exactly as it does now (i.e. using the whole set at once, as one big fold including all the data). And for those who need it, a more advanced KFold will be available as well, should they specify more folds or even a custom generator )
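This is not the actual target_encoder.py code, but a self-contained sketch of the proposed behaviour; the shrinkage formula is the usual count-based smoothing toward the prior, the parameter names are illustrative, and folds=None plays the role of KFold(1) (one pass over all rows), since sklearn's KFold requires at least two splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_smoothed_target_means(cats, y, folds=None, min_samples_leaf=1, smoothing=1.0):
    """Per-row target means shrunk toward the global prior. With `folds`, each
    row's statistic is computed only on the other folds (a list of per-fold
    smoothings per category); with folds=None the whole set is used at once."""
    cats = pd.Series(cats).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    out = pd.Series(np.nan, index=y.index)
    idx = np.arange(len(y))
    splits = folds.split(idx) if folds is not None else [(idx, idx)]
    for train_idx, test_idx in splits:
        prior = y.iloc[train_idx].mean()
        stats = y.iloc[train_idx].groupby(cats.iloc[train_idx]).agg(["mean", "count"])
        # Rare categories are pulled toward the prior, frequent ones toward their own mean.
        weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
        smoothed = prior * (1.0 - weight) + stats["mean"] * weight
        out.iloc[test_idx] = cats.iloc[test_idx].map(smoothed).fillna(prior).values
    return out

# e.g. encoded = oof_smoothed_target_means(X["color"], y,
#                                          folds=KFold(n_splits=5, shuffle=True, random_state=0))
```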
Do you think this is feasible to implement?