skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License

FEAT Develop the `AggJoiner` and `AggTarget` #734

Open Vincent-Maladiere opened 1 year ago

Vincent-Maladiere commented 1 year ago

This meta-issue is a roadmap for the developments of the AggJoiner and AggTarget estimators, currently implemented in #600.

We need to merge the following PRs before tackling any of the developments below.

This is not a personal roadmap. Anyone is welcome to contribute! 🙂

Any other ideas?

TheooJ commented 8 months ago

I'm working on adding the key argument, and the MultiAggJoiner :)

MaxHalford commented 8 months ago

Hey! Would having Bayesian means be useful? It seems to me the spirit of skrub is to provide good defaults to users. One thing that can happen with AggTarget and AggJoiner out of the box is that aggregates get computed on small groups, which can lead to overfitting.

jeromedockes commented 8 months ago

shrinking towards the overall aggregate across groups sounds like a useful option to add

jeromedockes commented 8 months ago

I guess it only applies for some of the aggregation operations

MaxHalford commented 8 months ago

> I guess it only applies for some of the aggregation operations

My initial understanding of skrub is that it should provide good defaults to users. It's nice that AggJoiner and AggTarget can compute a variety of statistics, but in practice (e.g. Kaggle) it's more or less sufficient to compute the mean. So if these classes are going to be used as part of TableVectorizer, maybe the default could be a mean that shrinks towards the overall mean. I think this would be a good pit of success for most users.

Vincent-Maladiere commented 8 months ago

Hi @MaxHalford, that sounds like a good idea, and it is something scikit-learn's TargetEncoder already does. We should create a small benchmark to explore this idea in AggJoiner and AggTarget.
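For reference, the shrinkage idea discussed above can be sketched in a few lines of plain Python. This is only an illustration, not skrub or scikit-learn code: the function name `shrunken_group_means` and the smoothing parameter `m` are hypothetical, and `m` is assumed to behave like a number of pseudo-observations placed at the overall mean.

```python
from collections import defaultdict


def shrunken_group_means(keys, values, m=10.0):
    """Per-group mean shrunk towards the overall mean.

    `m` is a hypothetical smoothing strength: each group is treated as if
    it had `m` extra observations equal to the overall mean, so small
    groups are pulled strongly towards it and large groups barely move.
    """
    if len(keys) != len(values) or not values:
        raise ValueError("keys and values must be non-empty and the same length")
    overall_mean = sum(values) / len(values)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for key, value in zip(keys, values):
        sums[key] += value
        counts[key] += 1
    return {
        key: (sums[key] + m * overall_mean) / (counts[key] + m)
        for key in sums
    }


keys = ["a", "a", "a", "a", "b"]
values = [1.0, 1.0, 1.0, 1.0, 11.0]
smoothed = shrunken_group_means(keys, values, m=1.0)
```

Here the overall mean is 3.0. Group "b" has a single row, so its estimate moves a long way from its raw mean of 11.0, while group "a" with four rows stays close to its raw mean of 1.0.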

TheooJ commented 7 months ago

I'd be interested in working on screening once I'm done with the multi joiners. I think it's a feature that might also be useful for estimators other than the Joiners. If you think that's the case, we can open a new issue on this topic.

jeromedockes commented 5 months ago

As discussed with @TheooJ, the AggTarget does not implement cross-fitting (see e.g. the target encoder doc), which can cause serious overfitting of the downstream estimator. Moreover, shrinking/smoothing is probably important when some values in the joining column have few matches.
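To make the cross-fitting point concrete, here is a minimal sketch in plain Python. It is not how AggTarget or scikit-learn implement it: the function name `cross_fitted_target_mean`, the interleaved fold assignment (no shuffling) and the fallback to the training-fold mean for unseen keys are all assumptions chosen for brevity. The key property is that each row is encoded using only the targets of rows in the other folds.

```python
from collections import defaultdict


def cross_fitted_target_mean(keys, targets, n_folds=5):
    """Encode each row with the target mean of its key, computed out-of-fold.

    Rows are assigned to folds by position (row i goes to fold i % n_folds),
    a simplification; a real implementation would typically shuffle.
    """
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(keys)
    encoded = [0.0] * n
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Aggregate targets using only rows outside the held-out fold.
        for i in range(n):
            if i not in held_out:
                sums[keys[i]] += targets[i]
                counts[keys[i]] += 1
        train_n = sum(counts.values())
        fallback = sum(sums.values()) / train_n if train_n else 0.0
        for i in held_out:
            key = keys[i]
            # Keys unseen in the training folds fall back to the training mean.
            encoded[i] = sums[key] / counts[key] if counts.get(key) else fallback
    return encoded


keys = ["a", "a", "b", "b"]
targets = [0.0, 1.0, 10.0, 20.0]
encoded = cross_fitted_target_mean(keys, targets, n_folds=2)
```

In this toy example, row 0 (key "a", target 0.0) is encoded as 1.0, the target of the other "a" row, so its own target never leaks into its feature. Shrinkage as discussed earlier in the thread could be combined with this by smoothing the per-fold means.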