skrub-data / skrub

Prepping tables for machine learning
https://skrub-data.org/
BSD 3-Clause "New" or "Revised" License
1.22k stars 97 forks source link

Add a threshold to DropColumnIfNull #1147

Open rcap107 opened 3 days ago

rcap107 commented 3 days ago

Problem Description

The current version of DropColumnIfNull automatically drops any columns where all values are null. However, it would be good to have a threshold so that all columns that have a fraction of null values larger than a given threshold are dropped instead.

Feature Description

The DropColumnIfNull object should be modified so that it can take a "threshold" parameter, and then drop columns whose null fraction is above this threshold.

The value of the threshold should be 1.0 by default (i.e., all columns must be null to drop), and there should be a parameter for it in the TableVectorizer that the user can modify to tweak this.

Something to note is that the user may want to have different thresholds for different columns, which would complicate things.

Alternative Solutions

I am not sure whether we should have two separate objects like DropColumnIfNull and DropNullsAboveThreshold (tentative name), or if DropColumnIfNull should stay and get updated with the threshold.

We might want to have two different objects, one that takes a single value for the threshold, and one that allows for more granularity. Though, at that point it might just be better going with a TableVectorizer with specific transformers for each column :thinking:

Additional Context

No response

Vincent-Maladiere commented 2 days ago

Hey @rcap107! We can add this parameter to the current DropNull object and rename it slightly IMO. Then, you could have a single parameter like drop_null_frac in the TableVectorizer, which is set to 1.0 by default (and None would mean no drop mechanism). Ideally, I would rather avoid having 2 separate input parameters in the TableVectorizer for this.

I guess having a single threshold for all columns is fine for now to keep things simple. WDYT?

rcap107 commented 1 day ago

Agreed, the first version should definitely be just a single threshold that's used for all columns, and that can take a single default value.

I still think that having an object that does the filtering with different thresholds for each column would be useful, but I don't see that as being a "default" part of the TV. Something for a different PR!

rcap107 commented 1 day ago

First draft in #1149