mlr-org / mlr3filters

Filter-based feature selection for mlr3
https://mlr3filters.mlr-org.com
GNU Lesser General Public License v3.0
20 stars 8 forks source link

FilterAUC and missing values #60

Open mllg opened 5 years ago

mllg commented 5 years ago

FilterAUC operates on features with missing values by just ranking the missing values last (default in rank()). I'm not sure that this is statistically sound.

I'd suggest removing them and calculate the AUC on the remaining observations.

@berndbischl @pat-s ?

pat-s commented 5 years ago

Could this also apply to more filters than just FilterAUC?

berndbischl commented 5 years ago

why dont we throw an error? if NAs are there. also we seem to be missing a generic test. and we need to clearly doc / decide what happens in these cases for all filters

berndbischl commented 5 years ago

i guess ignoring the NA-based obs in the calculation is "cleanest" and most robust as michel suggested. but we should probably then do this in a global place, unit test this properly and also document it visibly

mllg commented 5 years ago

a3e43f9ebe9f79059bfc8b423f8583a7b3c12a94 replaced Metrics and ignores NAs, but we still need tests and check the behaviour of other filters.

berndbischl commented 5 years ago

Ignoring Nas is actually wrong after thinking about this more pls don't merge / release this without further discussion

berndbischl commented 5 years ago

If you have a feature with 98% missing values and for the remainder there is a high or perfect correlation with the target that feature would get a very high score. That's wrong?

Nas should be an error for filters. And users should transparently impute them.

Agreed?

pat-s commented 4 years ago

@mllg Looking at ?mlr3measures::auc(), the NA value is NaN. Is this something you added in the meantime which fixes the initial issue or does this have the same effect? (i.e. ordering the NaN features last).

mllg commented 4 years ago

@mllg Looking at ?mlr3measures::auc(), the NA value is NaN. Is this something you added in the meantime which fixes the initial issue or does this have the same effect? (i.e. ordering the NaN features last).

NaN is the return value if you cannot calculate the measure (div/0 etc). Having NA in truth or response always results in an error.

For filters, Bernd suggested throwing an error. I assume this is the safest way to deal with this. If we want to allow missing values by just removing them (as FilterVariance currently does), this should not be the default behavior.

pat-s commented 4 years ago

Right now it seems like NAs are removed prior to score calculation

https://github.com/mlr-org/mlr3filters/blob/13988d396181c3d75cb344f5ce1f11d11e5a3910/R/FilterAUC.R#L42-L46

I would add an assertion which checks for NAs in any feature and apply this to every filter with a descriptive error message to use a pipeop to impute these values?