Open mllg opened 5 years ago
Could this also apply to more filters than just `FilterAUC`?
Why don't we throw an error if NAs are present? We also seem to be missing a generic test, and we need to clearly decide and document what happens in these cases for all filters.
I guess ignoring the observations with NAs in the calculation is the "cleanest" and most robust option, as Michel suggested. But we should then do this in a global place, unit test it properly, and document it visibly.
a3e43f9ebe9f79059bfc8b423f8583a7b3c12a94 replaced Metrics and ignores NAs, but we still need tests and need to check the behaviour of the other filters.
Ignoring NAs is actually wrong, after thinking about this more. Please don't merge / release this without further discussion.
If you have a feature with 98% missing values, and for the remainder there is a high or perfect correlation with the target, that feature would get a very high score. Isn't that wrong?
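To make the concern concrete, here is a minimal base R sketch. It is not how any filter in the package is implemented; it only assumes a correlation-style score that silently drops incomplete observations (`use = "complete.obs"`), and the data is simulated.

```r
set.seed(1)
n <- 1000
y <- rnorm(n)                      # numeric target
x <- rep(NA_real_, n)              # feature, initially all missing
observed <- sample(n, 20)          # only 2% of the rows are observed
x[observed] <- 2 * y[observed]     # perfectly correlated on those rows

frac_missing <- mean(is.na(x))     # share of missing values in the feature

# Assumption: the score is an absolute correlation with NAs dropped silently.
score <- abs(cor(x, y, use = "complete.obs"))
print(c(frac_missing = frac_missing, score = score))
```

The feature gets the best possible score although it carries information for only 20 of 1000 observations.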
NAs should be an error for filters, and users should transparently impute them.
Agreed?
@mllg Looking at `?mlr3measures::auc()`, the NA value is `NaN`. Is this something you added in the meantime which fixes the initial issue, or does this have the same effect (i.e. ordering the NaN features last)?
`NaN` is the return value if you cannot calculate the measure (div/0 etc). Having `NA` in `truth` or `response` always results in an error.
For filters, Bernd suggested throwing an error. I assume this is the safest way to deal with this. If we want to allow missing values by just removing them (as `FilterVariance` currently does), this should not be the default behavior.
Right now it seems like NAs are removed prior to score calculation.
I would add an assertion that checks for NAs in any feature and apply it to every filter, with a descriptive error message that tells the user to impute these values with a PipeOp.
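A rough sketch of what such an assertion could look like, in base R only. The helper name `assert_no_missing_features()` and its arguments are hypothetical and not part of the package; the wording of the error message is just a suggestion.

```r
# Hypothetical helper: stop with a descriptive message if any feature has NAs.
# `data` is a data.frame of features, `filter_id` is only used for the message.
assert_no_missing_features <- function(data, filter_id) {
  n_missing <- vapply(data, function(col) sum(is.na(col)), integer(1))
  bad <- names(n_missing)[n_missing > 0]
  if (length(bad) > 0) {
    stop(sprintf(
      "Filter '%s' does not support missing values, but found NAs in: %s. Impute them first, e.g. with an imputation PipeOp.",
      filter_id, paste(bad, collapse = ", ")
    ), call. = FALSE)
  }
  invisible(data)
}

features <- data.frame(a = c(1, 2, NA), b = c(0.5, 0.1, 0.3))
msg <- tryCatch(
  assert_no_missing_features(features, "auc"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Calling the helper once at the start of every filter's calculation would give all filters the same behaviour and the same message.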
`FilterAUC` operates on features with missing values by just ranking the missing values last (default in `rank()`). I'm not sure that this is statistically sound. I'd suggest removing them and calculating the AUC on the remaining observations.
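A small base R illustration of how much the two options can differ. This is a sketch and not the code used by `FilterAUC`: the helper `auc_from_ranks()` is made up for this example and computes the AUC with the rank-sum (Mann-Whitney) formula, and `na.last = TRUE` is passed explicitly so that missing values are ranked last.

```r
# Illustrative helper: AUC from the rank sum of the positive class.
auc_from_ranks <- function(x, positive) {
  r <- rank(x, na.last = TRUE)   # NAs are explicitly ranked last
  n_pos <- sum(positive)
  n_neg <- sum(!positive)
  (sum(r[positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

x     <- c(0.9, 0.8, NA, 0.3, NA, 0.1)
truth <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Option 1: keep the NAs and let them take the highest ranks.
auc_na_last <- auc_from_ranks(x, truth)

# Option 2: drop incomplete observations, then compute the AUC.
keep <- !is.na(x)
auc_complete <- auc_from_ranks(x[keep], truth[keep])

print(c(na_last = auc_na_last, complete = auc_complete))
```

On this toy data, ranking the NAs last gives an AUC of 7/9, while removing them gives 1, so the choice directly changes the filter score.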
@berndbischl @pat-s ?