pa-nathaniel opened 1 year ago
If we do want to allow NA factor values in training, one way might be to explicitly convert all NA values in training factor columns to the string 'missing' ('NA' is also an option, but I fear people will confuse the string 'NA' with an actual NA value).
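For illustration, a minimal sketch of that conversion (the `na_to_missing` helper is hypothetical, not part of FFTrees):

```r
# Sketch: recode NA entries in all factor columns as an explicit "missing" level.
# (Illustrative only; not FFTrees code.)
na_to_missing <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], function(f) {
    f <- addNA(f, ifany = TRUE)                  # make NA a real factor level
    levels(f)[is.na(levels(f))] <- "missing"     # rename that level
    f
  })
  df
}
```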
Out of curiosity, I tried running the same arguments through rpart::rpart and it runs, but I also see a note about deleting an observation due to missingness:
```r
rpart::rpart(formula = crit ~ .,
             data = data)
# n=3 (1 observation deleted due to missingness)
#
# node), split, n, deviance, yval
#       * denotes terminal node
#
# 1) root 3 0.6666667 0.6666667 *
```
There are some interesting discussions in https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf as well.
Thanks for raising this topic again! I've been planning to address the problem of missing inputs for a while, but have always been distracted by more immediate issues.
Based on the ways in which FFTrees treats factor levels, replacing NA values by some other string should work just fine for factor values (and I'll check this out asap). However, I'd like to find a solution that works for numeric predictors as well.

When using an existing FFT for making decisions/predictions, a missing value for the current cue or node could simply mean "go to/look for the next cue or node". But I have yet to ponder the implications for constructing trees (or determining cue thresholds) and for missing values at final cues (which could either require some default decision or motivate the creation of a 3rd "do not know" category).
PS: Dealing with instances of not knowing is becoming a prominent issue in AI and machine learning research. Some pointers include:

- Kompa, B., Snoek, J., & Beam, A. L. (2021). Second opinion needed: Communicating uncertainty in medical machine learning. npj Digital Medicine, 4, 4. https://doi.org/10.1038/s41746-020-00367-3
- Hendrickx et al. (2021). Machine learning with a reject option: A survey. https://doi.org/10.48550/arXiv.2107.11277
- Thulasidasan et al. (2019). Knows when it doesn't know: Deep abstaining classifiers. https://openreview.net/forum?id=rJxF73R9tX
Thanks Hans! Yes, it would be great to unblock the issue for NA factor values, as this will be a super common occurrence in real-world data. Treating NA values as their own category, both for training and for predicting in the final trees, feels right to me.
For numeric data, my gut says that during training we should just ignore NA values when determining thresholds and directions. Then during prediction, if a record encounters a numeric node but has an NA value, just don't allow it to exit and force it to move down the tree (see the sketch below).
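For concreteness, a hypothetical sketch of that prediction rule (`apply_fft` and the node fields are illustrative assumptions, not the FFTrees internals):

```r
# Apply one FFT to a single case; an NA cue value never triggers an exit,
# so the case falls through to the next node. (Illustrative only.)
apply_fft <- function(case, nodes) {
  n <- length(nodes)
  for (i in seq_len(n)) {
    node <- nodes[[i]]                   # assumed: list(cue, direction, threshold, exit)
    v <- case[[node$cue]]
    if (is.na(v)) {
      if (i < n) next                    # NA: don't exit; move down the tree
      return(NA)                         # NA at the final node remains an open question
    }
    hit <- if (node$direction == ">") v > node$threshold else v <= node$threshold
    if (hit) return(node$exit)           # exit with this node's decision
    if (i == n) return(!node$exit)       # final node: take the opposite exit
  }
}
```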
I'm happy to put in a PR with some ideas this weekend!
I fully agree — and trusting your gut feelings is usually a good idea. So yes, PR whatever you have — I'd be happy to help when I can.
Thanks, Nathaniel, for your PRs on this issue! As I was chasing another bug simultaneously (which led to perfect FFTs not being found, since grow_tree == FALSE for cases of perfect performance was later over-written by the stopping.rule evaluation), I only saw them today and merged all changes.
Although removing NA cases from numeric predictors (and corresponding criterion values) seems to work, I'm still getting lots of warnings (from fftrees_threshold_numeric_grid(), for unequal vector lengths).
I have yet to fully understand the consequences of ignoring numeric NA values in this way. Won't the current solution create new problems when the corresponding cue is the final node of an FFT?

Also, it would be neat to find a solution for missing criterion values. Intuitively, this would require a 3rd exit, wouldn't it? (We could simply filter/delete the corresponding rows of data, of course, but that would throw away valuable information and seem like cheating.)
PS: I've added some infrastructure for selectively enabling/disabling NA handling (for predictors vs. criterion) and for user feedback on missing data, but this isn't very refined yet.
Including NA values in categorical (character, factor, or logical) predictors seems to work pretty well now, so the primary issue here could be considered solved.

However, I'm still trying to understand the consequences of enabling NA values in numeric predictors.
Here's a reprex (run with FFTrees v1.9.0.9014) that raises some questions:
```r
library(FFTrees)

# Data:
data_train <- data.frame(crt = c(FALSE, FALSE, TRUE, TRUE),
                         p_1 = c(NA, 2, 3, 4))
data_test  <- data.frame(crt = c(FALSE, FALSE, TRUE, TRUE),
                         p_1 = c(1, 2, 3, NA))

# Create FFTs:
x <- FFTrees(crt ~ .,
             data = data_train,
             data.test = data_test,
             do.comp = FALSE)

# Results:
summary(x)
plot(x, what = "cues")
plot(x, data = "train")
x$trees$decisions$train
plot(x, data = "test")
x$trees$decisions$test
```
On the positive side, it's nice that FFTrees now runs despite the NA case in a numeric predictor. And while the results do not seem unreasonable, I have yet to understand some details:
Questions:

1. Why is the chosen criterion threshold value $> 3$, rather than $> 2$ (leading to the 3rd case being erroneously classified as FALSE in training)?
2. Why are the NA cases of data_train and data_test classified as FALSE?
3. What feedback should the user obtain on what was done/decided when a predictor value was NA?
A status update (on the discussion above and the reprex results):

As it turns out, allowing for NA values in numeric predictors is more complicated than we thought. Adding some diagnostic and user-feedback options now shows that the previous removal of non-finite cases from classtable() did not affect the initial cue evaluation at all, but instead removed the corresponding cases from FFT construction. (The warnings were caused by cue evaluation collapsing two vectors of unequal length to compute the frequency counts of a 2x2 classification table. As NA values were dropped from only one vector, the correspondence between the two vectors shifted, potentially leading to grave errors.)
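A toy illustration of that misalignment (assumed for exposition; not the actual package code):

```r
# Dropping NA cases from the cue-based predictions but not from the criterion
# misaligns the two vectors before the 2x2 frequency counts are computed.
crit <- c(TRUE, TRUE, FALSE, FALSE)
cue  <- c(NA,   3,    1,     2)

pred <- cue[!is.na(cue)] > 2        # length 3: the NA case was dropped here only
sum(pred & crit)                    # recycling warning: lengths 3 vs. 4 differ

# Consistent handling drops the same cases from BOTH vectors first:
keep <- !is.na(cue)
sum((cue[keep] > 2) & crit[keep])   # aligned counts for the 2x2 table
```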
Hence, to handle NA values consistently, we need to explicitly deal with them whenever 2x2 tables are being formed, i.e., when:

1. evaluating cues (determining cue thresholds and directions),
2. constructing trees (growing FFTs from training data), and
3. applying trees (deciding/predicting with an existing FFT).
The current version (v1.9.0.9015) includes some code and options for 1. and 2. (and limits exclusions to NA cases, rather than any non-finite values). Essentially, NA cases in numeric predictors are now ignored in (or dropped from) both cue evaluation and tree construction. This seems to work so far (and the reprex above now yields a perfect tree with threshold value $> 2$ in training, but only 75% accuracy in testing, due to misclassifying the missing test case as FALSE).
This still leaves the issue of 3. to be addressed. It's here that we need to specify what to do (what to decide/predict) when encountering an NA while applying an FFT.
@hneth has this issue been addressed with https://github.com/ndphillips/FFTrees/pull/178? If so, can we close this?
@hneth pinging you again, see above
When I try running FFTrees() on a training dataset with a factor column that contains an NA value, I see an ungraceful error. Reproducible example below:
Returns:
Desired behavior: an informative error message (like "Column {X} in the training data has {Y} NA values. NA values are not allowed in training"), or no error at all, allowing the trees to be built.
I believe the source of the error is here: https://github.com/ndphillips/FFTrees/blob/5978e81c192c5a89ec37242c4884a00fa78127b1/R/fftrees_threshold_factor_grid.R#L52
I am using FFTrees 1.9.0.
Is it on principle that we don't want to allow factors with NA values in training, or is this a bug? It's been a while since I've thought about the algorithm, but my recollection is that we can treat NA as its own (perfectly valid) factor value and include it in the tree definitions. Does that sound right?