pa-nathaniel opened 1 year ago
If we do want to allow NA factor values in training, one way might be to explicitly convert all NA values in training factor columns to the string 'missing' ('NA' is also an option, but I fear people will confuse the string 'NA' with an actual NA value).
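For illustration, a minimal sketch of that conversion (the `na_to_missing` helper is hypothetical, not part of FFTrees):

```r
# Sketch: recode NA entries in all factor columns as an explicit "missing" level.
# (Illustrative only; not FFTrees code.)
na_to_missing <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], function(f) {
    f <- addNA(f, ifany = TRUE)                  # make NA a real factor level
    levels(f)[is.na(levels(f))] <- "missing"     # rename that level
    f
  })
  df
}
```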
Out of curiosity, I tried running the same arguments through rpart::rpart and it runs, but I also see a note about deleting an observation due to missingness:
```r
rpart::rpart(formula = crit ~ .,
             data = data)
# n=3 (1 observation deleted due to missingness)
#
# node), split, n, deviance, yval
#       * denotes terminal node
#
# 1) root 3 0.6666667 0.6666667 *
```
There are some interesting discussions in https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf as well.
Thanks for raising this topic again! I've been planning to address the problem of missing inputs for a while, but have always been distracted by more immediate issues.
Based on the ways in which FFTrees treats factor levels, replacing NA values by some other string should work just fine for factor values (and I'll check this out asap). However, I'd like to find a solution that works for numeric predictors as well.

When using an existing FFT for making decisions/predictions, a missing value for the current cue or node could simply mean "go to/look for the next cue or node". But I have yet to ponder the implications for constructing trees (or determining cue thresholds) and for missing values at final cues (which could either require some default decision or motivate the creation of a 3rd "do not know" category).
PS: Dealing with instances of not knowing is becoming a prominent issue in AI and machine learning research. Some pointers include:

- Kompa, B., Snoek, J., & Beam, A. L. (2021). Second opinion needed: Communicating uncertainty in medical machine learning. npj Digital Medicine, 4, 4. https://doi.org/10.1038/s41746-020-00367-3
- Hendrickx et al. (2021). Machine learning with a reject option: A survey. https://doi.org/10.48550/arXiv.2107.11277
- Thulasidasan et al. (2019). Knows when it doesn't know: Deep abstaining classifiers. https://openreview.net/forum?id=rJxF73R9tX
Thanks Hans! Yes, it would be great to unblock the issue for NA factor values, as this will be a super common occurrence in real-world data. Treating NA values as their own category, both for training and for predicting in the final trees, feels right to me.
For numeric data, my gut says that during training we should just ignore NA values when determining thresholds and directions. Then during prediction, if a record encounters a numeric node but has an NA value, just don't allow it to exit and force it to move down the tree (see the sketch below).
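For concreteness, a hypothetical sketch of that prediction rule (`apply_fft` and the node fields are illustrative assumptions, not the FFTrees internals):

```r
# Apply one FFT to a single case; an NA cue value never triggers an exit,
# so the case falls through to the next node. (Illustrative only.)
apply_fft <- function(case, nodes) {
  n <- length(nodes)
  for (i in seq_len(n)) {
    node <- nodes[[i]]                   # assumed: list(cue, direction, threshold, exit)
    v <- case[[node$cue]]
    if (is.na(v)) {
      if (i < n) next                    # NA: don't exit; move down the tree
      return(NA)                         # NA at the final node remains an open question
    }
    hit <- if (node$direction == ">") v > node$threshold else v <= node$threshold
    if (hit) return(node$exit)           # exit with this node's decision
    if (i == n) return(!node$exit)       # final node: take the opposite exit
  }
}
```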
I'm happy to put in a PR with some ideas this weekend!
I fully agree — and trusting your gut feelings is usually a good idea. So yes, PR whatever you have — I'd be happy to help when I can.
Thanks, Nathaniel, for your PRs on this issue! As I was chasing another bug simultaneously (which led to perfect FFTs not being found, since grow_tree == FALSE for cases of perfect performance was later over-written by the stopping.rule evaluation), I only saw them today and merged all changes.
Although removing NA cases from numeric predictors (and corresponding criterion values) seems to work, I'm still getting lots of warnings (from fftrees_threshold_numeric_grid(), for unequal vector lengths).
I have yet to fully understand the consequences of ignoring numeric NA values in this way. Won't the current solution create new problems when the corresponding cue is the final node of an FFT?

Also, it would be neat to find a solution for missing criterion values. Intuitively, this would require a 3rd exit, wouldn't it? (We could simply filter/delete the corresponding rows of data, of course, but that would throw away valuable information and seem like cheating.)
PS: I've added some infrastructure for selectively enabling/disabling NA handling (for predictors vs. criterion) and for user feedback on missing data, but this isn't very refined yet.
Including NA values in categorical (character, factor, or logical) predictors seems to work pretty well now, so the primary issue here could be considered solved.

However, I'm still trying to understand the consequences of enabling NA values in numeric predictors.
Here's a reprex (run with FFTrees v1.9.0.9014) that raises some questions:
```r
library(FFTrees)

# Data:
data_train <- data.frame(crt = c(FALSE, FALSE, TRUE, TRUE),
                         p_1 = c(NA, 2, 3, 4))
data_test  <- data.frame(crt = c(FALSE, FALSE, TRUE, TRUE),
                         p_1 = c(1, 2, 3, NA))

# Create FFTs:
x <- FFTrees(crt ~ .,
             data = data_train,
             data.test = data_test,
             do.comp = FALSE)

# Results:
summary(x)
plot(x, what = "cues")
plot(x, data = "train")
x$trees$decisions$train
plot(x, data = "test")
x$trees$decisions$test
```
On the positive side, it's nice that FFTrees now runs despite the NA case in a numeric predictor. And while the results do not seem unreasonable, I have yet to understand some details:
Questions:

1. Why is the chosen criterion threshold value $> 3$, rather than $> 2$ (leading to the 3rd case being erroneously classified as FALSE in training)?
2. Why are the NA cases of data_train and data_test classified as FALSE?
3. What feedback should the user obtain on what was done/decided when a predictor value was NA?
A status update (on the discussion above and the reprex results):

As it turns out, allowing for NA values in numeric predictors is more complicated than we thought. Adding some diagnostic and user-feedback options now shows that the previous removal of non-finite cases from classtable() did not affect the initial cue evaluation at all, but instead removed the corresponding cases from FFT construction. (The warnings were caused by cue evaluation collapsing two vectors of unequal length to compute the frequency counts of a 2x2 classification table. As NA values were dropped from only one vector, the correspondence between the two vectors shifted, potentially leading to grave errors.)
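A toy illustration of that misalignment (assumed for exposition; not the actual package code):

```r
# Dropping NA cases from the cue-based predictions but not from the criterion
# misaligns the two vectors before the 2x2 frequency counts are computed.
crit <- c(TRUE, TRUE, FALSE, FALSE)
cue  <- c(NA,   3,    1,     2)

pred <- cue[!is.na(cue)] > 2        # length 3: the NA case was dropped here only
sum(pred & crit)                    # recycling warning: lengths 3 vs. 4 differ

# Consistent handling drops the same cases from BOTH vectors first:
keep <- !is.na(cue)
sum((cue[keep] > 2) & crit[keep])   # aligned counts for the 2x2 table
```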
Hence, to handle NA values consistently, we need to explicitly deal with them whenever 2x2 tables are being formed, i.e., when:

1. evaluating cues (determining cue thresholds and directions),
2. constructing trees (growing FFTs from training data), and
3. applying trees (deciding/predicting with an existing FFT).
The current version (v1.9.0.9015) includes some code and options for 1. and 2. (and limits exclusions to NA cases, rather than any non-finite values). Essentially, NA cases in numeric predictors are now ignored in (or dropped from) both cue evaluation and tree construction. This seems to work so far (and the reprex above now yields a perfect tree with threshold value $> 2$ in training, but only 75% accuracy in testing, due to misclassifying the missing test case as FALSE).
This still leaves the issue of 3. to be addressed. It's here that we need to specify what to do (what to decide/predict) when encountering an NA while applying an FFT.
@hneth has this issue been addressed with https://github.com/ndphillips/FFTrees/pull/178? If so, can we close this?
@hneth pinging you again, see above
When I try running FFTrees() on a training dataset with a factor column that contains an NA value, I see an ungraceful error. Reproducible example below:
Returns:
Desired behavior: an informative error message (like "Column {X} in the training data has {Y} NA values. NA values are not allowed in training"), or no error at all, allowing the trees to be built.
I believe the source of the error is here: https://github.com/ndphillips/FFTrees/blob/5978e81c192c5a89ec37242c4884a00fa78127b1/R/fftrees_threshold_factor_grid.R#L52
I am using FFTrees 1.9.0.
Is it on principle that we don't want to allow factors with NA values in training, or is this a bug? It's been a while since I've thought about the algorithm, but my recollection is that we can treat NA as its own (perfectly valid) factor value and include it in the tree definitions. Does that sound right?