Closed · gdmcdonald closed this issue 2 years ago
Hi @gdmcdonald,
As I mentioned over in #70, I still think {recipes} is the way to go. In this case, a very simple recipe that only removes missing data should do the trick. Basically, the example from above is exactly the same (except you can remove the na.omit() from where you define dataset) right up to where you define formulas, and then I added this:
# Create a recipe from each formula
recipes <-
  map(formulas, function(form) {
    recipe(form, data = training(trn_tst_split)) %>%
      step_naomit(all_predictors())
  })

# Create workflow set
cancer_workflows <-
  workflow_set(
    preproc = recipes,
    models = list(rf = rf_spec)
  )
During the training loop, each time the recipe is trained it will remove missing data from the analysis() portion of the CV fold, depending on which predictors are in the model formula and the patterns of missing data.
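To make that concrete, here is a minimal sketch of how the workflow set would then be run over the resamples; names like cv_folds and the metric choice are stand-ins for whatever the original example used:

library(tidymodels)
library(workflowsets)

# 5-fold CV on the training portion of the split from the original example
cv_folds <- vfold_cv(training(trn_tst_split), v = 5)

# Fit every workflow (recipe + rf_spec) on the same resamples
cancer_results <-
  cancer_workflows %>%
  workflow_map(
    fn = "fit_resamples",
    resamples = cv_folds,
    metrics = metric_set(roc_auc)
  )

# Compare the candidate formulas
rank_results(cancer_results, rank_metric = "roc_auc")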
Thank you @mattwarkentin, that's a great direction to go in. Unfortunately, all the models with missing data still fail (with or without skip = TRUE in step_naomit()), as the missing rows are still in the CV folds and the testing set, which seems to break it. Do I need to somehow map different CV folds to each model as well?
On further searching, I think these issues are the culprit: https://github.com/tidymodels/tune/issues/181 and https://github.com/imbs-hl/ranger/issues/94.
I would love to know how I can write a wrapper around ranger so that it handles NA values in the normal R way, but any tips on working around the issue in the meantime would be appreciated.
If you use step_naomit(all_predictors(), skip = TRUE), that will remove NA values for the observations you are using for training. I believe the problem at this point is that you end up with missing data in the observations you want to predict on. I get errors like the following (notice (predictions)):
preprocessor 1/1, model 1/24 (predictions): Error: Missing data in columns: Bare.nuclei.
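For what it's worth, the skip behaviour is easy to see outside of tuning. This is just a toy illustration (the data frame here is made up, not the cancer data from the issue):

library(recipes)

# Toy data with a missing predictor value (hypothetical, for illustration only)
toy <- data.frame(
  y  = factor(c("a", "b", "a", "b")),
  x1 = c(1, NA, 3, 4),
  x2 = c(2, 5, 6, 7)
)

rec <-
  recipe(y ~ ., data = toy) %>%
  step_naomit(all_predictors(), skip = TRUE) %>%
  prep()

# The step is applied to the training data during prep() ...
bake(rec, new_data = NULL)   # the row with the NA is dropped

# ... but skipped when baking new data, so the NA row survives
bake(rec, new_data = toy)    # NA row is retained, and ranger errors at predict time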
You've probably noticed the info/advice we give on recipe steps that involve removing rows, including step_naomit():

This step can entirely remove observations (rows of data), which can have unintended and/or problematic consequences when applying the step to new data later via [bake()]. Consider whether skip = TRUE or skip = FALSE is more appropriate in any given use case. In most instances that affect the rows of the data being predicted, this step probably should not be applied at all; instead, execute operations like this outside and before starting a preprocessing [recipe()].
This behavior is for sure by design and a safety consideration. It is a pretty strong assumption of the workflowsets package that you are evaluating the different model configurations on the same data; I don't think we're going to want to support a general option for tuning where it is easy to end up with different datasets for different model configurations.
That being said, you are the boss of your own model evaluation, of course! If I really wanted to do this myself, I would probably create a function that took the training set and a set of predictors as arguments and returned a set of metrics. I would remove NA values in this function before I created the resampling folds. I would then loop/purrr/apply through the sets of predictors to evaluate. For now, this is what we recommend you do in this situation.
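Something along these lines is what that suggestion looks like; a rough, untested sketch where fit_one_set, predictor_sets, and the outcome column name are illustrative, and dataset/rf_spec are the objects from the original example:

library(tidymodels)

# Evaluate one set of predictors: drop NAs only in the columns this model uses,
# build the resamples from that complete-case data, and return CV metrics.
fit_one_set <- function(predictors, data, outcome, model_spec) {
  form <- reformulate(predictors, response = outcome)

  complete_cases <- tidyr::drop_na(data, dplyr::all_of(c(outcome, predictors)))

  folds <- vfold_cv(complete_cases, v = 5)

  wf <-
    workflow() %>%
    add_model(model_spec) %>%
    add_formula(form)

  fit_resamples(wf, resamples = folds, metrics = metric_set(roc_auc)) %>%
    collect_metrics()
}

# predictor_sets: a list of character vectors, one per candidate formula
results <- purrr::map(
  predictor_sets,
  fit_one_set,
  data = dataset,       # the full training data from the original example
  outcome = "Class",    # replace with the actual outcome column
  model_spec = rf_spec
)

Note that, as described above, each set of predictors is then evaluated on its own complete-case data and its own resamples rather than on a shared set of folds.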
Let us know if you have further questions! 🙌
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
At the moment I'm specifying around 1k different model formulas on one input dataframe in a workflow set, and I want it to exclude a data row for a model if that row has missing data in one of the columns required for that particular model, so that the model can run and so that I'm not imputing the missing values. Is there a nice way to do this with a workflow set? Something like an na.action = na.omit option?

I want to run the code without the na.omit on the last line, so that rows are only omitted if the missing value would have been used in the model. The formulas which I am testing in this example (there are thousands of formulas in my real data set) should give a nice comparison of what adding each variable adds to the ROC AUC.
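The original formula list isn't reproduced above, but for illustration only, a nested sequence along these lines (the variable names here are placeholders, not the actual columns) is the kind of comparison being described:

# Purely illustrative; the real issue uses ~1k formulas on the actual data
formulas <- list(
  base   = outcome ~ x1,
  add_x2 = outcome ~ x1 + x2,
  add_x3 = outcome ~ x1 + x2 + x3
)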