Open p-lemercier opened 3 years ago
As you have shown, Solar.R
does contain 7 missing values, and it is this that is making fit_xy()
complain. The error you are seeing is coming from the engine (ranger::ranger
) which requires that there is no missing data in the data being passed to it. Let me go over why fit()
and fit_xy()
don't behave the same in this situation.
When you use any parsnip fit function, the data will be passed around a little bit by some helper functions to turn the data into the format that the engine function needs. In this ranger example, we are using x
and y
to pass the data.
library(parsnip)
rand_forest() %>%
set_engine("ranger") %>%
set_mode("regression") %>%
translate()
#> Random Forest Model Specification (regression)
#>
#> Computational engine: ranger
#>
#> Model fit template:
#> ranger::ranger(x = missing_arg(), y = missing_arg(), case.weights = missing_arg(),
#> num.threads = 1, verbose = FALSE, seed = sample.int(10^5,
#> 1))
This means that when we use fit_xy()
we need to do minimal transformation because the input is in the right format. It will be passed to xy_xy()
and the model will be fit using the data you passed in. Then ranger::ranger()
throws an error because there is missing data.
When you use fit()
instead you are passing the data as a formula, so parsnip will need to turn it from a formula to x
and y
. Along the way parsnip will use .convert_form_to_xy_fit()
which defaults to na.action = na.omit
and it drops any rows with missing data in it, thus stopping you from getting an error.
Does it mean that we should always impute missing values before using fit()
so that rows with missing values aren't dropped by default?
I actually don't think that explanation is quite it; I think we have something more subtle/wrong going on.
If I step through this with the debugger, this line is doing nothing:
https://github.com/tidymodels/parsnip/blob/af3e0aefd6151c30f22f71e848aa34a171ecf7c2/R/convert_data.R#L53
After that line, there is no na.action
on the model.frame()
call. I tried changing, for example, from na.omit
to na.pass
and nothing changed about the behavior. I'm going to call this a bug with the converting functions and see what we can do about this.
Just chiming in that Julia's explanation here seems spot on + equally concerning to me. From my understanding, parsnip silently drops incomplete cases before handing off data to modeling functions in many cases, and this behavior can't be adjusted by the user as .convert_form_to_xy_*
are called internally.
library(parsnip)
data(penguins, package = "modeldata")
nrow(penguins)
#> [1] 344
nrow(penguins[complete.cases(penguins),])
#> [1] 333
converted <- .convert_form_to_xy_fit(bill_length_mm ~ ., data = penguins)
nrow(converted$x)
#> [1] 333
Created on 2022-06-07 by the reprex package (v2.0.1)
This seems to be somewhat well-tested, so I assume this is intentional. Will bring this up in team meeting to talk whether this deserves a fix / improvement / docs / etc. :)
Hi, as Zenggyu previously asked, is there anyone who knows whether missing values always should be removed before using fit() so that rows with missing values aren't dropped by default? If you'd like to keep the missing values, is there any strategy that one could have for it not to throw an error?
I get an error message when trying to fit a model using
fit_xy()
in presence of missing values but not withfit()
.Here is just one example with
rand_forest() %>% set_engine("ranger")
. The same problem occurs withrand_forest() %>% set_engine("randomForest")
orlinear_reg(penalty = 1, mixture = 1) %>% set_engine("glmnet")
.Created on 2021-08-29 by the reprex package (v2.0.1)