tidymodels / parsnip

A tidy unified interface to models
https://parsnip.tidymodels.org
Other
593 stars 88 forks source link

Issue with `fit_xy()` in presence of missing values #548

Open p-lemercier opened 3 years ago

p-lemercier commented 3 years ago

I get an error message when trying to fit a model using fit_xy() in presence of missing values but not with fit().

Here is just one example with rand_forest() %>% set_engine("ranger"). The same problem occurs with rand_forest() %>% set_engine("randomForest") or linear_reg(penalty = 1, mixture = 1) %>% set_engine("glmnet").

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

data(airquality)
summary(airquality)
#>      Ozone           Solar.R           Wind             Temp      
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
#>  NA's   :37       NA's   :7                                       
#>      Month            Day      
#>  Min.   :5.000   Min.   : 1.0  
#>  1st Qu.:6.000   1st Qu.: 8.0  
#>  Median :7.000   Median :16.0  
#>  Mean   :6.993   Mean   :15.8  
#>  3rd Qu.:8.000   3rd Qu.:23.0  
#>  Max.   :9.000   Max.   :31.0  
#> 

mod1 <- rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("regression") %>% 
  fit(Ozone ~ Solar.R + Temp + Wind, data = airquality)

mod2 <- rand_forest() %>% 
  set_engine("ranger") %>% 
  set_mode("regression") %>% 
  fit_xy(x = airquality[,c("Solar.R", "Temp", "Wind")], y = airquality$Ozone)
#> Error: Missing data in columns: Solar.R.
#> Chronométrage arrêté à : 0 0 0.001

Created on 2021-08-29 by the reprex package (v2.0.1)

EmilHvitfeldt commented 3 years ago

As you have shown, Solar.R does contain 7 missing values, and it is this that is making fit_xy() complain. The error you are seeing is coming from the engine (ranger::ranger) which requires that there is no missing data in the data being passed to it. Let me go over why fit() and fit_xy() don't behave the same in this situation.

When you use any parsnip fit function, the data will be passed around a little bit by some helper functions to turn the data into the format that the engine function needs. In this ranger example, we are using x and y to pass the data.

library(parsnip)
rand_forest() %>%  
  set_engine("ranger") %>% 
  set_mode("regression") %>% 
  translate()
#> Random Forest Model Specification (regression)
#> 
#> Computational engine: ranger 
#> 
#> Model fit template:
#> ranger::ranger(x = missing_arg(), y = missing_arg(), case.weights = missing_arg(), 
#>     num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 
#>         1))

This means that when we use fit_xy() we need to do minimal transformation because the input is in the right format. It will be passed to xy_xy() and the model will be fit using the data you passed in. Then ranger::ranger() throws an error because there is missing data.

When you use fit() instead you are passing the data as a formula, so parsnip will need to turn it from a formula to x and y. Along the way parsnip will use .convert_form_to_xy_fit() which defaults to na.action = na.omit and it drops any rows with missing data in it, thus stopping you from getting an error.

zenggyu commented 3 years ago

Does it mean that we should always impute missing values before using fit() so that rows with missing values aren't dropped by default?

juliasilge commented 3 years ago

I actually don't think that explanation is quite it; I think we have something more subtle/wrong going on.

If I step through this with the debugger, this line is doing nothing: https://github.com/tidymodels/parsnip/blob/af3e0aefd6151c30f22f71e848aa34a171ecf7c2/R/convert_data.R#L53 After that line, there is no na.action on the model.frame() call. I tried changing, for example, from na.omit to na.pass and nothing changed about the behavior. I'm going to call this a bug with the converting functions and see what we can do about this.

simonpcouch commented 2 years ago

Just chiming in that Julia's explanation here seems spot on + equally concerning to me. From my understanding, parsnip silently drops incomplete cases before handing off data to modeling functions in many cases, and this behavior can't be adjusted by the user as .convert_form_to_xy_* are called internally.

library(parsnip)

data(penguins, package = "modeldata")

nrow(penguins)
#> [1] 344
nrow(penguins[complete.cases(penguins),])
#> [1] 333

converted <- .convert_form_to_xy_fit(bill_length_mm ~ ., data = penguins)
nrow(converted$x)
#> [1] 333

Created on 2022-06-07 by the reprex package (v2.0.1)

This seems to be somewhat well-tested, so I assume this is intentional. Will bring this up in team meeting to talk whether this deserves a fix / improvement / docs / etc. :)

CarlViggo commented 9 months ago

Hi, as Zenggyu previously asked, is there anyone who knows whether missing values always should be removed before using fit() so that rows with missing values aren't dropped by default? If you'd like to keep the missing values, is there any strategy that one could have for it not to throw an error?