tidyverse / modelr

Helper functions for modelling
https://modelr.tidyverse.org
GNU General Public License v3.0
401 stars 66 forks source link

add_predictions should default to type = "raw" #112

Closed kbodwin closed 10 months ago

kbodwin commented 3 years ago

Hi,

I wanted to open this issue before PR'ing, because maybe there's a good reason for the current setup that I'm not seeing.

Short version: I think add_predictions() should have default argument type = "raw", instead of the current default type = NULL.

Here's why:

parsnip introduces the method for predict.model_fit. If type = NULL is passed to predict.model_fit, it infers the type from the model object. This means that the output of predict() is different depending on if you made your model with tidymodels or not:

library(palmerpenguins)
library(modelr)
library(tidymodels)

my_glm <- stats::glm(formula = sex ~ bill_length_mm, 
                     family = stats::binomial, 
                     data = penguins)

my_tm_glm <- 
  logistic_reg(mode = "classification") %>% 
  set_engine(engine = "glm") %>%
  fit(formula = sex ~ bill_length_mm, 
      data = penguins)

stats::predict(my_tm_glm, penguins) %>% head()
#> # A tibble: 6 x 1
#>   .pred_class
#>   <fct>      
#> 1 female     
#> 2 female     
#> 3 female     
#> 4 <NA>       
#> 5 female     
#> 6 female

stats::predict(my_glm, penguins) %>% head()
#>          1          2          3          4          5          6 
#> -0.6525769 -0.5974964 -0.4873355         NA -0.9830596 -0.6250367

Created on 2020-12-18 by the reprex package (v0.3.0)

I'm not a huge fan of predict() giving two different object types depending on input class, but I can accept there might be reasons for that.

Where it breaks is when predict() is being called under the hood for add_predictions():

penguins %>%
  select(sex, bill_length_mm) %>%
  modelr::add_predictions(my_glm) %>%
  head()
#> # A tibble: 6 x 3
#>   sex    bill_length_mm   pred
#>   <fct>           <dbl>  <dbl>
#> 1 male             39.1 -0.653
#> 2 female           39.5 -0.597
#> 3 female           40.3 -0.487
#> 4 <NA>             NA   NA    
#> 5 female           36.7 -0.983
#> 6 male             39.3 -0.625

penguins %>%
  select(sex, bill_length_mm) %>%
  modelr::add_predictions(my_tm_glm) %>%
  head()
#> # A tibble: 6 x 3
#>   sex    bill_length_mm pred$.pred_class
#>   <fct>           <dbl> <fct>           
#> 1 male             39.1 female          
#> 2 female           39.5 female          
#> 3 female           40.3 female          
#> 4 <NA>             NA   <NA>            
#> 5 female           36.7 female          
#> 6 male             39.3 female

At best, the predictions column is weirdly named, since it's trying to coerce a tibble into a column. I've also run into applications where you get nested tibbles.

Of course, this can be circumvented with type = "raw":

penguins %>%
  select(sex, bill_length_mm) %>%
  modelr::add_predictions(my_tm_glm, type = "raw") %>%
  head()
#> # A tibble: 6 x 3
#>   sex    bill_length_mm   pred
#>   <fct>           <dbl>  <dbl>
#> 1 male             39.1 -0.653
#> 2 female           39.5 -0.597
#> 3 female           40.3 -0.487
#> 4 <NA>             NA   NA    
#> 5 female           36.7 -0.983
#> 6 male             39.3 -0.625

... but that just seems highly counterintuitive, since the whole point of add_predictions() is a shortcut to plop a predictions column in there. It took me a long time to track down the type = "raw" fix. I have to imagine that "raw" represents nearly every use case, and anyone with a corner case need would know to fiddle with type.

Anyways sorry this is so long, just wanted to be clear!

kbodwin commented 3 years ago

I should add that changing the var argument of add_predictions() doesn't solve the problem since it renames the tibble rather than the column; e.g. var = "newname" would produce the column newname$.pred_class.

andrewbaxter439 commented 3 years ago

The type = "raw" fix here is very helpful - thanks @kbodwin!. Would second the suggestion of making it a default in order to ensure that it's predictable what is being returned. In particular this seems a problem with tidymodels as the same predict method being called on an lm object or a model_fit object returns a vector in one case and a tibble in the other. Changing the default or adding in a test for object type maybe could be useful for smoothing the interaction of these two packages?

CarstenLange commented 2 years ago

@kbodwin Thanks for the post! I agree something is wrong with add_predict when working with tidymodels. I used unnest() to fix the bug. See below: penguins %>% select(sex, bill_length_mm) %>% modelr::add_predictions(my_tm_glm) %>% unnest(cols = c(pred)) %>% head()

hadley commented 10 months ago

modelr is now superseded, which means that we'll only perform critical bug fixes needed to keep it on CRAN. Thanks for contributing this idea and my apologies that it took so long to inform you that this package is no longer under development.