tidyverse / modelr

Helper functions for modelling
https://modelr.tidyverse.org
GNU General Public License v3.0

Adding residuals for nonstandard models to tibble of resamples #32

Closed: tchakravarty closed this issue 6 years ago

tchakravarty commented 8 years ago

I could not get add_residuals to work on a random forest model when working with resampled data:

library(tidyverse)
library(modelr)
library(randomForest)
library(broom)

df_foo = data_frame(
  x1 = rnorm(1000),
  x2 = rnorm(1000),
  y = rnorm(1000)
)

fit_rf = function(data) {
  randomForest(
    y ~ ., 
    data = data
    )
}

# add_residuals
df_foo = df_foo %>% 
  crossv_kfold(k = 10) %>% 
  mutate(
    `RF (Model)` = map(
      .x = train, 
      .f = fit_rf
    ),
    `RF (Residuals)` = map2(
      .x = as.data.frame(train), 
      .y = `RF (Model)`,
      add_residuals
    )
  ) 

I also tried to broom::augment the residuals in, but that didn't work either.

I then tried to write the logic myself:

df_foo = df_foo %>% 
  crossv_kfold(k = 10) %>% 
  mutate(
    `RF (Model)` = map(
      .x = train, 
      .f = fit_rf
    ),
    `RF (Predictions)` = map2(
      .x = as.data.frame(train), 
      .y = `RF (Model)`,
      predict
    ),
    `RF (Residuals)` = map(
      .x = as.data.frame(train), 
      .f = "y"
    ) - `RF (Predictions)`
  ) 

but this terminates with the error: Error: .x (30) and .y (10) are different lengths. I am not sure whether I am doing something wrong, but the end goal is to be able to add a list-column of vectors of residuals.
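For reference, a minimal sketch of the manual approach that should run, assuming the problem is that as.data.frame(train) converts the whole train list-column at once instead of each resample, and that the subtraction needs to happen per fold rather than between two lists. The names df, folds, pred and resid are illustrative only, and tibble() is used in place of data_frame().

```r
library(tidyverse)
library(modelr)
library(randomForest, warn.conflicts = FALSE)

set.seed(1)
df <- tibble(x1 = rnorm(200), x2 = rnorm(200), y = rnorm(200))

folds <- df %>%
  crossv_kfold(k = 4) %>%
  mutate(
    # Convert each resample to a data frame inside the mapped function
    model = map(train, ~ randomForest(y ~ ., data = as.data.frame(.x))),
    # One prediction vector per fold
    pred = map2(model, train, ~ predict(.x, newdata = as.data.frame(.y))),
    # Subtract per fold with map2(); `-` is not defined for two lists
    resid = map2(train, pred, ~ as.data.frame(.x)$y - .y)
  )
```

Here folds$resid is a list-column holding one numeric vector of residuals per fold.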

PS. Can the as.data.frame calls be made redundant when mapping non-modelr functions to resamples, perhaps by adding a data_frame class to them?


Edit:

So I tried the advice here to get the call to augment to work, but that fails:

df_foo = df_foo %>% 
  crossv_kfold(k = 10) %>% 
  mutate(
    `RF (Model)` = map(
      .x = train, 
      .f = fit_rf
    ),
    `RF (Residuals)` = map2(
      .x = train, 
      .y = `RF (Model)`,
      ~ augment(x = .y, newdata = .x)
    )
  ) 

but this failed with an error: Error: augment doesn't know how to deal with data of class randomForest.formularandomForest.

I tried the same approach using add_residuals:

df_foo = df_foo %>% 
  crossv_kfold(k = 10) %>% 
  mutate(
    `RF (Model)` = map(
      .x = train, 
      .f = fit_rf
    ),
    `RF (Residuals)` = map2(
      .x = train, 
      .y = `RF (Model)`,
      ~ add_residuals(data = .x, model = .y)
    )
  ) 

but that appears to have "failed gracefully" -- giving the resamples back.

hadley commented 6 years ago

You were just putting as.data.frame() in the wrong place:

library(tidyverse)
library(modelr)
library(randomForest, warn.conflicts = FALSE)
#> Warning: package 'randomForest' was built under R version 3.4.4
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.

df <- data_frame(
  x1 = rnorm(500),
  x2 = rnorm(500),
  y = rnorm(500)
)
mods <- df %>% 
  crossv_kfold(k = 4) %>% 
  mutate(model = map(train, ~ randomForest(y ~ ., data = .x)))

mods %>% mutate(resid = map2(train, model, ~ add_residuals(as.data.frame(.x), .y)))
#> # A tibble: 4 x 5
#>   train          test           .id   model                      resid    
#>   <list>         <list>         <chr> <list>                     <list>   
#> 1 <S3: resample> <S3: resample> 1     <S3: randomForest.formula> <tibble …
#> 2 <S3: resample> <S3: resample> 2     <S3: randomForest.formula> <tibble …
#> 3 <S3: resample> <S3: resample> 3     <S3: randomForest.formula> <tibble …
#> 4 <S3: resample> <S3: resample> 4     <S3: randomForest.formula> <tibble …

Created on 2018-05-10 by the reprex package (v0.2.0).
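Building on the reprex above, here is a hedged sketch of how the original goal (a list-column of residual vectors) might be reached, and how the same pattern could be pointed at the held-out test resamples. It assumes add_residuals() keeps its default output column name resid. The column names test_resid, test_resid_vec and test_rmse are made up for illustration.

```r
library(tidyverse)
library(modelr)
library(randomForest, warn.conflicts = FALSE)

set.seed(1)
df <- tibble(x1 = rnorm(200), x2 = rnorm(200), y = rnorm(200))

mods <- df %>%
  crossv_kfold(k = 4) %>%
  mutate(
    model = map(train, ~ randomForest(y ~ ., data = as.data.frame(.x))),
    # Out-of-sample residuals: same call as above, but on the test resample
    test_resid = map2(test, model, ~ add_residuals(as.data.frame(.x), .y)),
    # Pull the `resid` column out of each data frame as a plain vector
    test_resid_vec = map(test_resid, "resid"),
    # Per-fold RMSE computed directly from the residual vectors
    test_rmse = map_dbl(test_resid_vec, ~ sqrt(mean(.x^2)))
  )
```

With 200 rows and k = 4, each element of test_resid_vec should hold 50 residuals.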