markvanderloo / simputation

Making imputation easy
GNU General Public License v3.0
91 stars 11 forks source link

Using impute_mf with a tibble #24

Closed bcjaeger closed 5 years ago

bcjaeger commented 5 years ago

It looks like the missForest package isn't tibble friendly. If you want to pass a tibble into the simputation function impute_mf, the following error pops up.


library(tidyverse)
library(simputation)
library(missForest)
#> Loading required package: randomForest
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#> 
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:simputation':
#> 
#>     na.roughfix
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
#> The following object is masked from 'package:ggplot2':
#> 
#>     margin
#> Loading required package: foreach
#> 
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#> 
#>     accumulate, when
#> Loading required package: itertools
#> Loading required package: iterators

df=tibble::tibble(y=rnorm(100),x1=rnorm(100),x2=rnorm(100))
for(i in seq_along(names(df))) df[sample(1:nrow(df),10),i]=NA
df
#> # A tibble: 100 x 3
#>           y      x1      x2
#>       <dbl>   <dbl>   <dbl>
#>  1   0.252    0.418  1.08  
#>  2   0.665    0.565 -1.73  
#>  3   0.592   -0.264 -0.356 
#>  4  NA       -1.36   0.884 
#>  5   0.0819  -1.42   0.481 
#>  6   0.407   NA      0.732 
#>  7  -0.750    0.862  0.409 
#>  8   1.21    -1.33  -0.0501
#>  9  NA        0.644  0.206 
#> 10   1.38    -2.77  -0.519 
#> # ... with 90 more rows

# This will throw an error because df is a tibble and missForest
# expects a data frame as an input for its xmis argument

df %>% impute_mf(formula=y~.)  
#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#>   missForest iteration 1 in progress...
#> Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
#> mtry, : The response has five or fewer unique values. Are you sure you want
#> to do regression?
#> Warning: Could not execute missForest::missForest: length of response must be the same as predictors
#>  Returning original data
#> # A tibble: 100 x 3
#>           y      x1      x2
#>       <dbl>   <dbl>   <dbl>
#>  1   0.252    0.418  1.08  
#>  2   0.665    0.565 -1.73  
#>  3   0.592   -0.264 -0.356 
#>  4  NA       -1.36   0.884 
#>  5   0.0819  -1.42   0.481 
#>  6   0.407   NA      0.732 
#>  7  -0.750    0.862  0.409 
#>  8   1.21    -1.33  -0.0501
#>  9  NA        0.644  0.206 
#> 10   1.38    -2.77  -0.519 
#> # ... with 90 more rows

# This runs okay because the tibble is converted to a data.frame
# before it is passed to the missForest function.

df_okay %>% as.data.frame() %>% impute_mf(y~.)
#> Error in eval(lhs, parent, parent): object 'df_okay' not found

head(df_okay)
#> Error in head(df_okay): object 'df_okay' not found

# This adjustment to the impute_mf function makes the initial code
# run without throwing an error. 

impute_mf_bcj<-function (dat, formula, ...) 
{
  stopifnot(inherits(formula, "formula"))
  if (simputation:::not_installed("missForest")) 
    return(dat)
  imputed <- simputation:::get_imputed(formula, dat)
  predictors <- simputation:::get_predictors(formula, dat, ...)
  vars <- unique(c(imputed, predictors))
  imp <- tryCatch(missForest::missForest(as.data.frame(dat[vars]), ...)[[1]], 
    error = function(e) {
      warnf("Could not execute missForest::missForest: %s\n Returning original data", 
        e$message)
      dat
    })
  if (length(imputed) == 0) {
    dat[vars] <- imp[vars]
  }
  else {
    dat[imputed] <- imp[imputed]
  }
  dat
}

df %>% impute_mf_bcj(formula=y~.)
#>   missForest iteration 1 in progress...done!
#>   missForest iteration 2 in progress...done!
#>   missForest iteration 3 in progress...done!
#>   missForest iteration 4 in progress...done!
#> # A tibble: 100 x 3
#>          y      x1      x2
#>      <dbl>   <dbl>   <dbl>
#>  1  0.252    0.418  1.08  
#>  2  0.665    0.565 -1.73  
#>  3  0.592   -0.264 -0.356 
#>  4  0.775   -1.36   0.884 
#>  5  0.0819  -1.42   0.481 
#>  6  0.407   NA      0.732 
#>  7 -0.750    0.862  0.409 
#>  8  1.21    -1.33  -0.0501
#>  9  1.07     0.644  0.206 
#> 10  1.38    -2.77  -0.519 
#> # ... with 90 more rows

Created on 2019-01-20 by the reprex package (v0.2.1)

bcjaeger commented 5 years ago

Sorry, I had a bug in my first post.


library(tidyverse)
library(simputation)
library(missForest)
#> Loading required package: randomForest
#> randomForest 4.6-14
#> Type rfNews() to see new features/changes/bug fixes.
#> 
#> Attaching package: 'randomForest'
#> The following object is masked from 'package:simputation':
#> 
#>     na.roughfix
#> The following object is masked from 'package:dplyr':
#> 
#>     combine
#> The following object is masked from 'package:ggplot2':
#> 
#>     margin
#> Loading required package: foreach
#> 
#> Attaching package: 'foreach'
#> The following objects are masked from 'package:purrr':
#> 
#>     accumulate, when
#> Loading required package: itertools
#> Loading required package: iterators

df=tibble::tibble(y=rnorm(100),x1=rnorm(100),x2=rnorm(100))
for(i in seq_along(names(df))) df[sample(1:nrow(df),10),i]=NA
df
#> # A tibble: 100 x 3
#>           y       x1     x2
#>       <dbl>    <dbl>  <dbl>
#>  1   0.636    1.20    0.439
#>  2  NA       -0.0216  0.475
#>  3   0.0709   1.58    1.13 
#>  4  -0.425   -0.272  -0.856
#>  5  -0.265    0.212  -0.475
#>  6  -1.16    -0.117  -0.911
#>  7   1.23     0.0315  0.824
#>  8   0.783   NA       1.01 
#>  9  -0.0249   1.02   -0.490
#> 10   0.481    0.888   2.35 
#> # ... with 90 more rows

# This will throw an error because df is a tibble and missForest
# expects a data frame as an input for its xmis argument

df %>% impute_mf(formula=y~.)  
#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA

#> Warning in mean.default(xmis[, t.co], na.rm = TRUE): argument is not
#> numeric or logical: returning NA
#>   missForest iteration 1 in progress...
#> Warning in randomForest.default(x = obsX, y = obsY, ntree = ntree, mtry =
#> mtry, : The response has five or fewer unique values. Are you sure you want
#> to do regression?
#> Warning: Could not execute missForest::missForest: length of response must be the same as predictors
#>  Returning original data
#> # A tibble: 100 x 3
#>           y       x1     x2
#>       <dbl>    <dbl>  <dbl>
#>  1   0.636    1.20    0.439
#>  2  NA       -0.0216  0.475
#>  3   0.0709   1.58    1.13 
#>  4  -0.425   -0.272  -0.856
#>  5  -0.265    0.212  -0.475
#>  6  -1.16    -0.117  -0.911
#>  7   1.23     0.0315  0.824
#>  8   0.783   NA       1.01 
#>  9  -0.0249   1.02   -0.490
#> 10   0.481    0.888   2.35 
#> # ... with 90 more rows

# This runs okay because the tibble is converted to a data.frame
# before it is passed to the missForest function.

df_okay = df %>% as.data.frame() %>% impute_mf(y~.)
#>   missForest iteration 1 in progress...done!
#>   missForest iteration 2 in progress...done!
#>   missForest iteration 3 in progress...done!
#>   missForest iteration 4 in progress...done!

head(df_okay)
#>              y          x1         x2
#> 1  0.635826930  1.20173858  0.4385318
#> 2  0.006728643 -0.02163107  0.4746254
#> 3  0.070928781  1.57586872  1.1326571
#> 4 -0.425370934 -0.27158529 -0.8562253
#> 5 -0.265313414  0.21216662 -0.4754992
#> 6 -1.162756629 -0.11686152 -0.9114964

# This adjustment to the impute_mf function makes the initial code
# run without throwing an error. 

impute_mf_bcj<-function (dat, formula, ...) 
{
  stopifnot(inherits(formula, "formula"))
  if (simputation:::not_installed("missForest")) 
    return(dat)
  imputed <- simputation:::get_imputed(formula, dat)
  predictors <- simputation:::get_predictors(formula, dat, ...)
  vars <- unique(c(imputed, predictors))
  imp <- tryCatch(missForest::missForest(as.data.frame(dat[vars]), ...)[[1]], 
    error = function(e) {
      warnf("Could not execute missForest::missForest: %s\n Returning original data", 
        e$message)
      dat
    })
  if (length(imputed) == 0) {
    dat[vars] <- imp[vars]
  }
  else {
    dat[imputed] <- imp[imputed]
  }
  dat
}

df %>% impute_mf_bcj(formula=y~.)
#>   missForest iteration 1 in progress...done!
#>   missForest iteration 2 in progress...done!
#>   missForest iteration 3 in progress...done!
#>   missForest iteration 4 in progress...done!
#>   missForest iteration 5 in progress...done!
#>   missForest iteration 6 in progress...done!
#>   missForest iteration 7 in progress...done!
#> # A tibble: 100 x 3
#>          y       x1     x2
#>      <dbl>    <dbl>  <dbl>
#>  1  0.636    1.20    0.439
#>  2  0.272   -0.0216  0.475
#>  3  0.0709   1.58    1.13 
#>  4 -0.425   -0.272  -0.856
#>  5 -0.265    0.212  -0.475
#>  6 -1.16    -0.117  -0.911
#>  7  1.23     0.0315  0.824
#>  8  0.783   NA       1.01 
#>  9 -0.0249   1.02   -0.490
#> 10  0.481    0.888   2.35 
#> # ... with 90 more rows

Created on 2019-01-20 by the reprex package (v0.2.1)

barryrowlingson commented 5 years ago

impute_mf is just passing down what its given - why not file a report against the missForest function/ package? Otherwise you are just going to have a chain of useless as.data.frame calls protecting against people passing tibbles. Or you could accept that a tibble does not behave like a data frame, read the docs for functions and when they say "data frame", pass them a data frame. Do the conversion the once, at the user level.

bcjaeger commented 5 years ago

impute_mf is just passing down what its given - why not file a report against the missForest function/ package?

That would be perfectly reasonable. I brought it to your attention because the tibble is a construct of the tidyverse, and I think that the simputation belongs to that set of packages? (I may be wrong.) I thought it would make sense for a tidyverse package to make sure users could pass tibble objects into its functions without needing to manually translate them to data frames (and then maybe back to a tibble).

Otherwise you are just going to have a chain of useless as.data.frame calls protecting against people passing tibbles. Or you could accept that a tibble does not behave like a data frame, read the docs for functions and when they say "data frame", pass them a data frame. Do the conversion the once, at the user level.

I am happy to do these conversions at the user level if you do not think it should be automated. I agree with your sentiment. Analysts should always be cautious about object classes and read documentation carefully. Alternatively, I could try to write a few updates to the mf functions and send you a pull request. Either option is fine with me. Thank you for reading my thoughts and building a very nice set of functions for imputation.

markvanderloo commented 5 years ago

Hi, simputation is not part of the tidyverse (note the lack of a hex sticker ;-p). The reason for the package is that we needed a way to combine many fairly simple imputation methods.

I tried and tested several designs before simputation became what is is now. E.g. I experimented with an imputation DSL, an object model, and several other ideas. In the end I found this the easiest interface. It really just happened to coincide with the tidyverse principles (they're good principles). So I can see why users suspect that simputation is a part of it. Of course I did choose to recognize grouping as created by dplyr::group_by once I made that choice, since why not?

Regarding tibble. It is an unfortunate choice of its authors that a tibble formally inherits from data.frame (i.e. it tells every R function: 'hey, I'm a data.frame!') but at the same time alters its behavior. As consequence, and because of the popularity of tidyverse, package authors outside of the tidyverse are asked to update their packages, even when they do not depend or import the tibble package. I'm doubtful about what to do. The tidyverse packages are known to change interface fairly frequently (see e.g. this discussion and that could yield extra work for me, even though I'm not using any tibble functionality.

bcjaeger commented 5 years ago

Hi, I appreciate the difficulty of writing code that is impervious to changes in the tidy structure. It is not that much trouble to handle type conversion at the user level, so I don't think it's all that pressing to address this at the moment. Thank you for your thoughtful response.