marjoleinF / pre

an R package for deriving Prediction Rule Ensembles

predict on tibbles #28

Closed markhwhiteii closed 3 years ago

markhwhiteii commented 3 years ago

I noticed that predict throws an error on tibbles:

> class(dat_test)
[1] "tbl_df"     "tbl"        "data.frame"

> head(predict(m1, dat_test, type = "class"))
Error: Assigned data `as.numeric(data[, sapply(data, is.ordered)])` must be compatible with existing data.
x Existing data has 258 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.

> head(predict(m1, as.data.frame(dat_test), type = "class"))
       1        2        3        4        5        6 
"oppose" "oppose" "oppose" "oppose" "oppose" "oppose" 

marjoleinF commented 3 years ago

Thanks for the report. You did not include a reproducible example, so I have to guess where this issue comes from.

This may occur because you are not providing a data.frame to the predict.pre method; as per the documentation, argument newdata should supply a data.frame. Inputting a tibble is at your own risk.

(The issue may be due to the more limited range of column subsetting methods supported by tibbles. See also https://r4ds.had.co.nz/tibbles.html#interacting-with-older-code. I do not subscribe to the notion that "We don’t use [ ... because dplyr::filter() and dplyr::select() allow you to solve the same problems with clearer code.")
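
For context on the subsetting difference mentioned above: the best-known case is single-column selection with `[`. A data.frame drops to a bare vector by default, whereas a tibble is documented to always return a tibble. Below is a minimal base-R sketch of the data.frame side. The toy data frame `df` is made up for illustration, and the tibble behaviour is only noted in a comment so the snippet runs without the tibble package.

```r
# Toy data, purely illustrative (not from the report above).
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# data.frame: selecting one column with `[` drops to a plain vector ...
one_col <- df[, "a"]
is.data.frame(one_col)   # FALSE
is.integer(one_col)      # TRUE

# ... unless drop = FALSE is given, which keeps the data.frame structure.
# A tibble behaves like this second form by default.
kept <- df[, "a", drop = FALSE]
is.data.frame(kept)      # TRUE
dim(kept)                # 3 1
```

Code written against the first form can therefore behave differently when handed a tibble.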

To fix the issue, this might help:

dat_test <- as.data.frame(dat_test)
predict(m1, dat_test, type = "class")

Also, I made some changes to the latest development version that may fix the issue (without needing the extra line of code above). I would be happy to hear whether this works (I cannot test it myself because there is no reproducible example). It can be installed using:

library("devtools")
install_github("marjoleinF/pre")

markhwhiteii commented 3 years ago

Sorry for the lack of reproducible example—I was in the middle of something and was gonna add one today, but you beat me to it. Check out:

library(pre)
library(tidyverse)
set.seed(1839)
n <- 1000
X <- cbind(1, replicate(20, rnorm(n)))
b <- runif(21)
y <- X %*% b + rnorm(n)

# make tibble
dat_tbl <- X[, -1] %>% # drop intercept
  as_tibble() %>% 
  mutate(y = as.vector(y))

# make data.frame
dat_df <- as.data.frame(dat_tbl)

# check
class(dat_tbl)
class(dat_df)

# make train/test split
training <- sample(seq_len(n), n * .75)

# make pre with a tibble
m1 <- pre(y ~ ., dat_tbl[training, ])

# predict on tbl--note the error
predict(m1, dat_tbl[-training, ])

# and gone when we use data.frame
predict(m1, dat_df[-training, ])

The first call to predict() should throw the error:

Error: Assigned data `as.numeric(data[, sapply(data, is.ordered)])` must be compatible with existing data.
x Existing data has 250 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.

The second predict() call runs without error.

It's weird, because most of the subsetting stuff with tibbles works. One doesn't need to use filter() and select(); the [ notation still works for subsetting tibbles.
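
One possible reading of the error message, sketched in base R only: when no column is an ordered factor, `sapply(data, is.ordered)` is `FALSE` everywhere, so the subset has zero columns and `as.numeric()` of it has length zero. A data.frame quietly accepts that empty result being assigned back, while the tibble error above suggests a tibble insists on a value matching its 250 rows. The form of the assignment is an assumption inferred from the error text, not taken from the pre source, and `dat` is a made-up stand-in for the test data.

```r
# Stand-in for the test data: numeric columns only, no ordered factors.
dat <- data.frame(V1 = c(0.1, 0.2, 0.3), V2 = c(1.5, 2.5, 3.5))

sel <- sapply(dat, is.ordered)   # logical vector, FALSE for every column
any(sel)                         # FALSE

empty <- dat[, sel]              # data.frame with 3 rows and 0 columns
dim(empty)                       # 3 0

vals <- as.numeric(empty)        # zero-length numeric vector
length(vals)                     # 0

# Assumed form of the assignment (inferred from the error message):
# on a data.frame this is silently a no-op.
dat2 <- dat
dat2[, sel] <- vals
dim(dat2)                        # 3 2
```

If this reading is right, the `[` subsetting itself works on both classes; only the zero-length assignment is treated differently.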

My session information:

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.6     purrr_0.3.4     readr_1.4.0     tidyr_1.1.3    
 [7] tibble_3.1.2    ggplot2_3.3.3   tidyverse_1.3.1 pre_1.0.0      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         lubridate_1.7.10   mvtnorm_1.1-1      lattice_0.20-44    plotmo_3.6.0      
 [6] earth_5.3.0        assertthat_0.2.1   glmnet_4.1-1       foreach_1.5.1      utf8_1.2.1        
[11] R6_2.5.0           cellranger_1.1.0   backports_1.2.1    MatrixModels_0.5-0 reprex_2.0.0      
[16] httr_1.4.2         pillar_1.6.1       TeachingDemos_2.12 rlang_0.4.11       readxl_1.3.1      
[21] rstudioapi_0.13    rpart_4.1-15       Matrix_1.3-3       partykit_1.2-13    splines_4.1.0     
[26] munsell_0.5.0      broom_0.7.6        compiler_4.1.0     modelr_0.1.8       pkgconfig_2.0.3   
[31] shape_1.4.6        libcoin_1.0-8      tidyselect_1.1.1   codetools_0.2-18   fansi_0.5.0       
[36] withr_2.4.2        crayon_1.4.1       dbplyr_2.1.1       grid_4.1.0         jsonlite_1.7.2    
[41] gtable_0.3.0       lifecycle_1.0.0    DBI_1.1.1          magrittr_2.0.1     scales_1.1.1      
[46] cli_2.5.0          stringi_1.6.2      fs_1.5.0           xml2_1.3.2         ellipsis_0.3.2    
[51] generics_0.1.0     vctrs_0.3.8        Formula_1.2-4      iterators_1.0.13   tools_4.1.0       
[56] glue_1.4.2         hms_1.1.0          plotrix_3.8-1      survival_3.2-11    colorspace_2.0-1  
[61] rvest_1.0.0        inum_1.0-4         haven_2.4.1  

I updated using install_github() to 1.0.1 and get the same error:

> # predict on tbl--note the error
> predict(m1, dat_tbl[-training, ])
Error: Assigned data `as.numeric(data[, sapply(data, is.ordered)])` must be compatible with existing data.
x Existing data has 250 rows.
x Assigned data has 0 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
> packageVersion("pre")
[1] ‘1.0.1’

Would it be feasible to do something like:

if (inherits(data, c("tbl_df", "tbl"))) data <- as.data.frame(data)

Given the widespread use of the tidyverse? I'm currently working on writing a chapter where we use the tidyverse for cleaning and arranging data, and we're also including pre() for getting interpretable—but predictive—models. If not, no worries, I can always just tell folks to wrap their data in as.data.frame() in predict (like I did in my original post).
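
A minimal sketch of that guard as a standalone helper, assuming a hypothetical function name `coerce_newdata` (not part of pre). To keep the snippet runnable without the tidyverse, it fakes a tibble by setting the class vector by hand. With a real tibble the same `inherits()` check would apply.

```r
# Hypothetical helper, for illustration only; not the package's actual code.
coerce_newdata <- function(newdata) {
  # inherits() returns TRUE if the object has any of the listed classes.
  if (inherits(newdata, c("tbl_df", "tbl"))) {
    newdata <- as.data.frame(newdata)
  }
  newdata
}

# Fake tibble: a data.frame carrying the tibble class vector.
fake_tbl <- structure(
  data.frame(x = 1:3, y = c(2.5, 3.5, 4.5)),
  class = c("tbl_df", "tbl", "data.frame")
)

plain <- coerce_newdata(fake_tbl)
class(plain)   # "data.frame"
```

Plain data.frames pass through unchanged, so a guard like this could sit at the top of a predict method without affecting existing users.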

Thanks!

marjoleinF commented 3 years ago

Thanks! Helpful example, and I fully agree with the need to accommodate tidyverse use! Your example also sheds more light on the difference between data.frames and tibbles; I must admit there is something to be said for the error the tibble throws. I have implemented your suggestion; it is available in the current development version (GitHub) and is now on its way to CRAN, because the issue report coincided with other updates.

marjoleinF commented 3 years ago

On CRAN now!