Error in cbind2(1, newx) %*% nbeta : invalid class 'NA' to dup_mMatrix_as_dgeMatrix #200

Closed by konradsemsch 5 years ago

konradsemsch commented 5 years ago

I took the example listed in this blogpost and tried to replicate it using glmnet: https://www.alexpghayes.com/blog/implementing-the-super-learner-with-tidymodels/

I wanted to do binary classification, so I excluded one of the factor levels, but otherwise changed as little as possible to get it running. When I get to the part where I make predictions on each split's assessment set, I get the following error:

Error in cbind2(1, newx) %*% nbeta : 
  invalid class 'NA' to dup_mMatrix_as_dgeMatrix

More specifically, it breaks in this part, where I try to make predictions on the hold-out set:

en_fits_cv_pred <- en_fits_cv %>%
  mutate(
    preds = future_pmap(list(fit, splits, prepped), predict_helper)
  )

I also tried running the prediction with only one model fit, to rule out something breaking in the map, but the error persists:

predict(en_fits_cv$fit[[1]], new_data = juice(prep(en_rec, retain = TRUE)))
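One thing that may be worth checking at this point (an assumption, not a confirmed diagnosis): `juice()` without a selector returns the outcome column together with the predictors, and base R's `as.matrix()` turns a data frame that contains a factor into a character matrix. The sketch below uses only base R and the same two-species subset of `iris` to show that behaviour. Whether this is what trips up the glmnet prediction here is not established in this thread.

```r
# Base R only; no tidymodels functions are involved in this check.
df <- droplevels(subset(iris, Species != "setosa"))

# With the factor outcome included, as.matrix() falls back to character.
with_outcome <- as.matrix(df)
is.numeric(with_outcome)     # FALSE

# With the predictor columns only, the matrix stays numeric.
predictors_only <- as.matrix(df[, setdiff(names(df), "Species")])
is.numeric(predictors_only)  # TRUE

# Quick way to spot non-numeric columns before calling predict().
sapply(df, class)
```

If the data handed to `predict()` shows a factor column in that last check, dropping it before predicting is a cheap experiment.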

The full code I'm running is the following:


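# Loading libraries -------------------------------------------------------

# (library calls were not shown in the post; these are inferred from the functions used)
library(tidyverse)
library(tidymodels)
library(furrr)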


# Loading input dataset ---------------------------------------------------

df_all <- iris %>% 
  filter(Species != "setosa") %>% 
  mutate(Species = factor(Species, levels = c("versicolor", "virginica")))

# Dividing the dataset ----------------------------------------------------

df_train_cv <- vfold_cv(df_all, v = 5, repeats = 1)

# Preparing the recipes ----------------------------------------------------

# I need to add a custom step over here on the missing patterns

en_rec <- df_all %>% 
  recipe(Species ~ .) %>% 
  step_pca(all_predictors(), num_comp = 2)

# Training models within resamples ----------------------------------------

fit_on_fold <- function(spec, prepped) {

  x <- juice(prepped, all_predictors())
  y <- juice(prepped, all_outcomes())

  fit_xy(spec, x, y)
}

en_engine <- logistic_reg(mode = "classification") %>% 
  set_engine("glmnet")

en_grid <- grid_regular(penalty, mixture, levels = c(2, 2))

en_spec <- tibble(spec = merge(en_engine, en_grid)) %>%  # combining model engine with different parameters
  mutate(model_id = row_number())

en_spec_cv <- crossing(df_train_cv, en_spec) # adding cross-validated folds

en_fits_cv <- en_spec_cv %>% # fitting different model specifications to different folds
  mutate(
    prepped = future_map(splits, prepper, en_rec),
    fit = future_map2(spec, prepped, fit_on_fold)
  )

# Making holdout predictions ----------------------------------------------

predict_helper <- function(fit, new_data, recipe) {

  # new_data can either be an rsample::rsplit object
  # or a data frame of genuinely new data

  if (inherits(new_data, "rsplit")) {
    obs <- as.integer(new_data, data = "assessment")

    # never forget to bake when predicting with recipes!
    new_data <- bake(recipe, assessment(new_data))
  } else {
    obs <- 1:nrow(new_data)
    new_data <- bake(recipe, new_data)
  }

  # if you want to generalize this code to a regression
  # super learner, you'd need to set `type = "response"` here

  predict(fit, new_data, type = "prob") %>% 
    mutate(obs = obs)
}

en_fits_cv_pred <- en_fits_cv %>%
  mutate(
    preds = future_pmap(list(fit, splits, prepped), predict_helper)
  )

I've been looking for help around the internet, but unfortunately I'm absolutely lost about what the root cause could be. Could anyone assist?

My session info below:

topepo commented 5 years ago

Honestly, I have no idea. I rewrote the prediction helper function to be a little simpler and rearranged the arguments (OCD :-/). I also added a performance metric below.

We're working on model tuning right now that will make this a lot easier. The use of crossing() is fine, but you probably won't have to do that once we have the better API in place.
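On the rearranged arguments: `purrr::pmap()` matches an unnamed list to the function's arguments by position and a named list by name, so the order in `list(...)` has to follow the helper's signature unless the elements are named. The sketch below illustrates this with `purrr::pmap_chr()` and a hypothetical `describe()` function standing in for the helper; it assumes `furrr::future_pmap()` matches arguments the same way, which is how furrr describes its purrr counterparts.

```r
library(purrr)

# Hypothetical stand-in for a helper with three arguments.
describe <- function(fit, split, recipe) {
  paste(fit, split, recipe)
}

fits    <- c("f1", "f2")
splits  <- c("s1", "s2")
recipes <- c("r1", "r2")

# Unnamed list: elements are matched to arguments by position.
by_position <- pmap_chr(list(fits, splits, recipes), describe)

# Named list: elements are matched by name, so their order does not matter.
by_name <- pmap_chr(list(recipe = recipes, fit = fits, split = splits), describe)

identical(by_position, by_name)  # TRUE
```

Naming the list elements is a small safeguard when a helper's argument order changes, as it does between the two versions in this thread.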


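# Loading libraries -------------------------------------------------------

# (library calls were not shown in the post; these are inferred from the functions used)
library(tidyverse)
library(tidymodels)
library(furrr)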

#> Registered S3 method overwritten by 'rvest':
#>   method            from
#>   read_xml.response xml2
#> ── Attaching packages ──────────────────────────────────────────────────────── tidymodels 0.0.2 ──
#> ✔ broom     0.5.2       ✔ recipes   0.1.6  
#> ✔ dials     0.0.2       ✔ rsample   0.0.5  
#> ✔ infer     ✔ yardstick 0.0.3  
#> ✔ parsnip   0.0.3
#> ── Conflicts ─────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ scales::discard()  masks purrr::discard()
#> ✖ tidyr::extract()   masks magrittr::extract()
#> ✖ dplyr::filter()    masks stats::filter()
#> ✖ recipes::fixed()   masks stringr::fixed()
#> ✖ dplyr::lag()       masks stats::lag()
#> ✖ purrr::set_names() masks magrittr::set_names()
#> ✖ yardstick::spec()  masks readr::spec()
#> ✖ recipes::step()    masks stats::step()
#> Loading required package: future

# Loading input dataset ---------------------------------------------------

df_all <- iris %>% 
  filter(Species != "setosa") %>% 
  mutate(Species = factor(Species, levels = c("versicolor", "virginica")))

# Dividing the dataset ----------------------------------------------------

df_train_cv <- vfold_cv(df_all, v = 5, repeats = 1)

# Preparing the recipes ----------------------------------------------------

# I need to add a custom step over here on the missing patterns

en_rec <- df_all %>% 
  recipe(Species ~ .) %>% 
  step_pca(all_predictors(), num_comp = 2)

# Training models within resamples ----------------------------------------

fit_on_fold <- function(spec, prepped) {

  x <- juice(prepped, all_predictors())
  y <- juice(prepped, all_outcomes())

  fit_xy(spec, x, y)
}

en_engine <- logistic_reg(mode = "classification") %>% 
  set_engine("glmnet")

en_grid <- grid_regular(penalty, mixture, levels = c(2, 2))

en_spec <- tibble(spec = merge(en_engine, en_grid)) %>%  # combining model engine with different parameters
  mutate(model_id = row_number())

en_spec_cv <- crossing(df_train_cv, en_spec) # adding cross-validated folds

en_fits_cv <- en_spec_cv %>% # fitting different model specifications to different folds
  mutate(
    prepped = future_map(splits, prepper, en_rec),
    fit = future_map2(spec, prepped, fit_on_fold)
  )

predict_helper <- function(split, recipe, fit) {

  new_x <- bake(recipe, new_data = assessment(split), all_predictors())

  predict(fit, new_x, type = "prob") %>% 
    bind_cols(assessment(split) %>% select(Species))
}

en_fits_cv_pred <- en_fits_cv %>%
  mutate(
    preds = future_pmap(list(splits, prepped, fit), predict_helper)
  )

indiv_estimates <- 
  en_fits_cv_pred %>% 
  unnest(preds) %>% 
  group_by(id, model_id) %>% 
  # or some other performance measure:
  mn_log_loss(truth = Species, .pred_virginica)

rs_estimates <- 
  indiv_estimates %>% 
  group_by(model_id, .metric, .estimator) %>% 
  summarize(mean = mean(.estimate, na.rm = TRUE))

rs_estimates

#> # A tibble: 4 x 4
#> # Groups:   model_id, .metric [4]
#>   model_id .metric     .estimator  mean
#>      <int> <chr>       <chr>      <dbl>
#> 1        1 mn_log_loss binary     2.36 
#> 2        2 mn_log_loss binary     0.938
#> 3        3 mn_log_loss binary     9.45 
#> 4        4 mn_log_loss binary     0.691

Created on 2019-07-31 by the reprex package (v0.2.1)
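For reading the table above: mean log loss is a lower-is-better metric, and a binary classifier that always predicts a probability of 0.5 scores `-log(0.5)`, about 0.693. The base R sketch below copies the four rounded means from the printed output into a plain data frame (the name `rs` is just for illustration), picks the lowest, and compares it with that baseline.

```r
# Rounded values copied from the printed resampling estimates above.
rs <- data.frame(
  model_id = 1:4,
  mean     = c(2.36, 0.938, 9.45, 0.691)
)

# Lowest mean log loss across the four specifications.
best <- rs[which.min(rs$mean), ]
best$model_id  # 4

# Log loss of always predicting 0.5 in a two-class problem.
baseline <- -log(0.5)
round(baseline, 3)  # 0.693

# Difference between the best model and the constant-0.5 baseline.
best$mean - baseline
```

On these rounded numbers the best specification is only marginally below the constant-prediction baseline, so the grid may deserve a closer look before any of these models is used.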

konradsemsch commented 5 years ago

Thanks @topepo for taking a look!

