juliasilge / supervised-ML-case-studies-course

Supervised machine learning case studies in R! 💫 A free interactive tidymodels course
https://supervised-ml-course.netlify.app/
MIT License

Factor/character errors #47

Closed: msevi closed this issue 4 years ago

msevi commented 4 years ago

Hello! I'm going over Chapter 1. In sections 8 and 9:

results <- car_test %>%
    mutate(MPG = log(MPG)) %>%
    bind_cols(predict(fit_lm, car_test) %>%
                  rename(.pred_lm = .pred)) %>%
    bind_cols(predict(fit_rf, car_test) %>%
                  rename(.pred_rf = .pred))

Produces:

Error in predict.randomForest(object = object$fit, newdata = new_data) : 
  New factor levels not present in the training data

I solved this by converting the character columns to factors before splitting the data:

car_vars <- cars2018 %>%
  select(-Model, -`Model Index`) %>%
  mutate(across(where(is.character), as.factor))
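
One plausible reason the order matters here: a factor created before the split carries the full set of levels into both pieces, while factors created separately on each piece only know the values they have seen. The base R sketch below illustrates that general behavior with made-up fuel values. It is an illustration under that assumption, not a reproduction of what parsnip or randomForest do internally.

```r
# Hypothetical values, chosen so that one level appears only in the "test" rows
fuel <- c("Regular", "Premium", "Regular", "Diesel")
train_idx <- 1:3

# Convert BEFORE splitting: subsetting a factor keeps all of its levels,
# so the training and testing pieces agree on the level set
fuel_f <- factor(fuel)
levels_train_before <- levels(fuel_f[train_idx])
levels_test_before <- levels(fuel_f[-train_idx])
identical(levels_train_before, levels_test_before)

# Convert AFTER splitting: each piece builds its own levels
train_f <- factor(fuel[train_idx])
test_f <- factor(fuel[-train_idx])

# Levels present in the test piece but unknown to the training piece
new_levels <- setdiff(levels(test_f), levels(train_f))
new_levels
```

In the second case the test piece contains a level the training piece never saw, which is the kind of mismatch the "New factor levels not present in the training data" message describes.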

However, in section 11:

car_boot <- bootstraps(car_train)
rf_res <- rf_mod %>%
    fit_resamples(
        MPG ~ .,
        resamples = car_boot,
        control = control_resamples(save_pred = TRUE)
    )

produces

x Bootstrap01: formula: Error: Functions involving factors or characters have been detected on the RHS of formula. These are not allowed when indicators = "none". Functions involving factors were detected for the following columns: 'Lockup Torque Converter', 'Recommended Fuel', 'Fuel injection'.

I did notice that the tidymodels version for the course is 0.1.0 and mine is 0.1.1. Is it just a version issue, or do you have any advice on how to solve the error above?

Regards, Maria

msevi commented 4 years ago

Update:

I've downloaded the corresponding RDS file for car_train, and when I run the code exactly as in Section 11 of Chapter 1, I still get the error:

formula: Error: Functions involving factors or characters have been detected on the RHS of `formula`. These are not allowed when `indicators = "none"`. Functions involving factors were detected for the following columns: 'Lockup Torque Converter', 'Recommended Fuel', 'Fuel injection'.

juliasilge commented 4 years ago

You are correct that this is due to some changes in tidymodels, particularly how parsnip handles the predictor encodings and model.matrix() business under the hood. A goal I have with this first chapter is to not introduce too many things at once, so I may need to change up a few aspects of how the data is stored to reduce this tension.

There are two things going on here.

I need to update the chapter for all of this but in the meantime, you can do:

car_vars <- cars2018 %>%
    select(-Model, -`Model Index`) %>%
    janitor::clean_names() %>%
    mutate_if(is.character, factor)

or with across() like you showed.

When I do this, both predict() and fit_resamples() work.

rf_mod %>%
    fit_resamples(
        log(mpg) ~ .,
        car_boot,
        control = control_resamples(save_pred = TRUE)
    )
# Resampling results
# Bootstrap sampling 
# A tibble: 25 x 5
   splits           id          .metrics        .notes         .predictions     
   <list>           <chr>       <list>          <list>         <list>           
 1 <split [917/343… Bootstrap01 <tibble [2 × 3… <tibble [0 × … <tibble [343 × 3…
 2 <split [917/330… Bootstrap02 <tibble [2 × 3… <tibble [0 × … <tibble [330 × 3…
 3 <split [917/346… Bootstrap03 <tibble [2 × 3… <tibble [0 × … <tibble [346 × 3…
 4 <split [917/335… Bootstrap04 <tibble [2 × 3… <tibble [0 × … <tibble [335 × 3…
 5 <split [917/345… Bootstrap05 <tibble [2 × 3… <tibble [0 × … <tibble [345 × 3…
 6 <split [917/351… Bootstrap06 <tibble [2 × 3… <tibble [0 × … <tibble [351 × 3…
 7 <split [917/342… Bootstrap07 <tibble [2 × 3… <tibble [0 × … <tibble [342 × 3…
 8 <split [917/322… Bootstrap08 <tibble [2 × 3… <tibble [0 × … <tibble [322 × 3…
 9 <split [917/330… Bootstrap09 <tibble [2 × 3… <tibble [0 × … <tibble [330 × 3…
10 <split [917/342… Bootstrap10 <tibble [2 × 3… <tibble [0 × … <tibble [342 × 3…
# … with 15 more rows

Thanks for the report! 🙌

msevi commented 4 years ago

Awesome! Thank you so much. Confirming that it works :)

tanthiamhuat commented 3 years ago

I encountered the same error as mentioned above. Initially, I thought that fixing car_train as below would work:

car_train <- car_train %>%
    mutate_if(is.character, as.factor)

since the affected code uses car_train in the random forest portion:

results <- car_train %>%
    mutate(mpg = log(mpg)) %>%
    bind_cols(predict(fit_lm, car_train) %>%
                  rename(.pred_lm = .pred)) %>%
    bind_cols(predict(fit_rf, car_train) %>%
                  rename(.pred_rf = .pred))

But this does not solve the error. Why? Your solution below does work:

car_vars <- cars2018 %>%
    select(-Model, -`Model Index`) %>%
    janitor::clean_names() %>%
    mutate_if(is.character, factor)

juliasilge commented 3 years ago

The reason that just converting from character to factor doesn't solve the problem is that some of the column names have spaces in them, which does not play well with the internals of, I think, the randomForest package. We can fix this by using janitor::clean_names() to make all the column names nicer.
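
To make the column-name point concrete, here is a small base R sketch on a toy data frame with made-up values. The helper `to_snake()` is hypothetical and only approximates what janitor::clean_names() does for simple names like these. In the course code you would call janitor::clean_names() itself.

```r
# Toy data frame (hypothetical values) that keeps the spaces in its column names
cars_toy <- data.frame(
  `Recommended Fuel` = c("Regular", "Premium", "Regular"),
  `Fuel injection` = c("Direct", "Multipoint", "Direct"),
  MPG = c(30, 22, 28),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# TRUE where a column name is already a syntactic R name;
# names containing spaces are not
is_syntactic <- names(cars_toy) == make.names(names(cars_toy))
is_syntactic

# Hypothetical helper: lower-case the names and replace runs of
# non-alphanumeric characters with a single underscore
to_snake <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}
names(cars_toy) <- to_snake(names(cars_toy))

# Convert character columns to factors, as mutate_if(is.character, factor) would
is_chr <- vapply(cars_toy, is.character, logical(1))
cars_toy[is_chr] <- lapply(cars_toy[is_chr], factor)

str(cars_toy)
```

After cleaning, the columns can be referenced in a formula without backticks, for example `log(mpg) ~ .` as in the fit_resamples() call earlier in the thread.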