Closed: cimentadaj closed this issue 3 years ago
The issue is related to how you are passing the gear column in. Keeping it as character gives you different factor levels, since the character is converted to a factor only after the data have been split. So one data set has levels "3" and "4", and then a new value of "5" shows up in the new data.
It happens in some folds and not in others because this is a very small data set and you sometimes sample out a factor level entirely; since resampling is random, the failure is random too.
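A minimal base-R sketch of the mechanism (toy vectors, not the actual resamples):

```r
# Convert to factor *after* splitting: the training data never saw "5"
train_chr <- c("3", "4")
new_chr   <- c("3", "5")

train_fac <- factor(train_chr)  # levels are only "3" and "4"
levels(train_fac)
#> [1] "3" "4"

# A new value of "5" cannot be mapped to the training levels
factor(new_chr, levels = levels(train_fac))
#> [1] 3    <NA>
#> Levels: 3 4
```

Converting to factor before splitting avoids this, because every resample then carries the full set of levels.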
If you convert to factor (instead of character), the problem goes away since the factor is aware of all possible levels from the start:
library(tidymodels)
#> ── Attaching packages ───────────────────────────────────── tidymodels 0.1.1 ──
#> ✓ broom 0.7.0 ✓ recipes 0.1.13
#> ✓ dials 0.0.8.9000 ✓ rsample 0.0.7
#> ✓ dplyr 1.0.1 ✓ tibble 3.0.3
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.1
#> ✓ infer 0.5.2 ✓ tune 0.1.1.9000
#> ✓ modeldata 0.0.2 ✓ workflows 0.1.3
#> ✓ parsnip 0.1.3 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> Warning: package 'recipes' was built under R version 4.0.2
#> Warning: package 'rsample' was built under R version 4.0.2
#> Warning: package 'workflows' was built under R version 4.0.2
#> Warning: package 'yardstick' was built under R version 4.0.2
#> ── Conflicts ──────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
linear_spec <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
mt_split <- initial_split(mtcars[c("mpg", "gear")])
mt_train <- as_tibble(training(mt_split))
mt_fold <-
mt_train %>%
mutate(gear = factor(gear)) %>%
vfold_cv(v = 10)
# Returns results, now with warnings (but no factor-level errors)
linear_spec %>%
fit_resamples(
mpg ~ gear,
resamples = mt_fold
)
#> ! Fold01: internal: A correlation computation is required, but `estimate` is const...
#> ! Fold02: internal: A correlation computation is required, but `estimate` is const...
#> ! Fold04: internal: A correlation computation is required, but `estimate` is const...
#> ! Fold05: internal: A correlation computation is required, but `truth` is constant...
#> ! Fold08: internal: A correlation computation is required, but `estimate` is const...
#> Warning: This tuning result has notes. Example notes on model fitting include:
#> internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
#> internal: A correlation computation is required, but `truth` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
#> internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
#> # Resampling results
#> # 10-fold cross-validation
#> # A tibble: 10 x 4
#> splits id .metrics .notes
#> <list> <chr> <list> <list>
#> 1 <split [21/3]> Fold01 <tibble [2 × 3]> <tibble [1 × 1]>
#> 2 <split [21/3]> Fold02 <tibble [2 × 3]> <tibble [1 × 1]>
#> 3 <split [21/3]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
#> 4 <split [21/3]> Fold04 <tibble [2 × 3]> <tibble [1 × 1]>
#> 5 <split [22/2]> Fold05 <tibble [2 × 3]> <tibble [1 × 1]>
#> 6 <split [22/2]> Fold06 <tibble [2 × 3]> <tibble [0 × 1]>
#> 7 <split [22/2]> Fold07 <tibble [2 × 3]> <tibble [0 × 1]>
#> 8 <split [22/2]> Fold08 <tibble [2 × 3]> <tibble [1 × 1]>
#> 9 <split [22/2]> Fold09 <tibble [2 × 3]> <tibble [0 × 1]>
#> 10 <split [22/2]> Fold10 <tibble [2 × 3]> <tibble [0 × 1]>
Created on 2020-08-17 by the reprex package (v0.3.0)
The new warning that is issued is also due to the small data set: sampling out two of the three levels of gear results in an intercept-only model, which predicts the same value for all samples. Since the R^2 statistic depends on the variance of the predicted values, it ends up dividing by zero (and issuing the warning).
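You can reproduce the underlying numerical issue directly with cor() (toy numbers, assuming R^2 is computed as the squared correlation between truth and estimate):

```r
# An intercept-only model predicts a single value for every sample,
# so the estimate vector is constant and the correlation is undefined
truth    <- c(21.0, 22.8, 19.2)
estimate <- rep(20.3, 3)   # constant predictions

cor(truth, estimate)
#> Warning in cor(truth, estimate): the standard deviation is zero
#> [1] NA
```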
To add to Max's answer:

> This is expected, as character columns are not expanded to dummies

I don't think this is quite right. Character columns are converted to factors and are then expanded. You can see that Country has been expanded in the coefficients below:
suppressPackageStartupMessages({
library(rsample)
library(parsnip)
library(tune)
library(dplyr)
library(modeldata)
})
data(stackoverflow)
linear_spec <-
linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
so_split <- initial_split(stackoverflow[c("Salary", "Country")])
so_train <- training(so_split)
# Convert factor to character
so_fold <-
mutate(so_train, Country = as.character(Country)) %>%
vfold_cv(v = 10)
# Returns results without errors/warnings
mods <- linear_spec %>%
fit_resamples(
Salary ~ Country,
resamples = so_fold,
control = control_resamples(extract = identity)
)
mods$.extracts[[1]]$.extracts[[1]]$fit$fit$fit
#>
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#>
#> Coefficients:
#> (Intercept) CountryGermany CountryIndia
#> 56795 -4576 -45025
#> `CountryUnited Kingdom` `CountryUnited States`
#> -2720 41485
Created on 2020-08-17 by the reprex package (v0.3.0.9001)
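The same expansion happens in base R, which is what the "lm" engine ultimately calls (a toy illustration, not from the thread):

```r
# lm() converts a character predictor to a factor and expands it
# into dummy columns, just like the Country column above
df <- data.frame(
  y = c(1, 2, 3, 4),
  g = c("a", "a", "b", "b")  # character, not factor
)

coef(lm(y ~ g, data = df))
#> (Intercept)          gb
#>         1.5         2.0
```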
The problem
In light of https://github.com/tidymodels/tune/issues/151, I'm trying to resample a model of a continuous variable against a character column without one-hot encoding the character column. I took the stackoverflow example from https://github.com/tidymodels/tune/issues/151 and found that it worked. However, once I replicated the exact same thing for mtcars, it raises an error.
Reproducible example
Here's the example using the stackoverflow data:
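The original code block did not survive here; it presumably follows the same pattern as the working example quoted earlier in the thread, roughly:

```r
# Sketch reconstructed from the answer above, not the issue's exact reprex
suppressPackageStartupMessages({
  library(rsample)
  library(parsnip)
  library(tune)
  library(dplyr)
  library(modeldata)
})

data(stackoverflow)

linear_spec <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

so_split <- initial_split(stackoverflow[c("Salary", "Country")])
so_train <- training(so_split)

# Convert factor to character before resampling
so_fold <-
  mutate(so_train, Country = as.character(Country)) %>%
  vfold_cv(v = 10)

# Runs without errors or warnings
linear_spec %>%
  fit_resamples(Salary ~ Country, resamples = so_fold)
```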
This is expected, as character columns are not expanded to dummies. However, if I replace the above with mtcars, it raises the typical one-hot encoding problem of not finding variables defined in the formula. I assume this is not expected, right? Some thoughts:
From what I've read in https://github.com/tidymodels/workflows/pull/53, https://github.com/tidymodels/parsnip/pull/332 and https://github.com/tidymodels/hardhat/pull/140, one-hot encoding will only happen with factor columns, if it is specified in default_formula_blueprint. I think this shouldn't happen with character columns, as it does now. Surprisingly, the previous error happens in some folds but not in all of them.
Since I know there have been recent merges related to this problem, I installed the latest GitHub versions of parsnip, tune, hardhat and rsample. Here's my session info: