Closed by sda030 1 year ago
Thanks for bringing this up! It may be a good idea to import tidyselect and use proper data-masking here.
Actually, you are already importing rlang (with data masking) and tidyselect through dplyr, so no new dependencies really. And also, yes, data masking (.data[["age"]]) is perhaps more meaningful as you only allow single arguments.
Yeah, not worried about dependency heaviness for this one!
EDIT: Ha, you're ahead of me on tidyselection lifecycle! Didn't realize .data had been deprecated in tidyselection.
As I work through this, noting-to-self a few oddities of the current all.vars() solution to column selection via formula. Note that:
library(rlang)
f_rhs(college ~ "age")
#> [1] "age"
all.vars(f_rhs(college ~ "age"))
#> character(0)
As a result:
library(infer)
# "age" is not symbolic
specify(gss, college ~ "age")
#> Error in `specify()`:
#> ! The explanatory should be a bare variable name (not a string in quotation marks).
#> Backtrace:
#> ▆
#> 1. └─infer::specify(gss, college ~ "age")
#> 2. └─infer:::parse_variables(x, formula, response, explanatory)
#> 3. └─infer:::check_var_correct(x, "explanatory", call = call)
#> 4. └─rlang::abort(...)
# "age" + nonexistent_column is symbolic, but the helper
# can't handle multiple explanatory variables
specify(gss, college ~ "age" + nonexistent_column)
#> Error in `specify()`:
#> ! The explanatory variable `+` cannot be found in this dataframe.
#> • The explanatory variable `age` cannot be found in this dataframe.
#> • The explanatory variable `nonexistent_column` cannot be found in this dataframe.
#> Backtrace:
#> ▆
#> 1. └─infer::specify(gss, college ~ "age" + nonexistent_column)
#> 2. └─infer:::parse_variables(x, formula, response, explanatory)
#> 3. └─infer:::check_var_correct(x, "explanatory", call = call)
#> 4. └─rlang::abort(...)
# doesn't trigger that error, though, with at least one
# valid column name and no invalid symbolics, since
# all.vars(RHS) == "year"
specify(gss, college ~ "age" + year)
#> Response: college (factor)
#> Explanatory: year (numeric)
#> # A tibble: 500 × 2
#> college year
#> <fct> <dbl>
#> 1 degree 2014
#> 2 no degree 1994
#> 3 degree 1998
#> 4 no degree 1996
#> 5 degree 1994
#> 6 no degree 1996
#> 7 no degree 1990
#> 8 degree 2016
#> 9 degree 2000
#> 10 no degree 1998
#> # ℹ 490 more rows
specify(gss, college ~ "age" + year + nonexistent_column)
#> Error in `specify()`:
#> ! The explanatory variable `+` cannot be found in this dataframe.
#> • The explanatory variable `"age" + year` cannot be found in this dataframe.
#> • The explanatory variable `nonexistent_column` cannot be found in this dataframe.
#> Backtrace:
#> ▆
#> 1. └─infer::specify(gss, college ~ "age" + year + nonexistent_column)
#> 2. └─infer:::parse_variables(x, formula, response, explanatory)
#> 3. └─infer:::check_var_correct(x, "explanatory", call = call)
#> 4. └─rlang::abort(...)
Created on 2023-05-24 with reprex v2.0.2
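The three outcomes above follow from how all.vars() treats string constants: it collects only symbols, so a quoted "age" silently drops out of the right-hand side. A minimal base-R sketch of just that step (the `rhs()` helper is a hypothetical stand-in for rlang::f_rhs(), and nothing here calls infer):

```r
# Hypothetical stand-in for rlang::f_rhs() on a two-sided formula
rhs <- function(f) f[[3]]

# A string on the RHS is a constant, not a symbol, so nothing is collected
only_string <- all.vars(rhs(college ~ "age"))

# The string is skipped and only the symbol survives
string_plus_valid <- all.vars(rhs(college ~ "age" + year))

# Both symbols are collected, in order of appearance
string_plus_two <- all.vars(rhs(college ~ "age" + year + nonexistent_column))

print(only_string)        # character(0)
print(string_plus_valid)  # "year"
print(string_plus_two)    # "year" "nonexistent_column"
```

This is why `college ~ "age" + year` looks like a single valid explanatory variable to a check based on all.vars(), while the other forms fail in different ways.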
The only other established functionality for column selection via formula in the tidymodels that I'm aware of is in recipes. It errors (via base R terms()) in all of the above cases:
recipes:::get_rhs_vars(college ~ "age", infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars
recipes:::get_rhs_vars(college ~ "age" + nonexistent_column, infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars
recipes:::get_rhs_vars(college ~ "age" + year, infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars
recipes:::get_rhs_vars(college ~ "age" + year + nonexistent_column, infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars
Created on 2023-05-24 with reprex v2.0.2
This feels to me like a possible argument for not transitioning to tidyselect under the hood in infer, as there's not a well-defined tidyselect procedure for formulas, and trying to write one would either require inconsistency with tidyselect (not allowing strings) or with base R (and thus tidymodels).
After stewing with this for a while longer, I think the possibility of specify()ing via formula indeed means consistent tidyselection with infer is not well-defined. I appreciate you raising this issue, and believe we ought to revisit if at some point there is a proper spec for the interaction between tidyselect and formulae.
It would be nice to allow tidyverse/rlang quosures for the arguments response = and explanatory = in specify(). Currently only bare names are allowed. This would greatly help creation of downstream packages.
Created on 2023-04-22 with reprex v2.0.2
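For downstream packages that hold column names as strings, one possible workaround is to build the formula programmatically with base R's stats::reformulate() and hand the result to specify(). This is only a sketch: the variable names are illustrative, and the final specify() call is left commented out because it assumes specify() accepts a formula stored in a variable.

```r
# Column names as a downstream package might receive them
response <- "college"
explanatory <- "age"

# Build `college ~ age` from the strings; both sides become symbols
f <- stats::reformulate(explanatory, response = response)

print(f)                 # college ~ age
print(all.vars(f[[2]]))  # "college"
print(all.vars(f[[3]]))  # "age"

# Assumption: specify() takes a pre-built formula object as `formula`
# infer::specify(infer::gss, formula = f)
```

Because the strings are converted to symbols before the formula reaches any all.vars()-based check, this sidesteps the quoted-string cases discussed in the thread.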