tidymodels / infer

An R package for tidyverse-friendly statistical inference
https://infer.tidymodels.org

`specify()` not capable of tidyverse programming #495

Closed sda030 closed 1 year ago

sda030 commented 1 year ago

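It would be nice if the response = and explanatory = arguments of specify() accepted tidyverse/rlang quosures and tidyselect helpers. Currently only bare column names work: forwarding them with {{ }} is fine, but a selection such as tidyselect::all_of("age") fails. Supporting this would greatly help the creation of downstream packages.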

library(dplyr)
library(infer)
tmp <- function(response, explanatory) {
  infer::gss %>%
    infer::specify(response = {{ response }}, explanatory = {{ explanatory }})
}
tmp(response = age, explanatory = partyid)
#> Dropping unused factor levels DK from the supplied explanatory variable 'partyid'.
#> Response: age (numeric)
#> Explanatory: partyid (factor)
#> # A tibble: 500 × 2
#>      age partyid
#>    <dbl> <fct>  
#>  1    36 ind    
#>  2    34 rep    
#>  3    24 ind    
...
#> # ℹ 490 more rows

tmp(response = tidyselect::all_of("age"), explanatory = tidyselect::all_of("partyid"))
#> Error in `x[[response_name(x)]]`:
#> ! Can't extract column with `response_name(x)`.
#> ✖ Subscript `response_name(x)` must be size 1, not 2.
#> Backtrace:
#>      ▆
#>   1. ├─global tmp(response = tidyselect::all_of("age"), explanatory = tidyselect::all_of("partyid"))
#>   2. │ └─infer::gss %>% ...
#>   3. └─infer::specify(...)
#>   4.   └─infer:::parse_variables(x, formula, response, explanatory)
#>   5.     └─infer:::response_variable(x)
#>   6.       ├─x[[response_name(x)]]
#>   7.       └─tibble:::`[[.tbl_df`(x, response_name(x))
#>   8.         └─tibble:::tbl_subset2(x, j = i, j_arg = substitute(i))
#>   9.           └─tibble:::vectbl_as_col_location2(...)
#>  10.             ├─tibble:::subclass_col_index_errors(...)
#>  11.             │ └─base::withCallingHandlers(...)
#>  12.             └─vctrs::vec_as_location2(j, n, names, call = call)
#>  13.               └─vctrs:::result_get(...)
#>  14.                 └─rlang::cnd_signal(x$err)

Created on 2023-04-22 with reprex v2.0.2

simonpcouch commented 1 year ago

Thanks for bringing this up! It may be a good idea to import tidyselect and use proper data-masking here.

sda030 commented 1 year ago

Actually, you are already importing rlang (with data masking) and tidyselect through dplyr, so this would not really add any new dependencies. And yes, data masking (.data[["age"]]) is perhaps more meaningful here, since each argument only accepts a single column.

simonpcouch commented 1 year ago

Yeah, not worried about dependency heaviness for this one!

EDIT: Ha, you're ahead of me on tidyselection lifecycle! Didn't realize .data had been deprecated in tidyselection.

simonpcouch commented 1 year ago

As I work through this, a note to self on a few oddities of the current all.vars()-based approach to selecting columns via a formula. Note that:

library(rlang)

f_rhs(college ~ "age")
#> [1] "age"
all.vars(f_rhs(college ~ "age"))
#> character(0)

As a result:

library(infer)

# "age" is not symbolic
specify(gss, college ~ "age")
#> Error in `specify()`:
#> ! The explanatory should be a bare variable name (not a string in quotation marks).
#> Backtrace:
#>     ▆
#>  1. └─infer::specify(gss, college ~ "age")
#>  2.   └─infer:::parse_variables(x, formula, response, explanatory)
#>  3.     └─infer:::check_var_correct(x, "explanatory", call = call)
#>  4.       └─rlang::abort(...)

# the RHS "age" + nonexistent_column is symbolic, but the helper
# can't handle multiple explanatory variables
specify(gss, college ~ "age" + nonexistent_column)
#> Error in `specify()`:
#> ! The explanatory variable `+` cannot be found in this dataframe.
#> • The explanatory variable `age` cannot be found in this dataframe.
#> • The explanatory variable `nonexistent_column` cannot be found in this dataframe.
#> Backtrace:
#>     ▆
#>  1. └─infer::specify(gss, college ~ "age" + nonexistent_column)
#>  2.   └─infer:::parse_variables(x, formula, response, explanatory)
#>  3.     └─infer:::check_var_correct(x, "explanatory", call = call)
#>  4.       └─rlang::abort(...)

# doesn't trigger that error, though, with at least one
# valid column name and no invalid symbolics, since
# all.vars(RHS) == "year"
specify(gss, college ~ "age" + year)
#> Response: college (factor)
#> Explanatory: year (numeric)
#> # A tibble: 500 × 2
#>    college    year
#>    <fct>     <dbl>
#>  1 degree     2014
#>  2 no degree  1994
#>  3 degree     1998
#>  4 no degree  1996
#>  5 degree     1994
#>  6 no degree  1996
#>  7 no degree  1990
#>  8 degree     2016
#>  9 degree     2000
#> 10 no degree  1998
#> # ℹ 490 more rows

specify(gss, college ~ "age" + year + nonexistent_column)
#> Error in `specify()`:
#> ! The explanatory variable `+` cannot be found in this dataframe.
#> • The explanatory variable `"age" + year` cannot be found in this dataframe.
#> • The explanatory variable `nonexistent_column` cannot be found in this dataframe.
#> Backtrace:
#>     ▆
#>  1. └─infer::specify(gss, college ~ "age" + year + nonexistent_column)
#>  2.   └─infer:::parse_variables(x, formula, response, explanatory)
#>  3.     └─infer:::check_var_correct(x, "explanatory", call = call)
#>  4.       └─rlang::abort(...)

Created on 2023-05-24 with reprex v2.0.2
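
The pattern in these results can be traced back to base R alone. As a minimal sketch (no infer or rlang required; the variable names `rhs`, `rhs_names`, and `rhs_vars` are only illustrative), all.vars() collects symbols from an expression and skips string constants, which is why "age" never shows up as a candidate column:

```r
# Base R only; the right-hand side of college ~ "age" + year as a call.
rhs <- quote("age" + year)

# all.names() reports every name in the call, including the `+` function.
rhs_names <- all.names(rhs)
rhs_names
#> [1] "+"    "year"

# all.vars() keeps only variable symbols, so the string constant "age"
# is silently dropped.
rhs_vars <- all.vars(rhs)
rhs_vars
#> [1] "year"

# A lone string contains no symbols at all.
all.vars(quote("age"))
#> character(0)
```

This is consistent with `college ~ "age" + year` being accepted above: the only variable that all.vars() finds on the right-hand side is `year`.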

simonpcouch commented 1 year ago

The only other established functionality for column selection via formula in the tidymodels that I'm aware of is in recipes. It errors (via base R terms()) in all of the above cases:

recipes:::get_rhs_vars(college ~ "age", infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars
recipes:::get_rhs_vars(college ~ "age" + nonexistent_column, infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars
recipes:::get_rhs_vars(college ~ "age" + year, infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars
recipes:::get_rhs_vars(college ~ "age" + year + nonexistent_column, infer::gss)
#> Error in terms.formula(formula, data = data): invalid model formula in ExtractVars

Created on 2023-05-24 with reprex v2.0.2

This feels to me like a possible argument for not transitioning to tidyselect under the hood in infer. There is no well-defined tidyselect procedure for formulas, and trying to write one would require inconsistency either with tidyselect (by not allowing strings) or with base R, and thus tidymodels (by allowing them).
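
The base R half of that tension can be reproduced without recipes installed. The following is a rough sketch, assuming only that the errors above come from stats::terms(), as the "Error in terms.formula" prefix suggests; `try_terms()` is a hypothetical helper written for this illustration, not a function from recipes or infer:

```r
# Hypothetical helper: return the term labels, or the error condition
# if terms() rejects the formula.
try_terms <- function(f) {
  tryCatch(
    attr(terms(f), "term.labels"),
    error = function(e) e
  )
}

# Bare names on the right-hand side are accepted.
ok <- try_terms(college ~ age + year)
ok
#> [1] "age"  "year"

# A string constant on the right-hand side makes terms() error.
bad <- try_terms(college ~ "age" + year)
inherits(bad, "error")
#> [1] TRUE
```

By contrast, tidyselect treats a quoted string as a valid column selection, so a formula interface cannot follow both conventions at once.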

simonpcouch commented 1 year ago

After stewing on this for a while longer, I think the possibility of specify()ing via formula indeed means that consistent tidyselection in infer is not well-defined. I appreciate you raising this issue, and believe we ought to revisit it if at some point there is a proper spec for the interaction between tidyselect and formulae.
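
For downstream code that holds column names as strings, one possible workaround in the meantime is to build the formula programmatically and hand it to the existing formula interface. This is only a sketch under assumptions: `specify_chr()` is a hypothetical wrapper that is not part of infer, it assumes specify() accepts a formula object stored in a variable through its `formula` argument, and it has not been checked against non-syntactic column names (which would need backticks).

```r
# Hypothetical wrapper, not part of infer: takes column names as strings.
specify_chr <- function(data, response, explanatory) {
  # stats::reformulate() builds `response ~ explanatory` from strings.
  f <- stats::reformulate(explanatory, response = response)
  infer::specify(data, formula = f)
}

# The formula construction itself only needs base R.
f <- stats::reformulate("partyid", response = "age")
deparse(f)
#> [1] "age ~ partyid"

# Only attempt the infer call when the package is available.
if (requireNamespace("infer", quietly = TRUE)) {
  specify_chr(infer::gss, response = "age", explanatory = "partyid")
}
```

Because the result is an ordinary formula made of bare names, it sidesteps the string-versus-symbol ambiguity discussed above rather than resolving it.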

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.