tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

Tidyselect support in complete and friends #1397

Open MatthieuStigler opened 2 years ago

MatthieuStigler commented 2 years ago

This is basically re-opening issue https://github.com/tidyverse/tidyr/issues/1032, as suggested by https://github.com/tidyverse/tidyr/issues/1032#issuecomment-1167210911.

The question there was how to pass column names as strings to complete(). Hadley's answer was to use any_of() or all_of(), but that doesn't seem to work, or it is unclear how to use them:

library(tibble)
library(tidyr)

df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)

complete(df, group, item_id, item_name)
#> # A tibble: 12 × 5
#>    group item_id item_name value1 value2
#>    <dbl>   <dbl> <chr>      <dbl>  <int>
#>  1     1       1 a              1      4
#>  2     1       1 b             NA     NA
#>  3     1       2 a             NA     NA
#>  4     1       2 b              3      6
#>  5     1       3 a             NA     NA
#>  6     1       3 b             NA     NA
#>  7     2       1 a             NA     NA
#>  8     2       1 b             NA     NA
#>  9     2       2 a             NA      5
#> 10     2       2 b             NA     NA
#> 11     2       3 a             NA     NA
#> 12     2       3 b              4      7
complete(df, all_of(c("group", "item_id", "item_name")))
#> Error in `dplyr::full_join()`:
#> ! Join columns must be present in data.
#> ✖ Problem with `all_of(c("group", "item_id", "item_name"))`.

#> Backtrace:
#>     ▆
#>  1. ├─tidyr::complete(df, all_of(c("group", "item_id", "item_name")))
#>  2. └─tidyr:::complete.data.frame(...)
#>  3.   ├─dplyr::full_join(out, data, by = names)
#>  4.   └─dplyr:::full_join.data.frame(out, data, by = names)
#>  5.     └─dplyr:::join_mutate(...)
#>  6.       └─dplyr:::join_cols(...)
#>  7.         └─dplyr:::check_join_vars(by$y, y_names, by$condition, keep, error_call = error_call)
#>  8.           └─rlang::abort(bullets, call = error_call)
complete(df, any_of(c("group", "item_id", "item_name")))
#> Error:
#> ! `any_of()` must be used within a *selecting* function.
#> ℹ See <https://tidyselect.r-lib.org/reference/faq-selection-context.html>.

#> Backtrace:
#>      ▆
#>   1. ├─tidyr::complete(df, any_of(c("group", "item_id", "item_name")))
#>   2. ├─tidyr:::complete.data.frame(...)
#>   3. │ ├─tidyr::expand(data, ...)
#>   4. │ └─tidyr:::expand.data.frame(data, ...)
#>   5. │   └─tidyr:::grid_dots(..., `_data` = data)
#>   6. │     └─rlang::eval_tidy(dot, data = mask)
#>   7. └─tidyselect::any_of(c("group", "item_id", "item_name"))
#>   8.   ├─vars %||% peek_vars(fn = "any_of")
#>   9.   └─tidyselect::peek_vars(fn = "any_of")
#>  10.     └─rlang::abort(msg, call = NULL)

Created on 2022-09-22 with reprex v2.0.2

karchern commented 2 years ago

It's a shame and also quite weird that this still isn't fixed.

In the meantime, users can use this:

your_tibble %>%
  complete(!!!rlang::syms(your_vector_with_col_names))
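
As a runnable sketch of this workaround on the `df` from the original report (the vector name `cols` is only a placeholder): `rlang::syms()` converts the strings to symbols, and `!!!` splices them into the call as if they had been typed as bare column names.

```r
library(tibble)
library(tidyr)

df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)

# Column names held as strings (placeholder name, not from the thread)
cols <- c("group", "item_id", "item_name")

# syms() turns each string into a symbol; !!! splices them into the call
res <- complete(df, !!!rlang::syms(cols))

# Compare against naming the columns directly
all.equal(res, complete(df, group, item_id, item_name))
```

This only covers plain column names; tidyselect helpers such as starts_with() are not resolved by this approach.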

hadley commented 2 years ago

I suspect that answer made sense at the time, but expand() has been reimplemented since then (or I was just confused and it never worked). This looks like it will be a little tricky to fix because complete() uses expand(), which uses grid_dots(), which is a data-masking function. Selecting variables in a data-masking function is usually done with across(), but because grid_dots() is totally custom, that doesn't work. Fixing this will require some thought.
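
Until that is resolved, one possible user-side approach is to resolve the selection before it reaches the data mask. The sketch below assumes a hypothetical wrapper, `complete_cols()`, which is not part of tidyr: it evaluates a tidyselect expression with `tidyselect::eval_select()` and then splices the resulting column names into `complete()` as symbols.

```r
library(tibble)
library(tidyr)

# Hypothetical helper (not a tidyr function): resolve a tidyselect
# expression first, then hand plain symbols to complete()
complete_cols <- function(data, cols) {
  # eval_select() returns a named integer vector of column positions
  pos <- tidyselect::eval_select(rlang::enquo(cols), data)
  # Convert the selected names to symbols and splice them in
  complete(data, !!!rlang::syms(names(pos)))
}

df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)

vars <- c("group", "item_id", "item_name")

by_string <- complete_cols(df, all_of(vars))
by_helper <- complete_cols(df, starts_with("item"))
```

The trade-off is that the wrapper accepts only selections, so expressions that complete() normally allows, such as nesting() or full_seq(), would not pass through it.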

MatthieuStigler commented 2 years ago

Thanks for the explanation!

I understand this is not a straightforward fix, but it would be great to be able to use across() and the beautiful tidyselect machinery for complete() and expand()!

DavisVaughan commented 1 year ago

One somewhat promising idea is to make expand() work by evaluating each ... separately using dplyr::reframe(data, quo1), dplyr::reframe(data, quo2), etc, capturing all the results (and extracting out the single columns into actual vectors), and then passing all of the results on to expand_grid() to actually do the expansion.

Since complete() uses expand() then we'd be able to do complete(df, pick(all_of(vars))) and most of the other features of reframe().

We'd lose the ability for the ...s to be evaluated in such a way that dot 2 can access the results of dot 1, but I don't think that is very important (this is what we copied from tibble, probably without really thinking it through).


We'd also make crossing(), nesting(), and expand_grid() all use list2() rather than grid_dots(), so those also wouldn't be able to access previous results, but then we could remove the rather hacky grid_dots() that tries to imitate what tibble() does, which would really simplify the tidyr internals and prevent most of these issues going forward.

Making expand_grid() use list2() would also solve #1394


Since we'd unconditionally need dplyr::reframe(), this has to wait until after dplyr 1.1.0 is out.
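
The reframe()-based idea above can be roughly sketched as follows. This is only an illustration under stated assumptions, not the planned tidyr implementation: `expand_sketch()` is a hypothetical name, it assumes dplyr >= 1.1.0 for reframe() and pick(), and it deduplicates each column with sort(unique()), which may differ from how expand() treats factors and other special types.

```r
library(dplyr)
library(tidyr)

# Hypothetical sketch (not tidyr code): evaluate each dot separately
# with reframe(), pull out the columns, then cross them
expand_sketch <- function(data, ...) {
  quos <- rlang::enquos(...)
  cols <- list()
  for (i in seq_along(quos)) {
    # Each dot is evaluated on its own, so dot 2 cannot see dot 1
    piece <- reframe(data, !!quos[[i]])
    # A dot such as pick() may yield several columns; keep each one
    for (nm in names(piece)) {
      cols[[nm]] <- sort(unique(piece[[nm]]), na.last = TRUE)
    }
  }
  # Cross the unique values of every extracted column
  expand_grid(!!!cols)
}

df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)

vars <- c("group", "item_id", "item_name")

# Selection by string via pick(), as suggested in the comment above
grid <- expand_sketch(df, pick(all_of(vars)))

# A complete()-like result: join the grid back onto the data
completed <- full_join(grid, df, by = vars)
```

Bare column names also work in this sketch, for example `expand_sketch(df, group, item_id)`, because each dot is simply passed to reframe().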

bwiernik commented 1 year ago

Just ran into this and wanted to note that the current documentation is very confusing: it says that expand() and family are data-masking, but they don't work with data-masking functions like across(), and the error message is seemingly contradictory:

Must only be used inside data-masking verbs like `mutate()`, `filter()`, and `group_by()`.