markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/

`pivot_wider()` with `id_cols` of length 0 and `unused_fn` #698

Closed Darxor closed 10 months ago

Darxor commented 1 year ago

Currently specifying character(0) / numeric(0) in id_cols leads to behavior similar to NULL (the default), but also produces a warning:

df <- data.frame(
  a   = LETTERS[1:2],
  b   = LETTERS[3:4],
  val = 1:2
)

df |>
  tidytable::pivot_wider(
    id_cols = character(0),
    names_from = a,
    values_from = val
  )
#> Warning in `[.data.table`(~.df, , `:=`(., NULL)): Column '.' does not exist to
#> remove
#> # A tidytable: 2 × 3
#>   b         A     B
#>   <chr> <int> <int>
#> 1 C         1    NA
#> 2 D        NA     2

Created on 2022-11-29 with reprex v2.0.2

{tidyr} handles it differently. It applies the function passed in the unused_fn argument (the default is to drop the columns) to all columns not mentioned in id_cols, names_from, or values_from, which leads to this:

df <- data.frame(
  a   = LETTERS[1:2],
  b   = LETTERS[3:4],
  val = 1:2
)

df |>
  tidyr::pivot_wider(
    id_cols = character(0),
    names_from = a,
    values_from = val
  )
#> # A tibble: 1 × 2
#>       A     B
#>   <int> <int>
#> 1     1     2

Created on 2022-11-29 with reprex v2.0.2

Another note: with NULL in id_cols, {tidyr} treats all columns not mentioned in names_from or values_from as id_cols, which affects id_expand (currently also not implemented, but something to keep in mind for the future).
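To make the difference between NULL and a zero-length id_cols concrete, here is a minimal base R sketch of the column classification described above. The helper classify_cols() is hypothetical (it is not part of {tidyr} or {tidytable}) and works on plain column names rather than tidyselect expressions.

```r
# Hypothetical helper (not part of tidyr or tidytable): split the columns of a
# data frame into id columns and unused columns for a pivot_wider()-style call.
classify_cols <- function(all_cols, names_from, values_from, id_cols = NULL) {
  # Columns not consumed by names_from / values_from
  remaining <- setdiff(all_cols, c(names_from, values_from))
  if (is.null(id_cols)) {
    # Default: every remaining column identifies a row, so nothing is unused
    id_cols <- remaining
  }
  list(
    id_cols     = id_cols,
    unused_cols = setdiff(remaining, id_cols)
  )
}

cols <- c("a", "b", "val")

# id_cols = NULL: "b" becomes an id column
default_case <- classify_cols(cols, "a", "val")

# id_cols = character(0): there are no id columns, so "b" is unused
empty_case <- classify_cols(cols, "a", "val", id_cols = character(0))
```

With the default, b ends up in id_cols and is kept as a row identifier; with character(0), b lands in unused_cols and is then either dropped or passed to unused_fn.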

I can tackle this if you want, though this may lead to a rather big change in how pivot_wider() is coded 😅

markfairbanks commented 1 year ago

Interesting - I wasn't aware of this functionality.

> I can tackle this if you want, though this may lead to a rather big change in how pivot_wider() is coded

Yeah if you want to take a shot at this feel free to.

I think this is the general approach we need. But let me know if I'm missing anything.

Also - this is sort of pseudo-code below, I don't know if this is exactly how it will work.

Once unused_cols are identified there are two parts:

1. If is.null(unused_fn), the columns are dropped from the data frame pre-pivoting (relatively straightforward)
2. If !is.null(unused_fn), we need to aggregate the unused_cols with the unused_fn

For part 2 - wouldn't we basically just need something like this?

(Probably with a conditional if statement)

unused_df <- select(.df, all_of(id_cols), all_of(unused_cols))
unused_df <- summarize(unused_df, across(all_of(unused_cols), unused_fn), .by = all_of(id_cols))

out <- bind_cols(out, unused_df)

# And maybe a relocate step to have the correct column order? Or maybe select?
# out <- select(out, all_of(id_cols), all_of(unused_cols), everything())
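For illustration only, the two parts can also be sketched in base R without any tidytable verbs. This is not how tidytable implements it: summarise_unused() is a hypothetical helper, and it assumes id_cols and unused_cols have already been resolved to character vectors of column names.

```r
# Hypothetical helper (not tidytable's actual code): summarise the unused
# columns once per id group, or return NULL when they should be dropped.
summarise_unused <- function(df, id_cols, unused_cols, unused_fn = NULL) {
  if (is.null(unused_fn) || length(unused_cols) == 0) {
    return(NULL)  # part 1: the unused columns are simply dropped
  }
  if (length(id_cols) == 0) {
    # No id columns: the whole data frame forms a single group
    out <- lapply(df[unused_cols], unused_fn)
    return(as.data.frame(out, stringsAsFactors = FALSE))
  }
  # part 2: one summary row per combination of the id columns
  aggregate(df[unused_cols], by = df[id_cols], FUN = unused_fn)
}

df <- data.frame(
  a   = LETTERS[1:2],
  b   = LETTERS[3:4],
  val = 1:2,
  stringsAsFactors = FALSE
)

dropped <- summarise_unused(df, character(0), "b")                   # NULL
kept    <- summarise_unused(df, character(0), "b", unused_fn = max)  # one row
by_a    <- summarise_unused(df, "a", "b", unused_fn = max)           # one row per value of a
```

The zero-length id_cols branch is the case from this issue: there is nothing to group by, so each unused column collapses to a single value that can sit alongside the one-row pivoted result.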

Darxor commented 1 year ago

> Interesting - I wasn't aware of this functionality.

I think my approach with tidyverse sometimes relies on lesser-known functionality, as you can see from my issues, haha.

I will look into this, yeah! That approach sounds about right. I will have to think about how to identify id_cols and unused_cols correctly (the order of operations concerns me a bit). I've run some tests from {tidyr} that seem related to this, and they currently fail.

BTW, I think it's also a good idea to port some tests over from tidyverse and look for missing / non-matching / broken functionality between packages.

markfairbanks commented 10 months ago

All set - sorry this one took so long to get to!

library(tidytable)

df <- data.frame(
  a   = LETTERS[1:2],
  b   = LETTERS[3:4],
  val = 1:2
)

df %>%
  pivot_wider(
    id_cols = character(0),
    names_from = a,
    values_from = val
  )
#> # A tidytable: 1 × 2
#>       A     B
#>   <int> <int>
#> 1     1     2