markfairbanks / tidytable

Tidy interface to 'data.table'
https://markfairbanks.github.io/tidytable/
Other
450 stars 32 forks source link

Output of pivot_wider is sorted even when names_sort = FALSE #705

Closed AltfunsMA closed 1 year ago

AltfunsMA commented 1 year ago

I would expect the tidytable version to leave new columns and the unmodified keys in the order it found them, like the tidyr version.

df <- data.frame(key = sample(LETTERS, size = length(LETTERS)),
                 other_key = sample(letters, size = length(LETTERS)),
                 value = sample(1:10, size = length(LETTERS), replace = T))

tidyr::pivot_wider(df, names_from = other_key, values_from = value)
#> # A tibble: 26 x 27
#>    key       n     v     r     g     c     j     k     w     t     i     f     q
#>    <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1 Y         7    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  2 D        NA     2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  3 G        NA    NA     2    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  4 P        NA    NA    NA     4    NA    NA    NA    NA    NA    NA    NA    NA
#>  5 V        NA    NA    NA    NA     6    NA    NA    NA    NA    NA    NA    NA
#>  6 W        NA    NA    NA    NA    NA     9    NA    NA    NA    NA    NA    NA
#>  7 B        NA    NA    NA    NA    NA    NA    10    NA    NA    NA    NA    NA
#>  8 S        NA    NA    NA    NA    NA    NA    NA     9    NA    NA    NA    NA
#>  9 T        NA    NA    NA    NA    NA    NA    NA    NA     4    NA    NA    NA
#> 10 M        NA    NA    NA    NA    NA    NA    NA    NA    NA     7    NA    NA
#> # ... with 16 more rows, and 14 more variables: p <int>, m <int>, h <int>,
#> #   y <int>, l <int>, s <int>, a <int>, x <int>, z <int>, b <int>, o <int>,
#> #   e <int>, d <int>, u <int>

tidytable::pivot_wider(df, names_from = other_key, values_from = value)
#> # A tidytable: 26 x 27
#>    key       a     b     c     d     e     f     g     h     i     j     k     l
#>    <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#>  1 A        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  2 B        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    10    NA
#>  3 C        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  4 D        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  5 E        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  6 F        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  7 G        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  8 H        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#>  9 I        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
#> 10 J        NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA     7
#> # ... with 16 more rows, and 14 more variables: m <int>, n <int>, o <int>,
#> #   p <int>, q <int>, r <int>, s <int>, t <int>, u <int>, v <int>, w <int>,
#> #   x <int>, y <int>, z <int>
markfairbanks commented 1 year ago

Thanks for catching this, I'll take a look.

markfairbanks commented 1 year ago

Smaller reprex:

pacman::p_load(tidytable)

df <- tidytable(id = 1, names = c("b", "c", "a"), values = c(2, 3, 1))

df %>%
  pivot_wider(names_from = names, values_from = values)
#> # A tidytable: 1 × 4
#>      id     a     b     c
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     2     3
markfairbanks commented 1 year ago

@AltfunsMA this is now fixed.

FYI the "key" column will always be sorted due to how data.table handles pivoting.

However the order of the resulting columns will now be preserved.

pacman::p_load(tidytable)
set.seed(123)

df <- data.frame(key = sample(letters[1:3], 3),
                 other_key = sample(letters[1:3], 3),
                 value = sample(1:3, 3))

tidyr::pivot_wider(df, names_from = other_key, values_from = value)
#> # A tibble: 3 × 4
#>   key       b     a     c
#>   <chr> <int> <int> <int>
#> 1 c         2    NA    NA
#> 2 a        NA     3    NA
#> 3 b        NA    NA     1

pivot_wider(df, names_from = other_key, values_from = value)
#> # A tidytable: 3 × 4
#>   key       b     a     c
#>   <chr> <int> <int> <int>
#> 1 a        NA     3    NA
#> 2 b        NA    NA     1
#> 3 c         2    NA    NA
AltfunsMA commented 1 year ago

Thanks @markfairbanks! I'm using the following workaround for this particular case as I require the "key" column to be in the original order as well. If you've got a better idea, please do let me know!

pacman::p_load(tidytable) # Prior to fix in this issue

set.seed(123)

df <- data.frame(key = sample(LETTERS, size = 5),
                 other_key = sample(letters, size = 5),
                 value = sample(1:10, size = 5))

ord <- select(df, key)

pivoted <- pivot_wider(df, names_from = other_key, values_from = value)

left_join(ord, pivoted)

#> # A tidytable: 5 x 6
#>   key       e     k     r     t     v
#>   <chr> <int> <int> <int> <int> <int>
#> 1 O        NA    NA     6    NA    NA
#> 2 S        NA    NA    NA    NA     9
#> 3 N        NA    10    NA    NA    NA
#> 4 C         5    NA    NA    NA    NA
#> 5 J        NA    NA    NA     3    NA

tidyr::pivot_wider(df, names_from = other_key, values_from = value)

#> # A tibble: 5 x 6
#>   key       r     v     k     e     t
#>   <chr> <int> <int> <int> <int> <int>
#> 1 O         6    NA    NA    NA    NA
#> 2 S        NA     9    NA    NA    NA
#> 3 N        NA    NA    10    NA    NA
#> 4 C        NA    NA    NA     5    NA
#> 5 J        NA    NA    NA    NA     3
markfairbanks commented 1 year ago

One way would be to convert "key" to a factor using factor(key, levels = unique(key)). Afterwards you could convert it back to character using as.character(key) if you needed.

It should be much faster than using a left_join() on large datasets.

pacman::p_load(tidytable)
set.seed(123)

df <- data.frame(key = sample(letters[1:3], 3),
                 other_key = sample(letters[1:3], 3),
                 value = sample(1:3, 3))

df %>%
  mutate(key = factor(key, unique(key))) %>%
  pivot_wider(names_from = other_key, values_from = value)
#> # A tidytable: 3 × 4
#>   key       b     a     c
#>   <fct> <int> <int> <int>
#> 1 c         2    NA    NA
#> 2 a        NA     3    NA
#> 3 b        NA    NA     1

FYI you could also use vctrs::vec_cast(key, factor()) which assumes you want to keep the levels in order.