moodymudskipper opened 3 years ago
Reshaping has been done through various packages, the best known being {stats} (in base R), {reshape2}, and {tidyr}. {data.table} used to rely on {reshape2} and now has its own implementations of melt() and dcast(), which behave slightly differently, but not differently enough to deserve their own item here.
There are 2 main types of reshaping: reshaping to wide and reshaping to long.

In our source table we have 4 types of columns: id columns, key columns, value columns, and ignored columns.

The new columns will require names; in all existing solutions these are built from the values of the key columns.

Finally, we might need a way to aggregate the values when the combinations of id and key columns are not unique.

Because of this large number of parameters, all approaches use default values to reduce verbosity. Additionally, starting from a given table we can deduce the columns of any 4th type knowing the other 3, and if we postulate that we have no ignored columns we only need to give 2; all solutions use this information in different ways.

Other differences are:
**pivot to wider**

| function | id columns | key columns | value column |
|---|---|---|---|
| unstack() | | ~ rhs | lhs ~ |
| reshape() | idvar | timevar | v.names |
| dcast() | lhs ~ | ~ rhs | value.var |
| spread() | | key | value |
| pivot_wider() | id_cols | names_from | values_from |
When reshaping to long, the key and value columns don't exist yet:
**pivot to longer**

| function | id columns | key columns | value column | varying columns |
|---|---|---|---|---|
| stack() | | | | select |
| reshape() | idvar | timevar | v.names | varying |
| melt() | id.vars | variable.name | value.name | measure.vars |
| gather() | | key | value | ... |
| pivot_longer() | (complement of cols) | names_to | values_to | cols |
For reshape(), in order to have the column names in the key column we must also use times = names(.)[varying] (replacing varying by the value fed to the varying argument).
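To make that concrete, here's a sketch on a toy table (my own, not from the issue) of how reshape() keeps the original column names as key values via times:

```r
# Toy example (for illustration only): reshape to long while keeping
# the original column names ("x1", "x2") as the values of the key column.
df <- data.frame(id = 1:2, x1 = c(10, 20), x2 = c(30, 40))
varying <- 2:3  # positions of the varying columns
reshape(df,
  direction = "long",
  idvar     = "id",
  varying   = varying,
  times     = names(df)[varying],  # i.e. times = names(.)[varying]
  timevar   = "key",
  v.names   = "value"
)
```

Here times copies the column names verbatim into the key column; without it, reshape() would instead try to split the names itself.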
## Pivoting to wider
So, taking the simplest example from pivot_wider()'s doc:
```r
fish_encounters
#> # A tibble: 114 x 3
#> fish station seen
#> <fct> <fct> <int>
#> 1 4842 Release 1
#> 2 4842 I80_1 1
#> 3 4842 Lisbon 1
#> 4 4842 Rstr 1
#> 5 4842 Base_TD 1
#> 6 4842 BCE 1
#> 7 4842 BCW 1
#> 8 4842 BCE2 1
#> 9 4842 BCW2 1
#> 10 4842 MAE 1
#> # ... with 104 more rows

fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 NA NA NA NA NA NA
```
We can write :
```r
fish_encounters %.% {
  { "{station}" = seen } ~ fish
}
```
A few observations :

- Contrary to the pivot_wider() solution, fish is implicit; we could also have `{ "{station}" = seen } ~ (unused)` (I don't like `{ "{station}" = seen } ~ .` because the dot is already used a lot for other things).
- pivot_wider() has values_fill to replace the NAs; I don't know how to do that here.
- We can edit the new names on the fly, e.g. `"station_{station}" = ...` (as we do below).
- We can build names from several key columns, e.g. `"spread_{col1}_{col2}" = ...`, so there's no need to unite() beforehand.

Let's see how we would handle the multi spread:
```r
us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))
#> # A tibble: 52 x 6
#> GEOID NAME estimate_income estimate_rent moe_income moe_rent
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 747 136 3
#> 2 02 Alaska 32940 1200 508 13
#> 3 04 Arizona 27517 972 148 4
#> 4 05 Arkansas 23789 709 165 5
#> 5 06 California 29454 1358 109 3
#> 6 08 Colorado 32401 1125 109 5
#> 7 09 Connecticut 35326 1123 195 5
#> 8 10 Delaware 31560 1076 247 10
#> 9 11 District of Columbia 43198 1424 681 17
#> 10 12 Florida 25952 1077 70 3
#> # ... with 42 more rows
```
```r
us_rent_income %.% {
  {
    "estimate_{variable}" = estimate
    "moe_{variable}" = moe
  } ~ (unused)
}
```
Our proposal is less general than pivot_wider() here, but I don't think there are many cases where one will want to spread many value columns. If we need it we could allow a data frame on the rhs, in which case we can do the following (though it would mean we can't spread a data frame column, which is another strange/rare thing to do):
```r
us_rent_income %.% {
  { "{.value}_{variable}" = tibble(estimate, moe) } ~ (unused)
  # or using our `?selection`
  { "{.value}_{variable}" = ?c("estimate", "moe") } ~ (unused)
}
```
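As an aside (for comparison, not part of the proposal): data.table's dcast() already handles several value columns in one call, producing the same estimate_*/moe_* names, assuming tidyr's us_rent_income dataset is available:

```r
library(data.table)

# Multi-value spread in data.table: value.var takes a character vector,
# and output columns are named <value.var>_<key value>.
us_rent_income_dt <- as.data.table(tidyr::us_rent_income)
dcast(us_rent_income_dt, GEOID + NAME ~ variable,
      value.var = c("estimate", "moe"))
```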
Final example :
```r
warpbreaks <- as_tibble(warpbreaks[c("wool", "tension", "breaks")])
warpbreaks
#> # A tibble: 54 x 3
#> wool tension breaks
#> <fct> <fct> <dbl>
#> 1 A L 26
#> 2 A L 30
#> 3 A L 54
#> 4 A L 25
#> 5 A L 70
#> 6 A L 52
#> 7 A L 51
#> 8 A L 26
#> 9 A L 67
#> 10 A M 18
#> # ... with 44 more rows
```
```r
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = list(breaks = mean)
  )
#> # A tibble: 3 x 3
#> tension A B
#> <fct> <dbl> <dbl>
#> 1 L 44.6 28.2
#> 2 M 24 28.8
#> 3 H 24.6 18.8
```
Here our solution is very straightforward :
```r
warpbreaks %.% {
  { "{wool}" = mean(breaks) } ~ tension
}
```
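For comparison, the same aggregating spread can be written today with data.table's dcast(), where fun.aggregate plays the role of mean(breaks) above:

```r
library(data.table)

# tension ~ wool: tension stays as id, wool values become columns,
# and duplicate (tension, wool) cells are collapsed with mean().
dcast(as.data.table(warpbreaks), tension ~ wool,
      value.var = "breaks", fun.aggregate = mean)
```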
With {tidyr} I think we cannot do the following without copying the breaks column first:
```r
warpbreaks %.% {
  {
    "mean_{wool}" = mean(breaks)
    "median_{wool}" = median(breaks)
  } ~ tension
  # or { "{.value}_{wool}" = tibble(mean = mean(breaks), median = median(breaks)) } ~ tension
}
```
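For what it's worth, data.table's dcast() does accept a list of aggregation functions, which gives several aggregates without copying the breaks column first (the output names come out as breaks_mean_A etc. rather than mean_A):

```r
library(data.table)

# Several aggregates at once: one output column per function x key value.
dcast(as.data.table(warpbreaks), tension ~ wool,
      value.var = "breaks", fun.aggregate = list(mean, median))
```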
I looked at all the examples from reshape(), dcast() and spread() too, and I believe the solution proposed above can do everything apart from the following:

- Filling NAs: I don't have a clear idea of how to do it, and it's a useful feature, so I must keep thinking. Here's a crazy idea, using a pseudo op `=fill:<value>=`:

  ```r
  "mean_{wool}" =fill:0= mean(breaks)
  ```

- Margins: these are not an issue specific to reshaping but to aggregation; the tidyverse won't work on this because margins are not tidy (one row should be one observation). But they're still sometimes useful; to be treated in another issue.
## Pivoting to longer
We need: id columns, key columns, value columns and varying columns.

That's one more type than for pivoting to wider, and it makes the challenge much harder.

The simplest way is to use stack() (if we want to stay in base R):
```r
relig_income %>%
  pivot_longer(
    -religion,
    names_to = "income",
    values_to = "count")
```

becomes:

```r
relig_income %.% {
  stack() ~ religion
}
```
And then we rename our columns, because stack() doesn't support custom names.

Note that we can summarize to more than one row per group; dplyr::summarize() allows it too (since v1.0). It's unlikely that one would mix this operation with other operations, but it would be supported.
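Spelled out in plain base R (a sketch assuming tidyr's relig_income is available), the stack()-then-rename dance looks like this:

```r
# stack() concatenates the value columns and records their names in `ind`,
# so the id column must be repeated once per stacked column by hand.
df <- as.data.frame(tidyr::relig_income)
long <- cbind(
  religion = rep(df$religion, times = ncol(df) - 1),
  stack(df[-1])
)
# stack() imposes the names "values" and "ind"; rename after the fact.
names(long)[names(long) == "values"] <- "count"
names(long)[names(long) == "ind"]    <- "income"
```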
This is not satisfactory because pivot_longer() has much fancier features, and unfortunately, the fancier they get, the harder they are to use and understand. So it'd be nice to have something more intuitive.
Let's try to rewrite the above with a candidate syntax :
```r
relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}
```
We solve the problem of having one more type of column by using 2 names on the lhs. I believe the order is intuitive: first the labels, then the values. `?(".*")` ensures we consider all (non-grouping) columns.

Next example:
```r
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",
    values_to = "rank",
    values_drop_na = TRUE
  )
#> # A tibble: 5,307 x 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <chr> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
#> 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
#> 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
#> 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
#> 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
#> # ... with 5,297 more rows
```
The issue here is that we need to spot a column prefix and remove it from the key values. We can get close with the following, using the previous syntax, but we'll have to rename downstream and filter out the NAs. The output is still arguably more readable, though I would have liked to do the renaming in the gathering operation.
```r
billboard %.% {
  { c("week", "rank") = ?("^wk") } ~ (unused)
  week = gsub("^wk", "", week)
  subset(!is.na(rank))
}
```
We can use regexes and capture groups, and we could support named capture groups in case of ambiguity:
```r
billboard %.% {
  { c("week", "rank") = ?("^wk(.*)") } ~ (unused)
  subset(!is.na(rank))
}
```
A difference is that all columns matching the regex are kept, while in principle the `cols` argument could impose a subset first. In practice I think this will virtually never be an issue. If we had a good syntax for selection intersection we could leverage it here.
Next example:
```r
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65,
    names_to = c("diagnosis", "gender", "age"),
    names_pattern = "new_?(.*)_(.)(.*)",
    values_to = "count"
  )
#> # A tibble: 405,440 x 8
#> country iso2 iso3 year diagnosis gender age count
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan AF AFG 1980 sp m 014 NA
#> 2 Afghanistan AF AFG 1980 sp m 1524 NA
#> 3 Afghanistan AF AFG 1980 sp m 2534 NA
#> 4 Afghanistan AF AFG 1980 sp m 3544 NA
#> 5 Afghanistan AF AFG 1980 sp m 4554 NA
#> 6 Afghanistan AF AFG 1980 sp m 5564 NA
#> 7 Afghanistan AF AFG 1980 sp m 65 NA
#> 8 Afghanistan AF AFG 1980 sp f 014 NA
#> 9 Afghanistan AF AFG 1980 sp f 1524 NA
#> 10 Afghanistan AF AFG 1980 sp f 2534 NA
#> # ... with 405,430 more rows
```
Here we are going to create several key columns; it's quite straightforward following our last example:
```r
who %.% {
  {
    c("diagnosis", "gender", "age", "count") = ?("^new_(.*)_(.)(.*)")
  } ~ (unused)
}
```
`cols = new_sp_m014:newrel_f65` here means: take everything that satisfies the pattern, so it's actually a feature not to have to type it. If the selection were more restrictive, we'd have to select out the columns in a previous step.
Last example. This one is tricky because we want several value columns; pivot_longer() itself is not easy to understand here:
```r
anscombe
#> x1 x2 x3 x4 y1 y2 y3 y4
#> 1 10 10 10 8 8.04 9.14 7.46 6.58
#> 2 8 8 8 8 6.95 8.14 6.77 5.76
#> 3 13 13 13 8 7.58 8.74 12.74 7.71
#> 4 9 9 9 8 8.81 8.77 7.11 8.84
#> 5 11 11 11 8 8.33 9.26 7.81 8.47
#> 6 14 14 14 8 9.96 8.10 8.84 7.04
#> 7 6 6 6 8 7.24 6.13 6.08 5.25
#> 8 4 4 4 19 4.26 3.10 5.39 12.50
#> 9 12 12 12 8 10.84 9.13 8.15 5.56
#> 10 7 7 7 8 4.82 7.26 6.42 7.91
#> 11 5 5 5 8 5.68 4.74 5.73 6.89

anscombe %>%
  pivot_longer(everything(),
    names_to = c(".value", "set"),
    names_pattern = "(.)(.)"
  )
#> # A tibble: 44 x 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 2 10 9.14
#> 3 3 10 7.46
#> 4 4 8 6.58
#> 5 1 8 6.95
#> 6 2 8 8.14
#> 7 3 8 6.77
#> 8 4 8 5.76
#> 9 1 13 7.58
#> 10 2 13 8.74
#> # ... with 34 more rows
```
We can borrow the idea and do :
```r
anscombe %.% {
  {
    c("set", ".value") = ?("(?<.value>.)(.)")
  } ~ (none)
}
```
I believe we're better off imposing that the last name on the lhs be the value column(s), which means we have to use named capture to solve this one. But we don't have to name all the groups: it can work like named arguments to functions, where we go by position after matching names.
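As a point of comparison, data.table's melt() solves this same anscombe example with named patterns(), which is close in spirit to the named-capture idea:

```r
library(data.table)

# Named patterns(): columns matching "^x" feed the x value column,
# those matching "^y" feed y; the remaining index becomes `set`.
melt(as.data.table(anscombe),
     measure.vars  = patterns(x = "^x", y = "^y"),
     variable.name = "set")
```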
## Some ambiguity remains
This looks like a summarizing call, and I'll have to inspect the symbols to know it's a pivot op:
```r
fish_encounters %.% {
  { "{station}" = seen } ~ fish
}
```
Although strings are highlighted and it would look special, so maybe it's not that bad.
And this is weird too :
```r
relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}
```
Here it's not ambiguous because the lhs is not a symbol, nor a call to `?`.

Now, would we understand right away that the above are pivot operations? It would be nice to tag them in some way.
We often click on [+] to expand and on [-] to collapse, so maybe we could use those ?
```r
fish_encounters %.% {
  { "{station}" = -- seen } ~ fish
}

relig_income %.% {
  { c("income", "count") = ++ ?(".*") } ~ religion
}
```
Maybe better:
```r
fish_encounters %.% {
  { -- "{station}" = seen } ~ fish
}

relig_income %.% {
  { ++ c("income", "count") = ?(".*") } ~ religion
}
```
Or we could use a different symbol to signal that we're pivoting, but we're a bit short on those, and this doesn't look so good:
```r
fish_encounters %.% {
  { "{station}" := seen } ~ fish
}

relig_income %.% {
  { c("income", "count") := ?(".*") } ~ religion
}
```
Or some infix, used for syntax only, or containing the actual logic :
```r
fish_encounters %.% {
  { "{station}" %=wider% seen } ~ fish
}

relig_income %.% {
  { c("income", "count") %=longer% ?(".*") } ~ religion
}
```
They don't look really good though, and I don't like using `%%` ops for assignment.
A couple of months later: I think that to pivot wider and longer respectively, these are fine:
```r
fish_encounters %.% {
  { "{station}" = seen } ~ fish
}

relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}
```
I still think fill is needed for pivoting to wider though. What about:
```r
fish_encounters %.% {
  { "{station}" ~ 0 = seen } ~ fish
}
```
This has the fill value on the lhs, where we expect only names, so not very intuitive.
or
```r
fish_encounters %.% {
  { "{station}" = seen ~ 0 } ~ fish
}
```
It uses the mutate-by syntax inside a summarizing call, which is confusing.
or
```r
fish_encounters %.% {
  { "{station}, fill = 0" = seen } ~ fish
}
```
This looks clunky for sure, but is readable enough. Can be shortened :
```r
fish_encounters %.% {
  { "{station}/0" = seen } ~ fish
}
```
We can use '' (simple quotes) for strings inside the spec; not the best, but maybe the best we can do.
To be compared with
```r
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)
```
## More on the "fill" syntax
I think reusing `~` creates unnecessary confusion; we have other operators, and I think `/` and `:` look the best here:
```r
fish_encounters %.% {
  { "{station}" / 0 = seen } ~ fish
  { "{station}" : 0 = seen } ~ fish
}
```
We could also have a special pseudo-operator, or consider `|` special in this position (might backfire):
```r
fish_encounters %.% {
  { "{station}" = seen %fill% 0 } ~ fish
  { "{station}" = seen | 0 } ~ fish
  { "{station}" = seen = 0 } ~ fish
  { "{station}" = seen ? 0 } ~ fish
  { "{station}" = seen := 0 } ~ fish
}
```
There's also the option to use `()` or `[]`:
```r
fish_encounters %.% {
  { "{station}" (0) = seen } ~ fish
  { "{station}" [0] = seen } ~ fish
}
```
A crazy idea that would solve the ambiguity: we use `:=`, and for fill, we insert something between the `:` and the `=`:
```r
fish_encounters %.% {
  { "{station}" := seen } ~ fish  # no fill
  { "{station}" :0= seen } ~ fish # fill with 0
}
```
We'd use `:=` to pivot to longer too.
It's part of #29.
I want to find a syntax that sees pivoting as a grouped operation and supports easier multi-spread or multi-gather ops.