moodymudskipper / nakedpipe

Pipe Into a Sequence of Calls Without Repeating the Pipe Symbol.

pivoting #28

Open moodymudskipper opened 3 years ago

moodymudskipper commented 3 years ago

It's part of #29.

I want to find a syntax that sees pivoting as a grouped operation and supports easier multi-spread or multi-gather ops.

moodymudskipper commented 3 years ago

Reshaping Theory

Reshaping has been done through various packages, the best known being {stats} (in base R), {reshape2}, and {tidyr}. {data.table} used to rely on {reshape2} and now has its own implementations of melt() and dcast(), which behave slightly differently, but not differently enough to warrant a separate item here.

There are 2 main types of reshaping: reshaping to wide and reshaping to long.

reshaping wide

In our source table we have 4 types of columns:
- id columns, which identify the rows of the output
- key columns, whose values will provide the names of the new columns
- value columns, whose values will fill the new columns
- ignored columns

The new columns require names; in all existing solutions these names are built from the values of the key columns.

Finally, we might need a way to aggregate the values when the combinations of id and key columns are not unique.

Because of this large number of parameters, all approaches use default values to reduce verbosity. Additionally, starting from a given table we can deduce the columns of the 4th type knowing the other 3, and if we postulate that there are no ignored columns we just need to give 2; all solutions use this information in different ways.

Other differences are :

| pivot to wider | id columns | key columns | value column |
| --- | --- | --- | --- |
| unstack() | | rhs of ~ | lhs of ~ |
| reshape() | idvar | timevar | v.names |
| dcast() | lhs of ~ | rhs of ~ | value.var |
| spread() | | key | value |
| pivot_wider() | id_cols | names_from | values_from |
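
To make the argument mapping concrete, here is a minimal sketch (hypothetical toy data, not from this thread) doing the same wide pivot with reshape() and pivot_wider():

df <- data.frame(
  id  = c(1, 1, 2, 2),
  key = c("a", "b", "a", "b"),
  val = c(10, 20, 30, 40)
)

# base R: idvar / timevar / v.names
reshape(df, direction = "wide", idvar = "id", timevar = "key", v.names = "val")

# tidyr: id_cols / names_from / values_from
tidyr::pivot_wider(df, id_cols = id, names_from = key, values_from = val)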

reshaping long

When reshaping long, the key and value columns don't exist yet.

| pivot to longer | id columns | key columns | value column | varying columns |
| --- | --- | --- | --- | --- |
| stack() | | | | select |
| reshape() | idvar | timevar | v.names | varying |
| melt() | id.vars | variable.name | value.name | measure.vars |
| gather() | | key | value | ... |
| pivot_longer() | | names_to | values_to | cols |

For reshape(), in order to have the original column names as values of the key column we must also use times = names(.)[varying] (replacing varying with the value fed to the varying argument).
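
And a similar sketch for the long direction (hypothetical data again), including the times trick mentioned just above:

df_wide <- data.frame(id = 1:2, x1 = c(10, 20), x2 = c(30, 40))

# base R: without `times` the key column would contain 1, 2, ... instead of "x1", "x2"
reshape(
  df_wide, direction = "long",
  idvar = "id", varying = c("x1", "x2"),
  timevar = "key", times = c("x1", "x2"), v.names = "value"
)

# tidyr: cols / names_to / values_to
tidyr::pivot_longer(df_wide, cols = c(x1, x2), names_to = "key", values_to = "value")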

moodymudskipper commented 3 years ago

Pivoting to wider

so, taking the simplest example from pivot_wider()'s doc :

fish_encounters
#> # A tibble: 114 x 3
#>    fish  station  seen
#>    <fct> <fct>   <int>
#>  1 4842  Release     1
#>  2 4842  I80_1       1
#>  3 4842  Lisbon      1
#>  4 4842  Rstr        1
#>  5 4842  Base_TD     1
#>  6 4842  BCE         1
#>  7 4842  BCW         1
#>  8 4842  BCE2        1
#>  9 4842  BCW2        1
#> 10 4842  MAE         1
#> # ... with 104 more rows
fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 x 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1     1
#>  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA

We can write :

fish_encounters %.% {
  { "{station}" = seen } ~ fish
}

A few observations :

Let's see how we would handle the multi spread :

us_rent_income %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe))
#> # A tibble: 52 x 6
#>    GEOID NAME                 estimate_income estimate_rent moe_income moe_rent
#>    <chr> <chr>                          <dbl>         <dbl>      <dbl>    <dbl>
#>  1 01    Alabama                        24476           747        136        3
#>  2 02    Alaska                         32940          1200        508       13
#>  3 04    Arizona                        27517           972        148        4
#>  4 05    Arkansas                       23789           709        165        5
#>  5 06    California                     29454          1358        109        3
#>  6 08    Colorado                       32401          1125        109        5
#>  7 09    Connecticut                    35326          1123        195        5
#>  8 10    Delaware                       31560          1076        247       10
#>  9 11    District of Columbia           43198          1424        681       17
#> 10 12    Florida                        25952          1077         70        3
#> # ... with 42 more rows
us_rent_income %.% {
  {
    "estimate_{variable}" = estimate
    "estimate_{moe}" = moe
  } ~ (unused)
}

Our proposal is less general than pivot_wider() here, but I don't think there are many cases where one will want to spread several value columns. If we needed it, we could allow a data.frame on the rhs, in which case we could do the following (though it would mean we can't spread an actual data.frame column, which is another strange/rare thing to do):

us_rent_income %.% {
  { "{.value}_{variable}" = tibble(estimate, moe)   } ~ (unused)
  # or using our  `?selection`
  { "{.value}_{variable}" = ?c("estimate", "moe")   } ~ (unused)
}

Final example :

warpbreaks <- as_tibble(warpbreaks[c("wool", "tension", "breaks")])
warpbreaks
#> # A tibble: 54 x 3
#>    wool  tension breaks
#>    <fct> <fct>    <dbl>
#>  1 A     L           26
#>  2 A     L           30
#>  3 A     L           54
#>  4 A     L           25
#>  5 A     L           70
#>  6 A     L           52
#>  7 A     L           51
#>  8 A     L           26
#>  9 A     L           67
#> 10 A     M           18
#> # ... with 44 more rows
warpbreaks %>%
  pivot_wider(
    names_from = wool,
    values_from = breaks,
    values_fn = list(breaks = mean)
  )
#> # A tibble: 3 x 3
#>   tension     A     B
#>   <fct>   <dbl> <dbl>
#> 1 L        44.6  28.2
#> 2 M        24    28.8
#> 3 H        24.6  18.8

Here our solution is very straightforward :

warpbreaks %.% {
  { "{wool}" = mean(breaks) } ~ tension
}
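
For contrast, one possible base R equivalent of this aggregate-then-spread operation (just a sketch; there are other ways to write it):

reshape(
  aggregate(breaks ~ tension + wool, warpbreaks, mean),
  direction = "wide", idvar = "tension", timevar = "wool", v.names = "breaks"
)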

With {tidyr} I think we cannot do the following without copying the breaks column first:

warpbreaks %.% {
  { 
    "mean_{wool}" = mean(breaks)
    "median_{wool}" = median(breaks)
   } ~ tension
  # or { "{.value}_{wool}" = tibble(mean = mean(breaks), median = median(breaks)) } ~ tension
}
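
For comparison, a sketch of the {tidyr} workaround alluded to above: copy the value column first so that each copy can get its own aggregation function, then rename:

library(dplyr)
library(tidyr)
warpbreaks %>%
  mutate(breaks2 = breaks) %>%           # copy the value column
  pivot_wider(
    names_from  = wool,
    values_from = c(breaks, breaks2),
    values_fn   = list(breaks = mean, breaks2 = median)
  )
# yields breaks_A, breaks_B, breaks2_A, breaks2_B, which still need renaming
# to mean_A, mean_B, median_A, median_B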

I looked at all the examples from reshape(), dcast() and spread() too, and I believe the solution proposed above can do everything apart from the following:

Filling in NAs: I don't have a clear idea on how to do it, and it's a useful feature, so I must keep thinking.

Here's a crazy idea, using a pseudo op =fill:<value>=

"mean_{wool}" =fill:0= mean(breaks)

Margins: these are not an issue specific to reshaping but to aggregation; the tidyverse won't work on this because margins are not tidy (one row is one observation). But they're still sometimes useful. To be treated in another issue.

moodymudskipper commented 3 years ago

Pivoting to longer

We need: id columns, key columns, a value column, and varying columns.

That's one more than to pivot to wider, and it makes the challenge much harder.

The simplest way is to use stack (if we want to stay in base R) :

relig_income %>%
  pivot_longer(
    -religion, 
    names_to = "income", 
    values_to = "count")

becomes :

relig_income %.% {
  stack() ~ religion
}

And then we rename our columns, because stack doesn't support custom names.
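
For reference, a base R sketch of that stack()-then-rename route (the income/count names are just the ones used in the pivot_longer() call above):

library(tidyr) # only for the relig_income dataset
long <- data.frame(
  religion = rep(relig_income$religion, times = ncol(relig_income) - 1),
  stack(as.data.frame(relig_income[-1]))
)
names(long)[names(long) == "ind"]    <- "income"
names(long)[names(long) == "values"] <- "count"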

Note that we can summarize to more than one row per group; dplyr::summarize() allows it too (since 1.0). It's unlikely one would mix this operation with other operations, but it would be supported.

This is not satisfactory because pivot_longer() has much fancier features, and unfortunately, the fancier the harder to use/understand. So it'd be nice to have something more intuitive.

Let's try to rewrite the above with a candidate syntax :

relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}

We solve the problem of having one more type of column by using 2 names on the lhs. I believe the order is intuitive: first the labels, then the values. ?(".*") ensures we consider all (non-grouping) columns.


next example :

billboard %>%
 pivot_longer(
   cols = starts_with("wk"),
   names_to = "week",
   names_prefix = "wk",
   values_to = "rank",
   values_drop_na = TRUE
 )
#> # A tibble: 5,307 x 5
#>    artist  track                   date.entered week   rank
#>    <chr>   <chr>                   <date>       <chr> <dbl>
#>  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1        87
#>  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2        82
#>  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3        72
#>  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4        77
#>  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5        87
#>  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6        94
#>  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7        99
#>  8 2Ge+her The Hardest Part Of ... 2000-09-02   1        91
#>  9 2Ge+her The Hardest Part Of ... 2000-09-02   2        87
#> 10 2Ge+her The Hardest Part Of ... 2000-09-02   3        92
#> # ... with 5,297 more rows

The issue here is that we need to spot a prefix and strip it from the key values (what names_prefix does), and to drop the NAs. We can get close with the following, using the previous syntax, but we'll have to do the renaming downstream and filter out the NAs. The output is still arguably more readable, though I would have liked to do the renaming in the gathering operation.

billboard %.% {
  { c("week", "rank") = ?("^wk") } ~ (unused)
  week = gsub("^wk", "", week)
  subset(!is.na(rank))
}

We can use regexes and capture groups; we could support named capture groups in case of ambiguity:

billboard %.% {
  { c("week", "rank") = ?("^wk(.*)") } ~ (unused)
  subset(!is.na(rank))
}

A difference is that all columns matching the regex are kept, while in principle the cols argument could impose a subset first. In practice I think this will virtually never be an issue. If we had a good syntax for selection intersection we could leverage it here.


next example :

who %>% 
  pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)",
  values_to = "count"
)
#> # A tibble: 405,440 x 8
#>    country     iso2  iso3   year diagnosis gender age   count
#>    <chr>       <chr> <chr> <int> <chr>     <chr>  <chr> <int>
#>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
#>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
#>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
#>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
#>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
#>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
#>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
#>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
#>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
#> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA
#> # ... with 405,430 more rows

Here we are going to create several columns; it's quite straightforward, following our last example:

who %.% {
  { 
    c("diagnosis", "gender", "age", "count") = ?("^new_(.*)_(.)(.*)") 
  } ~ (unused)
}

cols = new_sp_m014:newrel_f65 here means: take everything that satisfies the pattern, so it's actually a feature not to have to type it. If it were more restrictive, we'd have to select the columns in a previous step.
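
A quick check of that claim on who (expected to return TRUE, since every column matching the pattern also starts with "new"):

library(tidyr)
a <- pivot_longer(
  who, cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)", values_to = "count"
)
b <- pivot_longer(
  who, cols = matches("^new"),
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)", values_to = "count"
)
identical(a, b)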


last example :

This one is tricky because we want several value columns; pivot_longer() itself is not easy to understand here:

anscombe
#>    x1 x2 x3 x4    y1   y2    y3    y4
#> 1  10 10 10  8  8.04 9.14  7.46  6.58
#> 2   8  8  8  8  6.95 8.14  6.77  5.76
#> 3  13 13 13  8  7.58 8.74 12.74  7.71
#> 4   9  9  9  8  8.81 8.77  7.11  8.84
#> 5  11 11 11  8  8.33 9.26  7.81  8.47
#> 6  14 14 14  8  9.96 8.10  8.84  7.04
#> 7   6  6  6  8  7.24 6.13  6.08  5.25
#> 8   4  4  4 19  4.26 3.10  5.39 12.50
#> 9  12 12 12  8 10.84 9.13  8.15  5.56
#> 10  7  7  7  8  4.82 7.26  6.42  7.91
#> 11  5  5  5  8  5.68 4.74  5.73  6.89
anscombe %>%
 pivot_longer(everything(),
   names_to = c(".value", "set"),
   names_pattern = "(.)(.)"
 )
#> # A tibble: 44 x 3
#>    set       x     y
#>    <chr> <dbl> <dbl>
#>  1 1        10  8.04
#>  2 2        10  9.14
#>  3 3        10  7.46
#>  4 4         8  6.58
#>  5 1         8  6.95
#>  6 2         8  8.14
#>  7 3         8  6.77
#>  8 4         8  5.76
#>  9 1        13  7.58
#> 10 2        13  8.74
#> # ... with 34 more rows

We can borrow the idea and do :

anscombe %.% {
  { 
   c("set", ".value") = ?("(?<.value>.)(.)")  
  } ~ (none)
}

I believe we're better off imposing that the last name on the lhs be the value column(s); this means we have to use named capture to solve this one. But we don't have to name all groups; it can work like named arguments to functions: we go by position after matching names.

moodymudskipper commented 3 years ago

some ambiguity remains


This looks like a summarizing call, and I'll have to inspect the symbols to know it's a pivot op:

fish_encounters %.% {
  { "{station}" = seen } ~ fish
}

Although strings are highlighted and would look special, so maybe it's not that bad.


And this is weird too :

relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}

Here it's not ambiguous, because the lhs is neither a symbol nor a call to ?.


Now, would we understand right away that the above are pivot operations? It would be nice to tag them in some way.


We often click on [+] to expand and on [-] to collapse, so maybe we could use those?

fish_encounters %.% {
  { "{station}" = -- seen } ~ fish
}

relig_income %.% {
  { c("income", "count") = ++ ?(".*") } ~ religion
}

maybe better:

fish_encounters %.% {
  { -- "{station}" = seen } ~ fish
}

relig_income %.% {
  { ++ c("income", "count") = ?(".*") } ~ religion
}

Or we could use a different symbol to signal we're pivoting, but we're a bit short on those; this doesn't look so good:

fish_encounters %.% {
  { "{station}" := seen } ~ fish
}

relig_income %.% {
  { c("income", "count") := ?(".*") } ~ religion
}

Or some infix, used for syntax only, or containing the actual logic :

fish_encounters %.% {
  { "{station}" %=wider%  seen } ~ fish
}

relig_income %.% {
  { c("income", "count") %=longer% ?(".*") } ~ religion
}

They don't look really good though, and I don't like using %% ops for assignment.

moodymudskipper commented 3 years ago

A couple of months later: to pivot to wider and to longer respectively, I think these are fine:

fish_encounters %.% {
  { "{station}" = seen } ~ fish
}

relig_income %.% {
  { c("income", "count") = ?(".*") } ~ religion
}

I still think a fill option is needed for pivoting to wider though. What about:

fish_encounters %.% {
  { "{station}" ~ 0 = seen } ~ fish
}

This has the fill value on the lhs, where we expect only names, so not very intuitive.

or

fish_encounters %.% {
  { "{station}" = seen ~ 0} ~ fish
}

It uses the mutate-by syntax under a summarizing call, which is confusing.

or

fish_encounters %.% {
  { "{station}, fill = 0" = seen} ~ fish
}

This looks clunky for sure, but is readable enough. Can be shortened :

fish_encounters %.% {
  { "{station}/0" = seen} ~ fish
}

We can use '' (single quotes) when we need a string there; not the best, but maybe the best we can do.

To be compared with:

fish_encounters %>%
  pivot_wider(names_from = station, values_from = seen, values_fill = 0)
moodymudskipper commented 3 years ago

more on the "fill" syntax.

I think reusing ~ creates unnecessary confusion; we have other operators, and I think / and : look the best here:

fish_encounters %.% {
  { "{station}" / 0 = seen } ~ fish
  { "{station}" : 0 = seen } ~ fish
}

We could also have a special pseudo-operator, or consider | as special in this position (might backfire):

fish_encounters %.% {
  { "{station}" = seen %fill% 0} ~ fish
  { "{station}" = seen | 0} ~ fish
  { "{station}" = seen = 0} ~ fish
  { "{station}" = seen ? 0} ~ fish
  { "{station}" = seen := 0} ~ fish
}

There's also the option to use () or []:

fish_encounters %.% {
  { "{station}" (0) = seen} ~ fish
  { "{station}" [0] = seen} ~ fish
}
moodymudskipper commented 3 years ago

A crazy idea that would solve the ambiguity: we use :=, and for fill we insert something between the colon and the equals sign:

fish_encounters %.% {
  { "{station}" := seen } ~ fish   # no fill
  { "{station}" :0= seen } ~ fish # fill with 0
}

We'd use := to pivot to longer too.