machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License
1.16k stars 49 forks source link

Feature Request: `pivot_longer` and `pivot_wider` vs `gather` and `spread` #233

Open breichholf opened 4 years ago

breichholf commented 4 years ago

Hi,

taking you up on your tweet from a few months ago, to see what happens 😉

First off, hats off for starting this fantastic project. It has made me curious to give data analysis in python a go again, taking siuba for a spin.

As you might know, in tidyr 1.0.0 the new verbs pivot_long and pivot_wider were introduced, building on gather and spread respectively.

Having used tidy verse for a few years already I'm quite happy with the gather and spread verbs, but in line with Hadley Wickham agree that for (potential) new users pivot_longer and pivot_wider is probably more intuitive.

As far as I can tell from a cursory glance your implementation of gather matches the original tidyr implementation, whereas pivot_longer for instance offers additional parameters such as names_to and values_to, to easily set column names.

Cheers, Brian

machow commented 4 years ago

Hey @breichholf, thanks! This request is a little more ambitious than the last one (custom SQL joins), but I agree the functions are really useful :).

There's a lot of behavior in pivot_longer and pivot_wider, but a big leg up for siuba is that their unit tests are pretty simple and clear.

I'll take a quick pass at stubbing them out w/ the unit tests in the siuba.experimental module tomorrow, so I can chip away at them (and so other people can always jump in if they want ;)!

breichholf commented 4 years ago

Cool, I'd be happy to chip in where I can :)

Will keep an eye on the progress and would be keen to help out and get in touch when you get started.

machow commented 4 years ago

Alright--chucked up a stubbed out pivot_longer and also recorded my screen while in did, in case it's useful! PR here: https://github.com/machow/siuba/pull/237

If you're interested--it seems like there are roughly three nice areas to chip away at pivot_longer:

I've sketched out examples for the two research questions below! Any interest in tackling one of the things above? The third will be a bit tricky, but the upshot is that it taps into how pandas thinks about reshaping with multi-indexes (so is super handy to go through).

(useful tidyr pivot vignette here)

names_prefix (easy)

This case should be simple, since it is similar to gather, except we remove the beginning "wk" from measure variable names.

weeks <- tribble(
  ~id, ~wk1, ~wk2,
  'a', 6, 7,
  'b', 8, 9
)

pivot_longer(weeks, starts_with("wk"), names_to = "week", names_prefix = "wk")

output:

# A tibble: 4 x 3
  id    week  value
  <chr> <chr> <dbl>
1 a     1         6
2 a     2         7
3 b     1         8
4 b     2         9

names_sep (hard)

This example might happen if the numbers in the column names are something like weeks. So in the data below, x_1 is measure x at week 1, and y_2 is measure y over week 2. In the case below, ID might be a person.

df <- tribble(
  ~id, ~x_1, ~x_2, ~y_2,
  "A",    1,    2,  "a",
  "B",    3,    4,  "b",
)

pv <- pivot_longer(df, -id, names_to = c(".value", "n"), names_sep = "_")

pv

output:

# A tibble: 4 x 4
  id    n         x y    
  <chr> <chr> <dbl> <chr>
1 A     1         1 NA   
2 A     2         2 a    
3 B     1         3 NA   
4 B     2         4 b  

possible pandas implementations

I'm guessing this is won't require a lot of code, and will mostly involve the stack method. On the top level of the index would be x and y, on the second level would be n (guessing?)

breichholf commented 4 years ago

Great suggestions all around. Thanks for the detailed descriptions.

I'm usually super negligent about unit tests, so I want to say that I should tackle those (as well). I've used pandas on and off again already, so I'm thinking I might even want to tackle the harder implementation of names_sep.

I agree, either pd.DataFrame.stack or pd.DataFrame.melt will do the trick. I'm thinking pd.DataFrame.melt might provide an easier avenue to provide the names. I'll fork and take a stab at it over the weekend, and can give you a little update how I'm getting along after that 👍

machow commented 4 years ago

The names_sep one definitely seems like an interesting one to chase down--thanks for taking a look!

pd.DataFrame.melt might provide an easier avenue to provide the names.

gather uses DataFrame.melt under the hood, so definitely worth checking out the code. Here's some code producing the same result as names_sep using a combo of gather (uses DataFrame.melt) and spread (uses DataFrame.unstack).

I'm guessing there's some way to use DataFrame.stack to do it in 1 step, but am not totally sure!

import pandas as pd
from siuba import _, gather, separate, spread

df = pd.DataFrame({
    "id": ["A", "B"],
    "x_1": [1, 3],
    "x_2": [2, 4],
    "y_2": ["a", "b"]
})

long = gather(df, "measure", "value", -_.id)

(
    long
    >> separate(_.measure, ["variable", "n"])
    >> spread(_.variable, _.value)
)

Output:

  id  n  x    y
0  A  1  1  NaN
1  A  2  2    a
2  B  1  3  NaN
3  B  2  4    b
machow commented 4 years ago

Thanks for adding your PR--I was struggling to get a handle on how stack works as I went through it, so fiddled a bit. It seems like this reproduces the names_sep result! (just needs to be joined with id vars).

# continuing code from above...

# get only measure variables
measures = df.iloc[:, 1:] 

# create a multi index of their split names
indx_multi = measures.columns.str.split('_').map(tuple).set_names([None, 'n'])
indx_multi

# make columns a multi index
tmp = measures.copy()
tmp.columns = indx_multi 

# do the reshaping
# stack by default "stacks" level -1, which in this case is the last entry
# when we using something like some_col.str.split('_')
stacked = tmp.stack()
stacked.reset_index(level = -1)

output:

   n  x    y
0  1  1  NaN
0  2  2    a
1  1  3  NaN
1  2  4    b

Pretty crazy if it works--will take a closer look at the PR right now!

breichholf commented 4 years ago

Oh that looks cool! I was thinking about names_sep earlier today and completely glossed over just treating it as a character columns.str.split(). Ended up drifting off in the direction of thinking about regexes, so I definitely appreciate the point in the direction of this simple approach.

I have some spare time on my hands tomorrow, so I'll look in to getting that down then. Would be fantastic if it were that 'easy'.

breichholf commented 4 years ago

So I wanted to initially post this in the comment conversation as answer to the pivoting behaviour, but it ended up being a longer train of thought, so I hope it is more conducive for a discussion to have it in this issue.

I took your comment as an 'excuse' to have a (slightly) deeper look at how they wrote gather vs pivot_longer. Both have a cpp backend, but interestingly pivot_longer and pivot_wider use the same c function to pivot: vec_cbind (from Hadley's package vctrs). As far as I understand (admittedly only from a cursory glance) is that building the _spec allows them to pivot both to _longer and _wider using the same base (c) function. This essentially takes vectors as input and slices (something that pandas and siuba already does nicely).

It's interesting how they achieve it, and has made me think if we should at all bother building this spec dataframe. At first I thought spec was just a fancy way of avoiding the combination of names_to and names_sep/names_pattern, but it's actually much more convenient: it provides a 'reference' that you can use to revert between the long and short form.

Let me demonstrate, using the same example as you are using above, just in R:

> df
# A tibble: 2 x 4
  id      x_1   x_2 y_2  
  <chr> <dbl> <dbl> <chr>
1 A         1     2 a    
2 B         3     4 b
> spec <- build_longer_spec(df, -id, names_to = c(".value", "n"), values_to = "values", names_sep = "_")
> spec
# A tibble: 3 x 3
  .name .value n    
  <chr> <chr>  <chr>
1 x_1   x      1    
2 x_2   x      2    
3 y_2   y      2
> pv <- pivot_longer_spec(df, spec)
> pv
# A tibble: 4 x 4
  id    n         x y    
  <chr> <chr> <dbl> <chr>
1 A     1         1 NA   
2 A     2         2 a    
3 B     1         3 NA   
4 B     2         4 b
>  pivot_wider_spec(pv, spec)
# A tibble: 2 x 4
  id      x_1   x_2 y_2  
  <chr> <dbl> <dbl> <chr>
1 A         1     2 a    
2 B         3     4 b

So, when pivoting longer spec serves as a mapping of current column names (in .name) to their new column name in .value and their content in n (which was provided in names_to). You already demonstrated an elegant solution using a multi-indexed column.

For pivot_wider it's similar, but .name contains the new column names

Let's have a look at (my attempt) of achieving the same thing using your python example. It's a bit trickier as we have a multi index and we have NaNs. A caveat here is, that unstack doesn't have a dropna argument like stack does.

reverted_tmp = stacked.unstack().dropna(axis=1)
assert_frame_equal(tmp, reverted_tmp)

# After reseting the index it is slightly more complicated
stacked_reset = stacked.reset_index(level = -1)
reverse_multi_index = pd.MultiIndex.from_tuples(list(zip(stacked_reset.index.tolist(), stacked_reset.n)), names = [None, 'n'])
tmp_unstack = stacked_reset.loc[:,'x':'y']
tmp_unstack.index = reverse_multi_index

# Now we can unstack, to get back to tmp.
In [26]: tmp_unstack.unstack().dropna(axis=1)
Out[26]:
   x     y
n  1  2  2
0  1  2  a
1  3  4  b

# this passes now, too:
assert_frame_equal(tmp_unstack.unstack().dropna(axis=1), tmp)

I wonder if there is a more 'elegant' way of doing it, using a similar approach to R, with an intermediary spec dataframe.

I'm still mulling it over and will maybe try and tackle it in the next few evenings.

machow commented 4 years ago

edit: Moving my last 3 thoughts from the bottom up to the top!

  1. pivot_longer should be very easy to implement
  2. pivot_wider will be easy to implement using DataFrame.pivot_table if we don't mess too much with a custom aggfunc (maybe just let user pass it straight to pivot_table?)
  3. there's probably some very clever way to get a different aggfunc default, so it matches dplyr more closely, but will likely be a lot of work! :/ :/ :/

Ah, thanks for writing all of these notes! It's really helpful to see that the spec goes from wide to long, but also inverts from long to wide.

Based on your examples, I think I'm getting the gist of how pivot_wider and pivot_longer are connected. Fingers crossed, it should map pretty cleanly to pandas DataFrame.pivot_table.

Issues with unstack when need to aggregate values

The challenge with unstack, I think, is that it will raise an error for ...

from siuba.data import mtcars

id_cols = ['cyl']   
key_cols = ['gear']
val_cols = ['mpg']

mtcars.set_index(id_cols + key_cols).unstack(level = -1)

Because it wants to have 1 row per level of cylinders, and 1 column per gear, but there are multiple cars with e.g. 4 cylinders and 4 gears. pivot_table is like unstack, but you also tell it what to do in these cases of multiples.

using DataFrame.pivot_table

Going back to the spec example...

> spec
# A tibble: 3 x 3
  .name .value n    
  <chr> <chr>  <chr>
1 x_1   x      1    
2 x_2   x      2    
3 y_2   y      2

The DataFrame.pivot_table arguments would be...

so...

previous code ```python import pandas as pd from siuba import _, gather, separate, spread df = pd.DataFrame({ "id": ["A", "B"], "x_1": [1, 3], "x_2": [2, 4], "y_2": ["a", "b"] }) long = ( df >> gather("measure", "value", -_.id) >> separate(_.measure, ["variable", "n"]) >> spread(_.variable, _.value) ) long ```
> long
  id  n  x    y
0  A  1  1  NaN
1  A  2  2    a
2  B  1  3  NaN
3  B  2  4    b
wide_raw = long.pivot_table(values = ["x", "y"], columns = "n", index = ["id"], aggfunc = list)
wide_raw
      x           y     
n     1    2      1    2
id                      
A   [1]  [2]  [nan]  [a]
B   [3]  [4]  [nan]  [b]
# has multi-index, so replace with "_" joined
wide_raw.columns = list(map("_".join, wide_raw.columns)) 

wide = (
    wide_raw
    .reset_index()
    .rename_axis(columns = lambda x: None)
)
  id  x_1  x_2    y_1  y_2
0  A  [1]  [2]  [nan]  [a]
1  B  [3]  [4]  [nan]  [b]

Note that in the same way pivot_wider has an arg like values_fn=list(n=mean, ...), which decides how to aggregate the columns, pivot_table has an aggfunc arg.

dropping NA columns

A caveat here is, that unstack doesn't have a dropna argument like stack does.

Surprisingly, pivot_table does! It seems like in your example, the spec created from build_longer_spec is how pivot_wider knows not to create a y_1 column! If you do pivot_wider directly it creates y_1 as all NAs...

pivot_wider(pv, names_from = c(n), values_from = c(x, y))
# A tibble: 2 x 5
  id      x_1   x_2 y_1   y_2  
  <chr> <dbl> <dbl> <chr> <chr>
1 A         1     2 NA    a    
2 B         3     4 NA    b    

pivot_table idiosyncracies

pivot_table actually, AFAICT, has a pretty complex set of choices it might make--mostly depending on the kind of data it receives, and aggfuncs.

sometimes it ignores things it can't aggregate

For example...

# all object types
long.dtypes
id    object
n     object
x     object
y     object
dtype: object
# notice, now some columns are integers, and 1 is a string
converted = long.convert_dtypes()
converted.dtypes
id    string
n     string
x      Int64
y     string
dtype: object
# notice no y column in the output :/
converted.pivot_table(values = ["x", "y"], columns = "n", index = ["id"])  
    x   
n   1  2
id      
A   1  2
B   3  4

it probably has special, optimized approaches for certain functions

often, pandas treats certain functions faster. For example, if you do...

import numpy as np
from siuba.data import mtcars
mtcars.groupby('cyl').hp.apply(np.sum)

pandas uses kind of an intricate lookup table to figure out that it can run something np.sum in a fast way. It is surprisingly, sort of just hard-coded in a file. When these aggregations are applied, things likely go much faster! (and everything else much slower :/... )

This seems to also be the case for pivot_table's aggfunc arg, which uses "mean" by default.

Anyway, sorry, that was a lot of exploring things--I think my three impressions are that...

  1. pivot_longer should be very easy to implement
  2. pivot_wider will be easy to implement if we don't mess too much with a custom aggfunc (maybe just let user pass it straight to pivot_table?)
  3. there's probably some very clever way to get a different aggfunc default, so it matches dplyr more closely, but will likely be a lot of work!
breichholf commented 4 years ago

Thanks for that detailed exploration and the examples. It prompted me to muck around some more with pd.pivot_table, pd.DataFrame.pivot and pd.DataFrame.unstack.

The challenge with unstack, I think, is that it will raise an error [...] [b]ecause it wants to have 1 row per level of cylinders, and 1 column per gear, but there are multiple cars with e.g. 4 cylinders and 4 gears. pivot_table is like unstack, but you also tell it what to do in these cases of multiples.

Fully agree, while it conceptually is the opposite of stack(), the use case for pivot_table is almost identical to the tidyverse pivot_wider, and I would definitely go with pivot_table when porting pivot_wider.

Your description of the dtype behaviour with pivot_table is also a helpful primer for when we get to that. I'm usually a fan of keeping data types what they should be, but auto-casting with convert_dtypes(). Looking at tidyverse/dplyr when joining columns (eg using mutate) this might create problems.

One minor thing I just noticed: in the requirements.txt you have pandas==0.25.3 (which I installed in a venv, to test accordingly), but pd.DataFrame.convert_dtypes() is a new function in pandas 1.0.0 ;)

Back to next steps of pivot_longer for now

I'll make sure the tests run fine first, and will not bother implementing the same functionality that spec fulfils in the tidyverse for now. As we have seen it's not horribly complex, so hopefully it won't take too much longer, once I sit down and actually work on it.

breichholf commented 4 years ago

Good and bad news

I have been playing around with the currently failing tests, related to pivot_longer_spec. I guess you could categorise my progress as "minor successes, with an icing of frustration".

Let me explain, using some code, and maybe you have an idea where I'm going wrong. My aim was to work backwards. First manually generate the spec DataFrame, and see how I need to set up pivot_longer_spec for it to provide the same output as pivot_longer.

Minor disclaimer first off: I generated the spec DataFrames in R first and set them manually. Here's the first currently failing test test_spec_add_multi_columns:

# expected output from R:
# > df <- tibble(x = 1:2, y = 3:4)
# > sp <- tibble(.name = c("x", "y"), .value = "v", a = 1, b = 2)
# > pivot_longer_spec(df, sp)
# # A tibble: 4 x 3
#       a     b     v
#   <dbl> <dbl> <int>
# 1     1     2     1
# 2     1     2     3
# 3     1     2     2
# 4     1     2     4

df = data_frame(x = [1, 2], y = [3, 4])
sp = data_frame(_name = ['x', 'y'],
                _value = ['v'] * 2,
                a = [1, 1], b = [2, 2])

stacked = df.loc[:, sp.loc[:,'_name'].tolist()].stack()
value_name = sp.loc[:,'_value'].unique().tolist()[0]
output_col_order = sp.columns.tolist()[2:] + [value_name]
stacked = stacked.rename(value_name)

stacked.index = stacked.index.set_names([None, '_name'])
stacked_disordered = (
    stacked
    .reset_index(1)
    .merge(sp, on = "_name", how = "left")
    .drop(['_name', '_value'], 1)
)

output = stacked_disordered[order]
output
   a  b  v
0  1  2  1
1  1  2  3
2  1  2  2
3  1  2  4

So far so good 🎉 The test should pass (I haven't formally confirmed that yet, actually).

Now on for the trickier part

Unfortunately, spec is not always that 'simple'. The example we discussed above already points this out quite nicely:

> spec
# A tibble: 3 x 3
  .name .value n    
  <chr> <chr>  <chr>
1 x_1   x      1    
2 x_2   x      2    
3 y_2   y      2

This will not work in my code for pivot_longer_spec, as it falls apart starting in the following line:

value_name = sp.loc[:,'_value'].unique().tolist()[0]

Even if I change remove unique().tolist()[0] (which will break the first test, but I could single that out if all values are the same), using a list further down does not work anyway, as pd.Series.rename() accepts a string, a dict as mapper, but not a list. So, that's pretty much where I am stuck currently.

If you have any suggestions let me know and I can give it another go :)

Here's to hoping pivot_wider will be easier 😅