breichholf opened 4 years ago
Hey @breichholf, thanks! This request is a little more ambitious than the last one (custom SQL joins), but I agree the functions are really useful :).
There's a lot of behavior in pivot_longer and pivot_wider, but a big leg up for siuba is that their unit tests are pretty simple and clear.
I'll take a quick pass at stubbing them out w/ the unit tests in the siuba.experimental module tomorrow, so I can chip away at them (and so other people can always jump in if they want ;)!
Cool, I'd be happy to chip in where I can :)
Will keep an eye on the progress; I'd be keen to help out, so I'll get in touch when you get started.
Alright--chucked up a stubbed-out pivot_longer, and also recorded my screen while I did, in case it's useful! PR here: https://github.com/machow/siuba/pull/237
If you're interested--it seems like there are roughly three nice areas to chip away at pivot_longer:
I've sketched out examples for the two research questions below! Any interest in tackling one of the things above? The third will be a bit tricky, but the upshot is that it taps into how pandas thinks about reshaping with multi-indexes (so is super handy to go through).
(useful tidyr pivot vignette here)
This case should be simple, since it is similar to gather, except we remove the beginning "wk" from measure variable names.
weeks <- tribble(
  ~id, ~wk1, ~wk2,
  'a', 6, 7,
  'b', 8, 9
)
pivot_longer(weeks, starts_with("wk"), names_to = "week", names_prefix = "wk")
output:
# A tibble: 4 x 3
id week value
<chr> <chr> <dbl>
1 a 1 6
2 a 2 7
3 b 1 8
4 b 2 9
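In pandas, this first case could be close to a one-liner--a rough sketch (not the final siuba API, just melt plus a string replace):

import pandas as pd

weeks = pd.DataFrame({"id": ["a", "b"], "wk1": [6, 8], "wk2": [7, 9]})

# melt the measure columns, then strip the leading "wk" from the names
long = weeks.melt(id_vars = "id", var_name = "week", value_name = "value")
long["week"] = long["week"].str.replace("^wk", "", regex = True)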
This example might happen if the numbers in the column names are something like weeks. So in the data below, x_1 is measure x at week 1, and y_2 is measure y at week 2. In the case below, id might be a person.
df <- tribble(
  ~id, ~x_1, ~x_2, ~y_2,
  "A", 1, 2, "a",
  "B", 3, 4, "b",
)
pv <- pivot_longer(df, -id, names_to = c(".value", "n"), names_sep = "_")
pv
output:
# A tibble: 4 x 4
id n x y
<chr> <chr> <dbl> <chr>
1 A 1 1 NA
2 A 2 2 a
3 B 1 3 NA
4 B 2 4 b
possible pandas implementations
I'm guessing this won't require a lot of code, and will mostly involve the stack method. On the top level of the index would be x and y, on the second level would be n (guessing?).
Great suggestions all around. Thanks for the detailed descriptions.
I'm usually super negligent about unit tests, so I want to say that I should tackle those (as well). I've used pandas on and off already, so I'm thinking I might even want to tackle the harder implementation of names_sep.

I agree, either pd.DataFrame.stack or pd.DataFrame.melt will do the trick. I'm thinking pd.DataFrame.melt might provide an easier avenue to provide the names.
I'll fork and take a stab at it over the weekend, and can give you a little update how I'm getting along after that 👍
The names_sep one definitely seems like an interesting one to chase down--thanks for taking a look!

> pd.DataFrame.melt might provide an easier avenue to provide the names.

gather uses DataFrame.melt under the hood, so it's definitely worth checking out the code. Here's some code producing the same result as names_sep, using a combo of gather (uses DataFrame.melt) and spread (uses DataFrame.unstack). I'm guessing there's some way to use DataFrame.stack to do it in 1 step, but am not totally sure!
import pandas as pd
from siuba import _, gather, separate, spread
df = pd.DataFrame({
    "id": ["A", "B"],
    "x_1": [1, 3],
    "x_2": [2, 4],
    "y_2": ["a", "b"]
})
long = gather(df, "measure", "value", -_.id)
(
    long
    >> separate(_.measure, ["variable", "n"])
    >> spread(_.variable, _.value)
)
Output:
id n x y
0 A 1 1 NaN
1 A 2 2 a
2 B 1 3 NaN
3 B 2 4 b
Thanks for adding your PR--I was struggling to get a handle on how stack works as I went through it, so fiddled a bit. It seems like this reproduces the names_sep result! (just needs to be joined with id vars).
# continuing code from above...
# get only measure variables
measures = df.iloc[:, 1:]
# create a multi index of their split names
indx_multi = measures.columns.str.split('_').map(tuple).set_names([None, 'n'])
indx_multi
# make columns a multi index
tmp = measures.copy()
tmp.columns = indx_multi
# do the reshaping
# stack by default "stacks" level -1, which in this case is the last entry
# when using something like some_col.str.split('_')
stacked = tmp.stack()
stacked.reset_index(level = -1)
output:
n x y
0 1 1 NaN
0 2 2 a
1 1 3 NaN
1 2 4 b
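And the join with the id vars mentioned above could be as simple as (continuing the code, and leaning on index alignment):

# df's unique index aligns with the repeated index of the stacked frame,
# so join repeats each id across its reshaped rows
df[["id"]].join(stacked.reset_index(level = -1))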
Pretty crazy if it works--will take a closer look at the PR right now!
Oh that looks cool! I was thinking about names_sep earlier today and completely glossed over just treating it as a character split with columns.str.split(). Ended up drifting off in the direction of thinking about regexes, so I definitely appreciate the point in the direction of this simple approach.

I have some spare time on my hands tomorrow, so I'll look into getting that done then. Would be fantastic if it were that 'easy'.
I initially wanted to post this in the comment conversation as an answer on the pivoting behaviour, but it ended up being a longer train of thought, so I hope it is more conducive to discussion to have it in this issue.
I took your comment as an 'excuse' to have a (slightly) deeper look at how they wrote gather vs pivot_longer. Both have a cpp backend, but interestingly pivot_longer and pivot_wider use the same C function to pivot: vec_cbind (from Hadley's package vctrs). As far as I understand (admittedly only from a cursory glance), building the spec allows them to pivot both to _longer and _wider using the same base (C) function. This essentially takes vectors as input and slices (something that pandas and siuba already do nicely).
It's interesting how they achieve it, and it has made me wonder whether we should bother building this spec dataframe at all. At first I thought spec was just a fancy way of avoiding the combination of names_to and names_sep/names_pattern, but it's actually much more convenient: it provides a 'reference' that you can use to move back and forth between the long and wide forms.

Let me demonstrate, using the same example as you used above, just in R:
> df
# A tibble: 2 x 4
id x_1 x_2 y_2
<chr> <dbl> <dbl> <chr>
1 A 1 2 a
2 B 3 4 b
> spec <- build_longer_spec(df, -id, names_to = c(".value", "n"), values_to = "values", names_sep = "_")
> spec
# A tibble: 3 x 3
.name .value n
<chr> <chr> <chr>
1 x_1 x 1
2 x_2 x 2
3 y_2 y 2
> pv <- pivot_longer_spec(df, spec)
> pv
# A tibble: 4 x 4
id n x y
<chr> <chr> <dbl> <chr>
1 A 1 1 NA
2 A 2 2 a
3 B 1 3 NA
4 B 2 4 b
> pivot_wider_spec(pv, spec)
# A tibble: 2 x 4
id x_1 x_2 y_2
<chr> <dbl> <dbl> <chr>
1 A 1 2 a
2 B 3 4 b
So, when pivoting longer, spec serves as a mapping of current column names (in .name) to their new column name in .value and their content in n (which was provided in names_to). You already demonstrated an elegant solution using a multi-indexed column.

For pivot_wider it's similar, but .name contains the new column names.
Let's have a look at (my attempt at) achieving the same thing using your python example. It's a bit trickier, as we have a multi index and we have NaNs. A caveat here is that unstack doesn't have a dropna argument like stack does.
from pandas.testing import assert_frame_equal

reverted_tmp = stacked.unstack().dropna(axis=1)
assert_frame_equal(tmp, reverted_tmp)

# After resetting the index it is slightly more complicated
stacked_reset = stacked.reset_index(level = -1)
reverse_multi_index = pd.MultiIndex.from_tuples(
    list(zip(stacked_reset.index.tolist(), stacked_reset.n)),
    names = [None, 'n']
)
tmp_unstack = stacked_reset.loc[:, 'x':'y']
tmp_unstack.index = reverse_multi_index

# Now we can unstack, to get back to tmp.
tmp_unstack.unstack().dropna(axis=1)
output:
x y
n 1 2 2
0 1 2 a
1 3 4 b
# this passes now, too:
assert_frame_equal(tmp_unstack.unstack().dropna(axis=1), tmp)
I wonder if there is a more 'elegant' way of doing it, using a similar approach to R, with an intermediary spec dataframe. I'm still mulling it over and will maybe try to tackle it in the next few evenings.
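For what it's worth, building something spec-shaped in pandas seems straightforward. A hypothetical sketch for the names_sep case (column names and layout mirroring build_longer_spec):

import pandas as pd

df = pd.DataFrame({
    "id": ["A", "B"],
    "x_1": [1, 3], "x_2": [2, 4], "y_2": ["a", "b"]
})

# one row per measure column: .name holds the old name, split into .value and n
measure_cols = [c for c in df.columns if c != "id"]
spec = pd.DataFrame(
    [(name, *name.split("_")) for name in measure_cols],
    columns = [".name", ".value", "n"]
)
spec
output:
.name .value n
0 x_1 x 1
1 x_2 x 2
2 y_2 y 2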
edit: Moving my last 3 thoughts from the bottom up to the top!

DataFrame.pivot_table should do the trick, if we don't mess too much with a custom aggfunc (maybe just let the user pass it straight to pivot_table?)

Ah, thanks for writing all of these notes! It's really helpful to see that the spec goes from wide to long, but also inverts from long to wide.
Based on your examples, I think I'm getting the gist of how pivot_wider and pivot_longer are connected. Fingers crossed, it should map pretty cleanly to pandas DataFrame.pivot_table.
The challenge with unstack, I think, is that it will raise an error for ...
from siuba.data import mtcars
id_cols = ['cyl']
key_cols = ['gear']
val_cols = ['mpg']
mtcars.set_index(id_cols + key_cols).unstack(level = -1)
Because it wants to have 1 row per level of cylinders, and 1 column per gear, but there are multiple cars with e.g. 4 cylinders and 4 gears. pivot_table is like unstack, but you also tell it what to do in these cases of multiples.
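For example, reusing the columns above, the same reshape goes through with pivot_table, since duplicates get aggregated instead of raising:

# unlike unstack, multiple rows per (cyl, gear) cell are reduced by aggfunc
mtcars.pivot_table(values = val_cols, index = id_cols, columns = key_cols, aggfunc = "mean")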
Going back to the spec example...
> spec
# A tibble: 3 x 3
.name .value n
<chr> <chr> <chr>
1 x_1 x 1
2 x_2 x 2
3 y_2 y 2
The DataFrame.pivot_table arguments would be...

values: unique(.value)
columns: the spec columns that aren't .name, .value (here, n)
index: any columns in the data not in the columns or values args above (here, id)

so...
> long
id n x y
0 A 1 1 NaN
1 A 2 2 a
2 B 1 3 NaN
3 B 2 4 b
wide_raw = long.pivot_table(values = ["x", "y"], columns = "n", index = ["id"], aggfunc = list)
wide_raw
x y
n 1 2 1 2
id
A [1] [2] [nan] [a]
B [3] [4] [nan] [b]
# has multi-index, so replace with "_" joined
wide_raw.columns = list(map("_".join, wide_raw.columns))
wide = (
    wide_raw
    .reset_index()
    .rename_axis(columns = lambda x: None)
)
wide
output:
id x_1 x_2 y_1 y_2
0 A [1] [2] [nan] [a]
1 B [3] [4] [nan] [b]
Note that in the same way pivot_wider has an arg like values_fn = list(n = mean, ...), which decides how to aggregate the columns, pivot_table has an aggfunc arg.
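(aggfunc can even be a dict keyed by column--not a 1:1 match for values_fn, but close in spirit:)

# per-column aggregation, roughly analogous to values_fn = list(x = ..., y = ...)
long.pivot_table(values = ["x", "y"], columns = "n", index = ["id"], aggfunc = {"x": list, "y": list})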
> A caveat here is that unstack doesn't have a dropna argument like stack does.
Surprisingly, pivot_table does! It seems like in your example, the spec created from build_longer_spec is how pivot_wider knows not to create a y_1 column! If you do pivot_wider directly, it creates y_1 as all NAs...
pivot_wider(pv, names_from = c(n), values_from = c(x, y))
# A tibble: 2 x 5
id x_1 x_2 y_1 y_2
<chr> <dbl> <dbl> <chr> <chr>
1 A 1 2 NA a
2 B 3 4 NA b
pivot_table actually, AFAICT, has a pretty complex set of choices it might make--mostly depending on the kind of data it receives, and the aggfuncs.
For example...
# all object types
long.dtypes
id object
n object
x object
y object
dtype: object
# notice, now some columns are integers, and 1 is a string
converted = long.convert_dtypes()
converted.dtypes
id string
n string
x Int64
y string
dtype: object
# notice no y column in the output :/
converted.pivot_table(values = ["x", "y"], columns = "n", index = ["id"])
x
n 1 2
id
A 1 2
B 3 4
Often, pandas can run certain functions faster. For example, if you do...

import numpy as np
from siuba.data import mtcars
mtcars.groupby('cyl').hp.apply(np.sum)

pandas uses kind of an intricate lookup table to figure out that it can run something like np.sum in a fast way. It is, surprisingly, sort of just hard-coded in a file. When these aggregations are applied, things likely go much faster! (and everything else much slower :/...)
This seems to also be the case for pivot_table's aggfunc arg, which uses "mean" by default.
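For instance, these two should produce the same table, but only the first can be matched to a fast aggregation:

# "mean" (also the default) gets looked up and run on a fast path...
mtcars.pivot_table(values = "mpg", index = "cyl", columns = "gear", aggfunc = "mean")

# ...while an equivalent lambda can't be matched, so it falls back to a slower apply
mtcars.pivot_table(values = "mpg", index = "cyl", columns = "gear", aggfunc = lambda ser: ser.mean())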
Anyway, sorry, that was a lot of exploring things--I think my three impressions are that...
Thanks for that detailed exploration and the examples. It prompted me to muck around some more with pd.pivot_table, pd.DataFrame.pivot and pd.DataFrame.unstack.
> The challenge with unstack, I think, is that it will raise an error [...] [b]ecause it wants to have 1 row per level of cylinders, and 1 column per gear, but there are multiple cars with e.g. 4 cylinders and 4 gears. pivot_table is like unstack, but you also tell it what to do in these cases of multiples.
Fully agree: while it is conceptually the opposite of stack(), the use case for pivot_table is almost identical to the tidyverse pivot_wider, and I would definitely go with pivot_table when porting pivot_wider.
Your description of the dtype behaviour with pivot_table is also a helpful primer for when we get to that. I'm usually a fan of keeping data types what they should be, but auto-casting with convert_dtypes() might create problems when combining columns (e.g. using mutate), looking at how tidyverse/dplyr handles that.
One minor thing I just noticed: in the requirements.txt you have pandas==0.25.3 (which I installed in a venv, to test accordingly), but pd.DataFrame.convert_dtypes() is a new function in pandas 1.0.0 ;)
pivot_longer for now

I'll make sure the tests run fine first, and will not bother implementing the same functionality that spec fulfils in the tidyverse for now. As we have seen it's not horribly complex, so hopefully it won't take too much longer once I sit down and actually work on it.
I have been playing around with the currently failing tests related to pivot_longer_spec. I guess you could categorise my progress as "minor successes, with an icing of frustration".

Let me explain, using some code, and maybe you have an idea where I'm going wrong. My aim was to work backwards: first manually generate the spec DataFrame, and see how I need to set up pivot_longer_spec for it to provide the same output as pivot_longer.

Minor disclaimer first off: I generated the spec DataFrames in R first and set them manually. Here's the first currently failing test, test_spec_add_multi_columns:
# expected output from R:
# > df <- tibble(x = 1:2, y = 3:4)
# > sp <- tibble(.name = c("x", "y"), .value = "v", a = 1, b = 2)
# > pivot_longer_spec(df, sp)
# # A tibble: 4 x 3
# a b v
# <dbl> <dbl> <int>
# 1 1 2 1
# 2 1 2 3
# 3 1 2 2
# 4 1 2 4
df = data_frame(x = [1, 2], y = [3, 4])
sp = data_frame(
    _name = ['x', 'y'],
    _value = ['v'] * 2,
    a = [1, 1], b = [2, 2]
)
stacked = df.loc[:, sp.loc[:,'_name'].tolist()].stack()
value_name = sp.loc[:,'_value'].unique().tolist()[0]
output_col_order = sp.columns.tolist()[2:] + [value_name]
stacked = stacked.rename(value_name)
stacked.index = stacked.index.set_names([None, '_name'])
stacked_disordered = (
    stacked
    .reset_index(1)
    .merge(sp, on = "_name", how = "left")
    .drop(columns = ["_name", "_value"])
)
output = stacked_disordered[output_col_order]
output
a b v
0 1 2 1
1 1 2 3
2 1 2 2
3 1 2 4
So far so good 🎉 The test should pass (I haven't formally confirmed that yet, actually).
Unfortunately, spec is not always that 'simple'. The example we discussed above already points this out quite nicely:
> spec
# A tibble: 3 x 3
.name .value n
<chr> <chr> <chr>
1 x_1 x 1
2 x_2 x 2
3 y_2 y 2
This will not work in my code for pivot_longer_spec, as it falls apart starting at the following line:

value_name = sp.loc[:,'_value'].unique().tolist()[0]

Even if I remove unique().tolist()[0] (which will break the first test, but I could single that out if all values are the same), using a list further down does not work anyway, as pd.Series.rename() accepts a string or a dict as mapper, but not a list. So, that's pretty much where I am stuck currently.
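One direction I might try next (just a sketch, and it assumes pandas >= 0.24 for pd.MultiIndex.from_frame): use the spec to build a column MultiIndex and stack, which sidesteps the rename entirely:

import pandas as pd

df = pd.DataFrame({"id": ["A", "B"], "x_1": [1, 3], "x_2": [2, 4], "y_2": ["a", "b"]})
sp = pd.DataFrame({
    "_name": ["x_1", "x_2", "y_2"],
    "_value": ["x", "x", "y"],
    "n": ["1", "2", "2"]
})

# relabel the measure columns with a (_value, n) MultiIndex taken from the spec,
# then stack the n level--each unique _value becomes its own output column
measures = df.loc[:, sp["_name"]]
measures.columns = pd.MultiIndex.from_frame(sp[["_value", "n"]])
long = measures.stack("n").reset_index("n")
df[["id"]].join(long)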
If you have any suggestions let me know and I can give it another go :)
Here's to hoping pivot_wider will be easier 😅
Hi,

taking you up on your tweet from a few months ago, to see what happens 😉

First off, hats off for starting this fantastic project. It has made me curious to give data analysis in python a go again, taking siuba for a spin.

As you might know, in tidyr 1.0.0 the new verbs pivot_longer and pivot_wider were introduced, building on gather and spread respectively. Having used the tidyverse for a few years already, I'm quite happy with the gather and spread verbs, but in line with Hadley Wickham I agree that for (potential) new users pivot_longer and pivot_wider are probably more intuitive.

As far as I can tell from a cursory glance, your implementation of gather matches the original tidyr implementation, whereas pivot_longer for instance offers additional parameters such as names_to and values_to, to easily set column names.

Cheers, Brian