tidypyverse / tidypandas

A grammar of data manipulation for pandas inspired by tidyverse
https://tidypyverse.github.io/tidypandas/
MIT License
93 stars 7 forks source link

tidypandas pivot_wider behaves differently from tidyr::pivot_wider #45

Closed grahitr closed 1 year ago

grahitr commented 1 year ago

I am illustrating this with an example.

tidyr::pivot_wider behaves as follows:

> tibble::tibble(a=c(1,2,3,1)
+                , b=c(1,2,3,1)
+                , c=c("aa", "bb", "aa", "bb")
+                , d=c(1,2,3,1)) %>%
+   tidyr::pivot_wider(id_cols=c("a", "b")
+                      , names_from = "c"
+                      , values_from = "d")
# A tibble: 3 × 4
      a     b    aa    bb
  <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1
2     2     2    NA     2
3     3     3     3    NA

But in tidypandas, pivot_wider computes cross products of all the index columns(doesn't matter if some combination is present in the original dataframe) and then applies pivot operation.

import pandas as pd
from tidypandas import tidyframe, tidy_accessor as tp
from tidypandas.tidy_utils import simplify
from tidypandas.tidyselect import *
>>> pd.DataFrame({"a": [1,2,3,1]
             , "b": [1,2,3,1]
             , "c": ["aa", "bb", "aa", "bb"]
             , "d": [1,2,3,1]}).tp.pivot_wider(id_cols=["a", "b"]
                                              , names_from=["c"]
                                              , values_from=["d"])

    a   b   aa  bb
0   1   1   1.0 1.0
1   1   2   NaN NaN
2   1   3   NaN NaN
3   2   1   NaN NaN
4   2   2   NaN 2.0
5   2   3   NaN NaN
6   3   1   NaN NaN
7   3   2   NaN NaN
8   3   3   3.0 NaN

While pandas pivot_table results without the creating additional rows that coming from a cross product and thus matches with tidyr::pivot_wider result

import pandas as pd
from tidypandas.tidy_utils import simplify
>>> simplify(pd.DataFrame({"a": [1,2,3,1]
             , "b": [1,2,3,1]
             , "c": ["aa", "bb", "aa", "bb"]
             , "d": [1,2,3,1]})
    .pivot_table(index=["a", "b"]
                , columns=["c"]
                , values=["d"]).reset_index(drop=False)
)
    a   b   d__aa   d__bb
0   1   1   1.0 1.0
1   2   2   NaN 2.0
2   3   3   3.0 NaN

@talegari Should we keep auto-filling with cross-product as an optional argument?

talegari commented 1 year ago

@grahitr I agree, we should not have combinations which have no value corresponding to at least one level.

I plan to implement this as follows: I will expose argument dropna in pandas.pivot_table as id_expand with default as False (aka dropna is True).

Agree?

grahitr commented 1 year ago

@talegari Yeah, I agree.

talegari commented 1 year ago

https://github.com/talegari/tidypandas/pull/46/commits/96623dff1b40136eddc38e8f0323d40d970c9443

talegari commented 1 year ago

correction: https://github.com/talegari/tidypandas/commit/23821b8405a47101b8c96028503d7beba9686e55