pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Improve `pivot()` and `melt()` to make reshaping data as flexible as in the tidyverse. #8903

Open NoToothSon-Master opened 1 year ago

NoToothSon-Master commented 1 year ago

Problem description

Compared to R's tidyverse, polars doesn't do a very good job of reshaping data. polars' `pivot()` and `melt()` can only handle the simplest reshapes, while R's `pivot_longer()` and `pivot_wider()` can solve a wide range of data-reshaping problems in data science.

NoToothSon-Master commented 1 year ago

For example, take the who dataset:

who
#> # A tibble: 7,240 × 60
#>    country     iso2  iso3   year new_sp_…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵
#>    <chr>       <chr> <chr> <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Afghanistan AF    AFG    1980        NA      NA      NA      NA      NA
#>  2 Afghanistan AF    AFG    1981        NA      NA      NA      NA      NA
#>  3 Afghanistan AF    AFG    1982        NA      NA      NA      NA      NA
#>  4 Afghanistan AF    AFG    1983        NA      NA      NA      NA      NA
#>  5 Afghanistan AF    AFG    1984        NA      NA      NA      NA      NA
#>  6 Afghanistan AF    AFG    1985        NA      NA      NA      NA      NA
#>  7 Afghanistan AF    AFG    1986        NA      NA      NA      NA      NA
#>  8 Afghanistan AF    AFG    1987        NA      NA      NA      NA      NA
#>  9 Afghanistan AF    AFG    1988        NA      NA      NA      NA      NA
#> 10 Afghanistan AF    AFG    1989        NA      NA      NA      NA      NA
#> # … with 7,230 more rows, 51 more variables: new_sp_m5564 <dbl>,
#> #   new_sp_m65 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
#> #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>,
#> #   new_sp_f5564 <dbl>, new_sp_f65 <dbl>, new_sn_m014 <dbl>,
#> #   new_sn_m1524 <dbl>, new_sn_m2534 <dbl>, new_sn_m3544 <dbl>,
#> #   new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, new_sn_m65 <dbl>,
#> #   new_sn_f014 <dbl>, new_sn_f1524 <dbl>, new_sn_f2534 <dbl>, …

country, iso2, iso3, and year are already variables, so they can be left as is. But the columns from new_sp_m014 to newrel_f65 encode four variables in their names: the new/new_ prefix, the method of diagnosis, the gender, and the age group.
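As a minimal illustration (plain Python; the pattern is an assumption inferred from the column names shown), those names can be pulled apart with one regular expression:

```python
import re

# Hedged sketch: "new" is sometimes followed by "_" (new_sp_m014) and
# sometimes not (newrel_f65), so the underscore is made optional.
pattern = re.compile(r"^(?P<prefix>new)_?(?P<diagnosis>[a-z]+)_(?P<gender>[mf])(?P<age>\d+)$")

print(pattern.match("new_sp_m014").groupdict())
# {'prefix': 'new', 'diagnosis': 'sp', 'gender': 'm', 'age': '014'}
print(pattern.match("newrel_f65").groupdict())
# {'prefix': 'new', 'diagnosis': 'rel', 'gender': 'f', 'age': '65'}
```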

NoToothSon-Master commented 1 year ago

Multiple observations per row For example, take the household dataset:

household
#> # A tibble: 5 × 5
#>   family dob_child1 dob_child2 name_child1 name_child2
#>    <int> <date>     <date>     <chr>       <chr>      
#> 1      1 1998-11-26 2000-01-29 Susan       Jose       
#> 2      2 1996-06-22 NA         Mark        NA         
#> 3      3 2002-07-11 2004-04-05 Sam         Seth       
#> 4      4 2004-10-10 2009-08-27 Craig       Khai       
#> 5      5 2000-12-05 2005-02-28 Parker      Gracie

Note that we have two pieces of information (or values) for each child: their name and their dob (date of birth). These need to go into separate columns in the result. Again we supply multiple variables to names_to, using names_sep to split up each variable name. Note the special name .value: this tells pivot_longer() that that part of the column name specifies the “value” being measured (which will become a variable in the output).

household %>% 
  pivot_longer(
    cols = !family, 
    names_to = c(".value", "child"), 
    names_sep = "_", 
    values_drop_na = TRUE
  )
#> # A tibble: 9 × 4
#>   family child  dob        name  
#>    <int> <chr>  <date>     <chr> 
#> 1      1 child1 1998-11-26 Susan 
#> 2      1 child2 2000-01-29 Jose  
#> 3      2 child1 1996-06-22 Mark  
#> 4      3 child1 2002-07-11 Sam   
#> 5      3 child2 2004-04-05 Seth  
#> 6      4 child1 2004-10-10 Craig 
#> 7      4 child2 2009-08-27 Khai  
#> 8      5 child1 2000-12-05 Parker
#> 9      5 child2 2005-02-28 Gracie

Note the use of values_drop_na = TRUE: the input shape forces the creation of explicit missing values for observations that don't exist.

cmdlineluser commented 1 year ago

For example, take the household dataset:

This one looks like a simple enough .melt + .pivot? Update: there is no .pivot for LazyFrames, so the alternative approach is a little more involved.

import polars as pl

household = b"""
family,dob_child1,dob_child2,name_child1,name_child2
1,1998-11-26,2000-01-29,Susan,Jose
2,1996-06-22,,Mark,
3,2002-07-11,2004-04-05,Sam,Seth
4,2004-10-10,2009-08-27,Craig,Khai
5,2000-12-05,2005-02-28,Parker,Gracie
"""

df = pl.read_csv(household.strip()).lazy()

names = (
   pl.Series(df.select(pl.exclude("family")).columns)   
     .str.splitn(by="_", n=2)
     .struct.rename_fields(["prefix", "suffix"])
     .struct.unnest()
     .select(pl.all().unique(maintain_order=True))
)

names = names.with_columns(pattern = 
   "^" + pl.col("prefix") + "_" + "(" + pl.col("suffix").str.concat("|") + ")" + "$"
)

columns = (
   pl.concat_list(pl.col(row["pattern"])
     .alias(row["prefix"])) 
   for row in names.to_dicts()
)

(
   df.select(
      "family",
      pl.lit([names["suffix"]]).alias("suffix"),
      *columns
   )
   .explode(pl.exclude("family"))
   .collect()
)

# shape: (10, 4)
# ┌────────┬────────┬────────────┬────────┐
# │ family ┆ suffix ┆ dob        ┆ name   │
# │ ---    ┆ ---    ┆ ---        ┆ ---    │
# │ i64    ┆ str    ┆ str        ┆ str    │
# ╞════════╪════════╪════════════╪════════╡
# │ 1      ┆ child1 ┆ 1998-11-26 ┆ Susan  │
# │ 1      ┆ child2 ┆ 2000-01-29 ┆ Jose   │
# │ 2      ┆ child1 ┆ 1996-06-22 ┆ Mark   │
# │ 2      ┆ child2 ┆ null       ┆ null   │
# │ 3      ┆ child1 ┆ 2002-07-11 ┆ Sam    │
# │ 3      ┆ child2 ┆ 2004-04-05 ┆ Seth   │
# │ 4      ┆ child1 ┆ 2004-10-10 ┆ Craig  │
# │ 4      ┆ child2 ┆ 2009-08-27 ┆ Khai   │
# │ 5      ┆ child1 ┆ 2000-12-05 ┆ Parker │
# │ 5      ┆ child2 ┆ 2005-02-28 ┆ Gracie │
# └────────┴────────┴────────────┴────────┘ 

For example, take the who dataset:

I'm not sure I follow this one exactly, but it looks like this is a .melt followed by a .join to remap the names?

# https://www.who.int/teams/global-tuberculosis-programme/data
df = pl.read_csv("https://extranet.who.int/tme/generateCSV.asp?ds=notifications").lazy()

names = pl.DataFrame({
   "variable": df.select(r"^new_?[^_]+_[mf]\d+$").columns
})

names = names.with_columns(
   diagnosis = pl.col("variable").str.extract(r"^new_?([^_]+)"),
   gender = pl.col("variable").str.extract(r"^new_?[^_]+_([mf])"),
   age = pl.col("variable").str.extract(r"(\d+)$"),
)

id_vars = df.select(
   pl.exclude(names.get_column("variable").to_list())
).columns

(df.melt(id_vars=id_vars, value_name="count")
   .join(names.lazy(), on="variable")
   .drop("variable")
   .collect())

# shape: (713_760, 123)
# ┌─────────────┬──────┬──────┬─────────────┬───┬───────┬───────────┬────────┬─────┐
# │ country     ┆ iso2 ┆ iso3 ┆ iso_numeric ┆ … ┆ count ┆ diagnosis ┆ gender ┆ age │
# │ ---         ┆ ---  ┆ ---  ┆ ---         ┆   ┆ ---   ┆ ---       ┆ ---    ┆ --- │
# │ str         ┆ str  ┆ str  ┆ i64         ┆   ┆ str   ┆ str       ┆ str    ┆ str │
# ╞═════════════╪══════╪══════╪═════════════╪═══╪═══════╪═══════════╪════════╪═════╡
# │ Afghanistan ┆ AF   ┆ AFG  ┆ 4           ┆ … ┆ null  ┆ sp        ┆ m      ┆ 04  │
# │ Afghanistan ┆ AF   ┆ AFG  ┆ 4           ┆ … ┆ null  ┆ sp        ┆ m      ┆ 04  │
# │ Afghanistan ┆ AF   ┆ AFG  ┆ 4           ┆ … ┆ null  ┆ sp        ┆ m      ┆ 04  │
# │ Afghanistan ┆ AF   ┆ AFG  ┆ 4           ┆ … ┆ null  ┆ sp        ┆ m      ┆ 04  │
# │ …           ┆ …    ┆ …    ┆ …           ┆ … ┆ …     ┆ …         ┆ …      ┆ …   │
# │ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 716         ┆ … ┆ 797   ┆ rel       ┆ f      ┆ 65  │
# │ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 716         ┆ … ┆ 652   ┆ rel       ┆ f      ┆ 65  │
# │ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 716         ┆ … ┆ 402   ┆ rel       ┆ f      ┆ 65  │
# │ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 716         ┆ … ┆ 573   ┆ rel       ┆ f      ┆ 65  │
# └─────────────┴──────┴──────┴─────────────┴───┴───────┴───────────┴────────┴─────┘

Being able to extract multiple groups at once #7990 could clean up the .str.extract part.

avimallu commented 1 year ago

@cmdlineluser, at least in the R world the reason for these functions is that they allow for "efficient" and "fast" processing of such datasets in the calls to pivot and melt itself (see the data.table vignette here). In addition, they're also pretty intuitive once you understand the syntax.

Polars should get similar benefits if these operations are done in a single step without materializing all the other DataFrames that you needed to in the manipulation you've shown (even piping them causes materialization of the DataFrame, right?).

cmdlineluser commented 1 year ago

@avimallu Thanks for the feedback.

The household / .pivot() example required a materialized dataframe - I did not realize that .pivot does not exist in the Lazy API.

I changed both examples to use the lazy api - the household example is a little more complex now.

I guess the feature request is asking for .pivot_longer() to be ported to polars though.

samukweku commented 9 months ago

I implemented pivot_longer for pandas and am interested in doing the same for polars. Any tips/guidance on where/how I should go about it? Is there interest in implementing this?

deanm0000 commented 8 months ago

@samukweku I'm not in a position to say whether the core team that merges PRs would want this feature in general. However, I do know there's a strong aversion to Python-only features, so I think they'd only consider it if it were written in Rust. I'm not sure whether that matches your intention.

samukweku commented 8 months ago

@deanm0000 I did not dwell on it since it did not seem like there was much interest in it. Yes, the implementation would have to be in Rust, which is the right approach. Besides, I feel that with the current tools in Polars it is possible to replicate pivot_longer in a generic form. Thanks for the feedback though - much appreciated.

samukweku commented 1 month ago

pyjanitor has a Polars variant of pivot_longer, as well as pivot_longer_spec, which replicates tidyr's pivot_longer and is efficient as well.

Examples based on the data shared above in this issue:

import polars as pl
import janitor.polars

household = b"""
family,dob_child1,dob_child2,name_child1,name_child2
1,1998-11-26,2000-01-29,Susan,Jose
2,1996-06-22,,Mark,
3,2002-07-11,2004-04-05,Sam,Seth
4,2004-10-10,2009-08-27,Craig,Khai
5,2000-12-05,2005-02-28,Parker,Gracie
"""

df = pl.read_csv(household.strip())

df
shape: (5, 5)
┌────────┬────────────┬────────────┬─────────────┬─────────────┐
│ family ┆ dob_child1 ┆ dob_child2 ┆ name_child1 ┆ name_child2 │
│ ---    ┆ ---        ┆ ---        ┆ ---         ┆ ---         │
│ i64    ┆ str        ┆ str        ┆ str         ┆ str         │
╞════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 1      ┆ 1998-11-26 ┆ 2000-01-29 ┆ Susan       ┆ Jose        │
│ 2      ┆ 1996-06-22 ┆ null       ┆ Mark        ┆ null        │
│ 3      ┆ 2002-07-11 ┆ 2004-04-05 ┆ Sam         ┆ Seth        │
│ 4      ┆ 2004-10-10 ┆ 2009-08-27 ┆ Craig       ┆ Khai        │
│ 5      ┆ 2000-12-05 ┆ 2005-02-28 ┆ Parker      ┆ Gracie      │
└────────┴────────────┴────────────┴─────────────┴─────────────┘

df.pivot_longer(index='family', names_to = ('.value','child'), names_sep='_')
shape: (10, 4)
┌────────┬────────┬────────────┬────────┐
│ family ┆ child  ┆ dob        ┆ name   │
│ ---    ┆ ---    ┆ ---        ┆ ---    │
│ i64    ┆ str    ┆ str        ┆ str    │
╞════════╪════════╪════════════╪════════╡
│ 1      ┆ child1 ┆ 1998-11-26 ┆ Susan  │
│ 2      ┆ child1 ┆ 1996-06-22 ┆ Mark   │
│ 3      ┆ child1 ┆ 2002-07-11 ┆ Sam    │
│ 4      ┆ child1 ┆ 2004-10-10 ┆ Craig  │
│ 5      ┆ child1 ┆ 2000-12-05 ┆ Parker │
│ 1      ┆ child2 ┆ 2000-01-29 ┆ Jose   │
│ 2      ┆ child2 ┆ null       ┆ null   │
│ 3      ┆ child2 ┆ 2004-04-05 ┆ Seth   │
│ 4      ┆ child2 ┆ 2009-08-27 ┆ Khai   │
│ 5      ┆ child2 ┆ 2005-02-28 ┆ Gracie │
└────────┴────────┴────────────┴────────┘

who_url='https://raw.githubusercontent.com/tidyverse/tidyr/main/data-raw/who.csv'
df=pl.read_csv(who_url)
df.head()
shape: (5, 60)
┌─────────────┬──────┬──────┬──────┬───┬──────────────┬──────────────┬──────────────┬────────────┐
│ country     ┆ iso2 ┆ iso3 ┆ year ┆ … ┆ newrel_f3544 ┆ newrel_f4554 ┆ newrel_f5564 ┆ newrel_f65 │
│ ---         ┆ ---  ┆ ---  ┆ ---  ┆   ┆ ---          ┆ ---          ┆ ---          ┆ ---        │
│ str         ┆ str  ┆ str  ┆ i64  ┆   ┆ str          ┆ str          ┆ str          ┆ str        │
╞═════════════╪══════╪══════╪══════╪═══╪══════════════╪══════════════╪══════════════╪════════════╡
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1980 ┆ … ┆ NA           ┆ NA           ┆ NA           ┆ NA         │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1981 ┆ … ┆ NA           ┆ NA           ┆ NA           ┆ NA         │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1982 ┆ … ┆ NA           ┆ NA           ┆ NA           ┆ NA         │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1983 ┆ … ┆ NA           ┆ NA           ┆ NA           ┆ NA         │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1984 ┆ … ┆ NA           ┆ NA           ┆ NA           ┆ NA         │
└─────────────┴──────┴──────┴──────┴───┴──────────────┴──────────────┴──────────────┴────────────┘

(df.pivot_longer(
     index=['country','iso2','iso3','year'],
     names_to=('diagnosis','gender','age'),
     names_pattern=r"new_?(.+)_(.)([0-9]+)",
     values_to='count')
)
shape: (405_440, 8)
┌─────────────┬──────┬──────┬──────┬───────────┬────────┬─────┬───────┐
│ country     ┆ iso2 ┆ iso3 ┆ year ┆ diagnosis ┆ gender ┆ age ┆ count │
│ ---         ┆ ---  ┆ ---  ┆ ---  ┆ ---       ┆ ---    ┆ --- ┆ ---   │
│ str         ┆ str  ┆ str  ┆ i64  ┆ str       ┆ str    ┆ str ┆ str   │
╞═════════════╪══════╪══════╪══════╪═══════════╪════════╪═════╪═══════╡
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1980 ┆ sp        ┆ m      ┆ 014 ┆ NA    │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1981 ┆ sp        ┆ m      ┆ 014 ┆ NA    │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1982 ┆ sp        ┆ m      ┆ 014 ┆ NA    │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1983 ┆ sp        ┆ m      ┆ 014 ┆ NA    │
│ Afghanistan ┆ AF   ┆ AFG  ┆ 1984 ┆ sp        ┆ m      ┆ 014 ┆ NA    │
│ …           ┆ …    ┆ …    ┆ …    ┆ …         ┆ …      ┆ …   ┆ …     │
│ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 2009 ┆ rel       ┆ f      ┆ 65  ┆ NA    │
│ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 2010 ┆ rel       ┆ f      ┆ 65  ┆ NA    │
│ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 2011 ┆ rel       ┆ f      ┆ 65  ┆ NA    │
│ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 2012 ┆ rel       ┆ f      ┆ 65  ┆ NA    │
│ Zimbabwe    ┆ ZW   ┆ ZWE  ┆ 2013 ┆ rel       ┆ f      ┆ 65  ┆ 725   │
└─────────────┴──────┴──────┴──────┴───────────┴────────┴─────┴───────┘

You also have more control by passing a specification to pivot_longer_spec:

# https://stackoverflow.com/q/61367186/7175713
from janitor.polars import pivot_longer_spec
data={'group': [1.0, 2.1, 2.2, 2.3, 2.4, 3.0],
 'BP': [0.51, 0.67, 0.36, 0.09, 0.68, 0.53],
 'HS': [0.15, 0.09, 0.13, 0.17, 0.12, 0.15],
 'BB': [0.05, 0.06, 0.07, 0.09, 0.07, 0.06],
 'lowerBP': [0.16, 0.09, 0.12, 0.13, 0.12, 0.14],
 'upperBP': [0.18, 0.11, 0.15, 0.16, 0.14, 0.16],
 'lowerHS': [0.5, 0.66, 0.34, 0.08, 0.66, 0.52],
 'upperHS': [0.52, 0.68, 0.38, 0.11, 0.69, 0.53],
 'lowerBB': [0.14, 0.08, 0.12, 0.15, 0.11, 0.15],
 'upperBB': [0.16, 0.1, 0.14, 0.18, 0.13, 0.16]}
df = pl.DataFrame(data)

df
shape: (6, 10)
┌───────┬──────┬──────┬──────┬───┬─────────┬─────────┬─────────┬─────────┐
│ group ┆ BP   ┆ HS   ┆ BB   ┆ … ┆ lowerHS ┆ upperHS ┆ lowerBB ┆ upperBB │
│ ---   ┆ ---  ┆ ---  ┆ ---  ┆   ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ f64   ┆ f64  ┆ f64  ┆ f64  ┆   ┆ f64     ┆ f64     ┆ f64     ┆ f64     │
╞═══════╪══════╪══════╪══════╪═══╪═════════╪═════════╪═════════╪═════════╡
│ 1.0   ┆ 0.51 ┆ 0.15 ┆ 0.05 ┆ … ┆ 0.5     ┆ 0.52    ┆ 0.14    ┆ 0.16    │
│ 2.1   ┆ 0.67 ┆ 0.09 ┆ 0.06 ┆ … ┆ 0.66    ┆ 0.68    ┆ 0.08    ┆ 0.1     │
│ 2.2   ┆ 0.36 ┆ 0.13 ┆ 0.07 ┆ … ┆ 0.34    ┆ 0.38    ┆ 0.12    ┆ 0.14    │
│ 2.3   ┆ 0.09 ┆ 0.17 ┆ 0.09 ┆ … ┆ 0.08    ┆ 0.11    ┆ 0.15    ┆ 0.18    │
│ 2.4   ┆ 0.68 ┆ 0.12 ┆ 0.07 ┆ … ┆ 0.66    ┆ 0.69    ┆ 0.11    ┆ 0.13    │
│ 3.0   ┆ 0.53 ┆ 0.15 ┆ 0.06 ┆ … ┆ 0.52    ┆ 0.53    ┆ 0.15    ┆ 0.16    │
└───────┴──────┴──────┴──────┴───┴─────────┴─────────┴─────────┴─────────┘

spec = {'.name': df.columns[1:],
        '.value': ['values','values','values','lower','upper','lower','upper','lower','upper'],
        'names': ['BP','HS','BB','BP','BP','HS','HS','BB','BB']}
spec = pl.DataFrame(spec)
spec
shape: (9, 3)
┌─────────┬────────┬───────┐
│ .name   ┆ .value ┆ names │
│ ---     ┆ ---    ┆ ---   │
│ str     ┆ str    ┆ str   │
╞═════════╪════════╪═══════╡
│ BP      ┆ values ┆ BP    │
│ HS      ┆ values ┆ HS    │
│ BB      ┆ values ┆ BB    │
│ lowerBP ┆ lower  ┆ BP    │
│ upperBP ┆ upper  ┆ BP    │
│ lowerHS ┆ lower  ┆ HS    │
│ upperHS ┆ upper  ┆ HS    │
│ lowerBB ┆ lower  ┆ BB    │
│ upperBB ┆ upper  ┆ BB    │
└─────────┴────────┴───────┘

pivot_longer_spec(df=df,spec=spec)

shape: (18, 5)
┌───────┬───────┬────────┬───────┬───────┐
│ group ┆ names ┆ values ┆ lower ┆ upper │
│ ---   ┆ ---   ┆ ---    ┆ ---   ┆ ---   │
│ f64   ┆ str   ┆ f64    ┆ f64   ┆ f64   │
╞═══════╪═══════╪════════╪═══════╪═══════╡
│ 1.0   ┆ HS    ┆ 0.15   ┆ 0.5   ┆ 0.52  │
│ 2.1   ┆ HS    ┆ 0.09   ┆ 0.66  ┆ 0.68  │
│ 2.2   ┆ HS    ┆ 0.13   ┆ 0.34  ┆ 0.38  │
│ 2.3   ┆ HS    ┆ 0.17   ┆ 0.08  ┆ 0.11  │
│ 2.4   ┆ HS    ┆ 0.12   ┆ 0.66  ┆ 0.69  │
│ …     ┆ …     ┆ …      ┆ …     ┆ …     │
│ 2.1   ┆ BB    ┆ 0.06   ┆ 0.08  ┆ 0.1   │
│ 2.2   ┆ BB    ┆ 0.07   ┆ 0.12  ┆ 0.14  │
│ 2.3   ┆ BB    ┆ 0.09   ┆ 0.15  ┆ 0.18  │
│ 2.4   ┆ BB    ┆ 0.07   ┆ 0.11  ┆ 0.13  │
│ 3.0   ┆ BB    ┆ 0.06   ┆ 0.15  ┆ 0.16  │
└───────┴───────┴────────┴───────┴───────┘