pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License
1.36k stars 170 forks

implement similar functions for polars #1352

Open samukweku opened 6 months ago

samukweku commented 6 months ago

In relation to #1343 - this is a list of functions missing in the polars library that could be implemented:

Care should be taken not to create a function if an existing solution already covers it (possibly under a different name, or as a combination of existing polars functions that covers all use cases of the listed functions above).

samukweku commented 5 months ago

The current pivot_longer implementation is not good enough. I'll submit a PR with improvements.

samukweku commented 4 months ago

I assumed (wrongly) that polars' join maintains order (it only does so for a left join). I need to rethink the computation logic for complete. I'll submit a PR with improvements.

3SMMZRjWgS commented 4 months ago

Eagerly awaiting the 0.28.0 release!

samukweku commented 3 months ago

@3SMMZRjWgS version 0.28.0 is released. Would love feedback on the functions - would also love PRs if you are interested.

samukweku commented 2 months ago

~~Possible improvement to jn.polars.pivot_longer based on https://github.com/pola-rs/polars/pull/18519?~~ The performance hit (about 2x) is acceptable, considering the jn.pivot_longer implementation is flexible enough to deal with multiple columns that are not '.value'. Probably add a note advising users about this?
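For readers unfamiliar with the '.value' spec, here is a rough, stdlib-only sketch of the name-splitting idea behind names_sep/names_to (not pyjanitor's actual code): the piece of each column name aligned with '.value' stays a column name in the output, while the remaining pieces become values of the new columns:

```python
names_to = ("part", ".value")
columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]

mapping = {}
for col in columns:
    pieces = col.split(".")  # names_sep='.'
    spec = dict(zip(names_to, pieces))
    value_col = spec.pop(".value")  # e.g. 'Length' or 'Width'
    # group source columns by the non-'.value' parts ('Sepal', 'Petal')
    mapping.setdefault(tuple(spec.values()), {})[value_col] = col

# mapping is:
# {('Sepal',): {'Length': 'Sepal.Length', 'Width': 'Sepal.Width'},
#  ('Petal',): {'Length': 'Petal.Length', 'Width': 'Petal.Width'}}
```

Each group then becomes one output row block, with 'Length'/'Width' as the reshaped value columns.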

samukweku commented 2 months ago

Example below of the performance hit for a single-column extraction:

import polars as pl
import janitor.polars

df = pl.DataFrame(
    {
        "Sepal.Length": [5.1, 5.9],
        "Sepal.Width": [3.5, 3.0],
        "Petal.Length": [1.4, 5.1],
        "Petal.Width": [0.2, 1.8],
        "Species": ["setosa", "virginica"],
    }
)
df

shape: (2, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species   │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
│ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
│ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘

DF = pl.concat([df] * 5_000_000, rechunk=True)

orig = (
    DF.select(
        'Species',
        pl.struct(Length='Sepal.Length', Width='Sepal.Width').alias('Sepal'),
        pl.struct(Length='Petal.Length', Width='Petal.Width').alias('Petal'),
    )
    .unpivot(index='Species', variable_name='part')
    .unnest('value')
)

other = DF.pivot_longer(index='Species', names_sep='.', names_to=('part', '.value'))

orig.sort(pl.all()).equals(other.sort(pl.all()))
# True

%timeit orig = DF.select('Species', pl.struct(Length='Sepal.Length', Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length', Width='Petal.Width').alias('Petal')).unpivot(index='Species', variable_name='part').unnest('value')
# 95.9 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit other = DF.pivot_longer(index='Species', names_sep='.', names_to=('part', '.value'))
# 188 ms ± 6.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That's roughly a 2x slowdown of pivot_longer compared to executing this manually. The limitation of the orig approach is that you cannot extend it to extract multiple columns (not that I'm aware of). Take a silly example below:

another = DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to=('part1', 'part2', '.value'))
another
shape: (20_000_000, 5)
┌───────────┬───────┬───────┬────────┬───────┐
│ Species   ┆ part1 ┆ part2 ┆ Length ┆ Width │
│ ---       ┆ ---   ┆ ---   ┆ ---    ┆ ---   │
│ str       ┆ str   ┆ str   ┆ f64    ┆ f64   │
╞═══════════╪═══════╪═══════╪════════╪═══════╡
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ virginica ┆ Pe    ┆ tal   ┆ 5.1    ┆ 1.8   │
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ virginica ┆ Pe    ┆ tal   ┆ 5.1    ┆ 1.8   │
│ setosa    ┆ Pe    ┆ tal   ┆ 1.4    ┆ 0.2   │
│ …         ┆ …     ┆ …     ┆ …      ┆ …     │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
│ setosa    ┆ Se    ┆ pal   ┆ 5.1    ┆ 3.5   │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
│ setosa    ┆ Se    ┆ pal   ┆ 5.1    ┆ 3.5   │
│ virginica ┆ Se    ┆ pal   ┆ 5.9    ┆ 3.0   │
└───────────┴───────┴───────┴────────┴───────┘

%timeit another = DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to=('part1', 'part2', '.value'))
# 204 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

DF.select('Species', pl.struct(Length='Sepal.Length', Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length', Width='Petal.Width').alias('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0, length=2), part2=pl.col.variable.str.slice(offset=2)).drop('variable')
shape: (20_000_000, 5)
┌───────────┬────────┬───────┬───────┬───────┐
│ Species   ┆ Length ┆ Width ┆ part1 ┆ part2 │
│ ---       ┆ ---    ┆ ---   ┆ ---   ┆ ---   │
│ str       ┆ f64    ┆ f64   ┆ str   ┆ str   │
╞═══════════╪════════╪═══════╪═══════╪═══════╡
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ virginica ┆ 5.9    ┆ 3.0   ┆ Se    ┆ pal   │
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ virginica ┆ 5.9    ┆ 3.0   ┆ Se    ┆ pal   │
│ setosa    ┆ 5.1    ┆ 3.5   ┆ Se    ┆ pal   │
│ …         ┆ …      ┆ …     ┆ …     ┆ …     │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
│ setosa    ┆ 1.4    ┆ 0.2   ┆ Pe    ┆ tal   │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
│ setosa    ┆ 1.4    ┆ 0.2   ┆ Pe    ┆ tal   │
│ virginica ┆ 5.1    ┆ 1.8   ┆ Pe    ┆ tal   │
└───────────┴────────┴───────┴───────┴───────┘

%timeit DF.select('Species', pl.struct(Length='Sepal.Length', Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length', Width='Petal.Width').alias('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0, length=2), part2=pl.col.variable.str.slice(offset=2)).drop('variable')
# 301 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's just a crude example of where jn.pivot_longer can be advantageous, especially when there are multiple columns to be extracted.

samukweku commented 2 months ago

The long rant above does leave a question though - can we possibly speed up jn.pivot_longer within Rust? It might be possible 🤷

Phil-Garmann commented 1 month ago

@samukweku I just wanted to let you know that the pl.join_where function does not offer the same flexibility as conditional_join, as the former is implemented strictly as an inner join. At least this is the behavior in the current stable version of Polars, though it could be subject to change in the future - link to the documentation.

Other join types for join_where have been requested on the polars repo, but whether/when this will be picked up by their devs is still to be determined. Whether this changes the "feature" list to be implemented for polars in pyjanitor is for you to decide.

Either way, I am looking forward to using pyjanitor alongside polars in the future 🚀

samukweku commented 1 month ago

@Phil-Garmann thanks for the feedback; it is much appreciated. I'll keep an eye on the progress for join-where and see how it evolves.