Open samukweku opened 6 months ago
the current pivot_longer
implementation is not good enough. I'll submit a PR with improvements
I assumed (wrongly) that polars' join maintains order (it only does so for left join). need to rethink the computation logic for complete
. I'll submit a PR with improvements
eagerly awaits for 0.28.0 release!
@3SMMZRjWgS version 0.28.0 is released. would love feedback on the functions - would also love PRs if you are interested.
~possible improvement to jn.polars.pivot_longer
based on this https://github.com/pola-rs/polars/pull/18519?~ the performance hit (about 2x) is ok, considering the jn.pivot_longer
implementation is flexible enough to deal with multiple columns that are not '.value'. probably add a note advising users about this?
example below about the performance hit for a single column extraction:
import polars as pl
import janitor.polars
In [58]: df = pl.DataFrame(
...: {
...: "Sepal.Length": [5.1, 5.9],
...: "Sepal.Width": [3.5, 3.0],
...: "Petal.Length": [1.4, 5.1],
...: "Petal.Width": [0.2, 1.8],
...: "Species": ["setosa", "virginica"],
...: }
...: )
...: df
Out[58]:
shape: (2, 5)
┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
│ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ str │
╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
│ 5.9 ┆ 3.0 ┆ 5.1 ┆ 1.8 ┆ virginica │
└──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
DF = pl.concat([df]*5_000_000,rechunk=True)
orig=(DF
.select('Species',
pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'),
pl.struct(Length='Petal.Length',Width='Petal.Width').alias('Petal'))
.unpivot(index='Species', variable_name='part').unnest('value')
)
other=DF.pivot_longer(index='Species', names_sep='.', names_to = ('part', '.value'))
In [72]: orig.sort(pl.all()).equals(other.sort(pl.all()))
Out[72]: True
In [73]: %timeit orig=DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal
...: .Width').alias('Petal')).unpivot(index='Species', variable_name='part').unnest('value')
95.9 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [75]: %timeit other=DF.pivot_longer(index='Species', names_sep='.', names_to = ('part', '.value'))
188 ms ± 6.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
a 2x performance slowdown of pivot_longer
compared to manually executing this. The limitation of the orig
approach is you cannot extend it to extract multiple columns (not that i'm aware of) . take a silly example below:
In [76]: another=DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to = ('part1', 'part2', '.value'))
In [77]: another
Out[77]:
shape: (20_000_000, 5)
┌───────────┬───────┬───────┬────────┬───────┐
│ Species ┆ part1 ┆ part2 ┆ Length ┆ Width │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 │
╞═══════════╪═══════╪═══════╪════════╪═══════╡
│ setosa ┆ Pe ┆ tal ┆ 1.4 ┆ 0.2 │
│ virginica ┆ Pe ┆ tal ┆ 5.1 ┆ 1.8 │
│ setosa ┆ Pe ┆ tal ┆ 1.4 ┆ 0.2 │
│ virginica ┆ Pe ┆ tal ┆ 5.1 ┆ 1.8 │
│ setosa ┆ Pe ┆ tal ┆ 1.4 ┆ 0.2 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ virginica ┆ Se ┆ pal ┆ 5.9 ┆ 3.0 │
│ setosa ┆ Se ┆ pal ┆ 5.1 ┆ 3.5 │
│ virginica ┆ Se ┆ pal ┆ 5.9 ┆ 3.0 │
│ setosa ┆ Se ┆ pal ┆ 5.1 ┆ 3.5 │
│ virginica ┆ Se ┆ pal ┆ 5.9 ┆ 3.0 │
└───────────┴───────┴───────┴────────┴───────┘
In [78]: %timeit another=DF.pivot_longer(index='Species', names_pattern=r"(.{2})(.+)\.(.+)", names_to = ('part1', 'part2', '.value'))
204 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [86]: DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Width').alia
...: s('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0,length=2), part2=pl.col.variable.str.sli
...: ce(offset=2)).drop('variable')
Out[86]:
shape: (20_000_000, 5)
┌───────────┬────────┬───────┬───────┬───────┐
│ Species ┆ Length ┆ Width ┆ part1 ┆ part2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ str ┆ str │
╞═══════════╪════════╪═══════╪═══════╪═══════╡
│ setosa ┆ 5.1 ┆ 3.5 ┆ Se ┆ pal │
│ virginica ┆ 5.9 ┆ 3.0 ┆ Se ┆ pal │
│ setosa ┆ 5.1 ┆ 3.5 ┆ Se ┆ pal │
│ virginica ┆ 5.9 ┆ 3.0 ┆ Se ┆ pal │
│ setosa ┆ 5.1 ┆ 3.5 ┆ Se ┆ pal │
│ … ┆ … ┆ … ┆ … ┆ … │
│ virginica ┆ 5.1 ┆ 1.8 ┆ Pe ┆ tal │
│ setosa ┆ 1.4 ┆ 0.2 ┆ Pe ┆ tal │
│ virginica ┆ 5.1 ┆ 1.8 ┆ Pe ┆ tal │
│ setosa ┆ 1.4 ┆ 0.2 ┆ Pe ┆ tal │
│ virginica ┆ 5.1 ┆ 1.8 ┆ Pe ┆ tal │
└───────────┴────────┴───────┴───────┴───────┘
In [85]: %timeit DF.select('Species', pl.struct(Length='Sepal.Length',Width='Sepal.Width').alias('Sepal'), pl.struct(Length='Petal.Length',Width='Petal.Widt
...: h').alias('Petal')).unpivot(index='Species').unnest('value').with_columns(part1=pl.col.variable.str.slice(offset=0,length=2), part2=pl.col.variable
...: .str.slice(offset=2)).drop('variable')
301 ms ± 2.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's just a crude example of where jn.pivot_longer
can be advantageous, especially when there are multiple columns to be extracted
the long rant above does leave a question though - can we possibly speed up jn.pivot_longer
within rust? it might be possible 🤷
@samukweku I just wanted to let you know that the pl.join_where
function does not offer the same flexibility as conditional_join
as the former is implemented strictly as a inner-join. At least this is the behavior in the current stable version of Polars, but it could be subject to change in the future - link to the documentation.
Other join-types for join_where
has been requested on the polars repo, but whether/when this will be picked up by their devs is still to be determined.
Whether this changes the "feature" list to be implemented for polars in pyjanitor is for you to decide.
Either way, I am looking forward to using pyjanitor alongside polars in the future 🚀
@Phil-Garmann thanks for the feedback; it is much appreciated. I'll keep an eye on the progress for join-where
and see how it evolves.
in relation to #1343 - this is a list of functions missing in the polars library that could be implemented :
clean_names
pivot_longer
pivot_wider
xlsx_tables
xlsx_cells
read_commandline
conditional_join
~ polars has a pl.join_where to cover thiscomplete
expand_grid
~pl.join
withhow='cross'
covers thisconvert_excel_date
convert_matlab_date
convert_unix_date
~pl.from_epoch
covers thisbin_numeric
~pl.Expr.cut
covers thisconcatenate_columns
~ can be replicated withpl.concat_str
deconcatenate_columns
~pl.Expr.str.split
covers thisfactorize_columns
~pl.rank(dense)
orpl.Expr.to_physical
covers thisget_dupes
~Expr.is_duplicated()
covers thisjitter
limit_column_characters
min_max_scale
move
~ can be replicated with polars' selectorsrow_to_names
shuffle
~pl.Expr.shuffle
covers thissort_naturally
take_first
~group_by.first()
covers thisalso
Care should be taken to not create the function, if an existing solution already exists for any of these functions (probably named differently, or a combination of existing polars functions that covers all use cases of any of the listed functions above)