pyjanitor-devs / pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor
https://pyjanitor-devs.github.io/pyjanitor
MIT License

perf slower when `sort_by_appearance` is True for `pivot_longer` #1361

Closed: samukweku closed this issue 4 months ago

samukweku commented 6 months ago
import pandas as pd; import janitor as jn

In [11]: events = pd.DataFrame(
    ...:             {
    ...:                 "country": ["United States", "Russia", "China"],
    ...:                 "vault_2012_f": [
    ...:                     48.132,
    ...:                     46.366,
    ...:                     44.266,
    ...:                 ],
    ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
    ...:                 "vault_2016_f": [
    ...:                     46.866,
    ...:                     45.733,
    ...:                     44.332,
    ...:                 ],
    ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
    ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
    ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
    ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
    ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
    ...:             }
    ...:         )

In [12]: events
Out[12]:
         country  vault_2012_f  vault_2012_m  ...  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632  ...        45.266        45.999        43.757
1         Russia        46.366        46.866  ...        45.308        42.032        44.766
2          China        44.266        48.316  ...        45.133        42.066        43.799

[3 rows x 9 columns]

events = pd.concat([events]*100_000)

In [14]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_', sort_by_appearance=False)
62.5 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_', sort_by_appearance=True)
176 ms ± 1.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

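With `sort_by_appearance=True` the same call is roughly 2.8x slower (176 ms vs. 62.5 ms on 300,000 rows). Maybe we can improve the performance?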
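One possible direction, sketched below with plain pandas and NumPy rather than pyjanitor internals: if the extra cost comes from reordering the unpivoted rows into order of appearance, the positions can be computed directly instead of sorted. The sketch assumes the unpivoted frame is laid out as `pd.melt` produces it, with one block of `n_rows` rows per melted column. The helper name `interleave_by_appearance` is hypothetical and not part of pyjanitor.

```python
import numpy as np
import pandas as pd


def interleave_by_appearance(melted: pd.DataFrame, n_rows: int) -> pd.DataFrame:
    """Reorder a column-major melt result into order of appearance.

    Hypothetical helper: assumes `melted` holds one block of `n_rows`
    consecutive rows per melted column, as `pd.melt` produces.
    """
    n_cols, remainder = divmod(len(melted), n_rows)
    if remainder:
        raise ValueError("len(melted) must be a multiple of n_rows")
    # Lay positions out as a (melted column, original row) grid; reading the
    # grid in Fortran order visits every column for row 0, then row 1, etc.
    order = np.arange(len(melted)).reshape(n_cols, n_rows).ravel(order="F")
    return melted.take(order).reset_index(drop=True)


df = pd.DataFrame(
    {
        "country": ["US", "RU"],
        "vault_2012": [48.1, 46.4],
        "floor_2012": [45.4, 41.6],
    }
)
melted = df.melt(id_vars="country")
result = interleave_by_appearance(melted, n_rows=len(df))
print(result)
```

This avoids a sort entirely: building the position array is linear in the number of output rows, and `take` does a single gather. Whether it actually helps `pivot_longer` depends on where the time is spent, which would need profiling first.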