samukweku commented 6 months ago

PR Description

Please describe the changes proposed in the pull request:

Improve performance when sort_by_appearance is True
Added pivot_longer_spec, which allows unpivoting by hand - this allows more granular control on how the final dataframe should look in long form.

general refactoring


import pandas as pd; import janitor as jn

In [11]: events = pd.DataFrame( ...: { ...: "country": ["United States", "Russia", "China"], ...: "vault_2012_f": [ ...: 48.132, ...: 46.366, ...: 44.266, ...: ], ...: "vault_2012_m": [46.632, 46.866, 48.316], ...: "vault_2016_f": [ ...: 46.866, ...: 45.733, ...: 44.332, ...: ], ...: "vault_2016_m": [45.865, 46.033, 45.0], ...: "floor_2012_f": [45.366, 41.599, 40.833], ...: "floor_2012_m": [45.266, 45.308, 45.133], ...: "floor_2016_f": [45.999, 42.032, 42.066], ...: "floor_2016_m": [43.757, 44.766, 43.799], ...: } ...: )

In [12]: events Out[12]: country vault_2012_f vault_2012_m ... floor_2012_m floor_2016_f floor_2016_m 0 United States 48.132 46.632 ... 45.266 45.999 43.757 1 Russia 46.366 46.866 ... 45.308 42.032 44.766 2 China 44.266 48.316 ... 45.133 42.066 43.799

[3 rows x 9 columns]

events = pd.concat([events]*100_000)

dev

In [848]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], namessep='',sort_by_appearance=False) 62.9 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [849]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], namessep='',sort_by_appearance=True) 165 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

PR

In [842]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], namessep='',sort_by_appearance=False) 53.2 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [843]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], namessep='',sort_by_appearance=True) 48 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Performance test for *lots* of columns (YMMV):
```py
events = pd.DataFrame(
     ...:             {
     ...:                 "country": ["United States", "Russia", "China"],
     ...:                 "vault_2012_f": [
     ...:                     48.132,
     ...:                     46.366,
     ...:                     44.266,
     ...:                 ],
     ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
     ...:                 "vault_2016_f": [
     ...:                     46.866,
     ...:                     45.733,
     ...:                     44.332,
     ...:                 ],
     ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
     ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
     ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
     ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
     ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
     ...:             }
     ...:         )

events
         country  vault_2012_f  vault_2012_m  vault_2016_f  vault_2016_m  floor_2012_f  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632        46.866        45.865        45.366        45.266        45.999        43.757
1         Russia        46.366        46.866        45.733        46.033        41.599        45.308        42.032        44.766
2          China        44.266        48.316        44.332        45.000        40.833        45.133        42.066        43.799

events = events.set_index('country')
events = pd.concat([events.add_suffix(f'_{num}') for num in range(100)],axis=1)
events = pd.concat([events]*10_000)
events = events.reset_index()
In [143]: events.shape
Out[143]: (30000, 801)

# dev 
In [147]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=True)
2.85 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [148]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=False)
687 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR
 In [13]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
    ...: year':int}, sort_by_appearance=True)
420 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'
    ...: year':int}, sort_by_appearance=False)
470 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR resolves #1361 .

PR Checklist

Please ensure that you have done the following:

[x] PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
[x] If you're not on the contributors list, add yourself to AUTHORS.md.
[x] Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
- Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

Building a preview of the docs on Netlify
Automatically linting the code
Making sure the code is documented
Making sure that all tests are passed
Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@ericmjl

ericmjl commented 6 months ago

🚀 Deployed on https://deploy-preview-1362--pyjanitor.netlify.app

ericmjl commented 4 months ago

Ok, I just had a chance to look through the PR. Super high quality work! There was one file that was a tad too long where the implementation happened; I'm going to trust that it works fine. Otherwise, thank you for keeping the code test coverage high, @samukweku!

ericmjl commented 4 months ago

I am going to approve. Please do the honors of merging!

samukweku commented 4 months ago

@ericmjl thanks for the feedback... I have to figure out how to break up such PRs into small chunks

pyjanitor-devs / pyjanitor

[ENH] `pivot_longer_spec` #1362

PR Description

dev

PR

PR Checklist

Automatic checks

Relevant Reviewers