pwwang / datar

A Grammar of Data Manipulation in python
https://pwwang.github.io/datar/
MIT License
267 stars 17 forks

unnest is very slow. #156

Closed antonio-yu closed 1 year ago

antonio-yu commented 1 year ago

unnest takes a very long time compared to the pandas methods. For example:

import numpy as np
import pandas as pd
from datar.all import *

# 200,000 rows, each holding the same six-element list
d = {'name': [['a','b','c','d','e','f']]*200000}
test = pd.DataFrame(d)
test >> unnest(f.name)

# it takes 1m 37.6s

# pandas method
pd.DataFrame(np.concatenate(test.name.values))

# it takes only 2.2s

pwwang commented 1 year ago

How about unchop?

antonio-yu commented 1 year ago

@pwwang unchop is very fast. It takes only 2s.

pwwang commented 1 year ago

You should use unchop in your case. See https://tidyr.tidyverse.org/reference/chop.html#ref-examples

In your case, you are unwrapping Python lists, not tibbles/dfs.

unnest has extra steps to parse tibbles.
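For comparison on the pandas side, here is a minimal sketch, assuming the goal is only to turn list cells into rows. pandas' own `DataFrame.explode` does this directly and keeps the column name. The frame is smaller than the one in the report so it runs quickly, and nothing here says how datar implements unchop or unnest internally.

```python
import numpy as np
import pandas as pd

# Smaller version of the frame from the report: one list-valued column.
test = pd.DataFrame({"name": [["a", "b", "c", "d", "e", "f"]] * 1000})

# pandas' built-in row-wise expansion of list cells.
exploded = test.explode("name", ignore_index=True)

# The np.concatenate approach from the report, with the column name kept.
flat = pd.DataFrame({"name": np.concatenate(test["name"].to_numpy())})

# Both give one row per list element, in the original order.
assert exploded.shape == (6000, 1)
assert exploded["name"].tolist() == flat["name"].tolist()
```

Both routes skip any handling of nested data frames, which is the extra work unnest is described as doing.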

antonio-yu commented 1 year ago

Thanks, @pwwang. I was a little confused about unchop and unnest, since it seems both can produce the same result. unnest returns the same column name, so I don't need to rename it, but unchop returns a sequential column name such as name$x.

I will experiment more to get a feel for the difference.

pwwang commented 1 year ago

Closing it for now.