vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.28k stars 590 forks source link

[BUG-REPORT] Groupby not giving right results #2048

Closed ashsharma96 closed 2 years ago

ashsharma96 commented 2 years ago

I have one sample file in which I have 4 columns named ('tm_cid', 'tm_sid', 'Date1', Freq_shop'). There are total of 14040 rows and unique tm_cid is 9762. I want to groupby on tm_cid and tm_sid and need to get the count of occurrence unique combination of tm_cid and tm_sid. So I've done that like done in below code. I've tried from different codes as written below: sample.csv

df = vaex.read_csv('sample.csv')
df1 =  df.groupby(by=['tm_cid', 'tm_sid'], agg={'Freq_shop' : vaex.agg.count('Freq_shop')})

or

df1 = df.groupby(by=['tm_cid', 'tm_sid']).agg('count')

from this I'm only getting some records. How can I get the tm_cid and tm_sid combo and their counts of occurrence in the sample csv. I'm not getting the right results. Why groupby is not giving proper results. Any Idea?

JovanVeljanoski commented 2 years ago

Why do you think it is giving a wrong result? I can't find evidence of it.. and you explicitly find one example that should exist but it does not?

Just to be clear: if a combination of ('tm_cid', 'tm_sid') does not exist in the data, it will not be automatically added in the grouped dataframe, i.e. it will not have combinations with 0 counts (i.e. it will not auto-fill the Cartesian product between tm_cid and tm_sid with 0s for combinations that do not appear in the data). Maybe this is what you mean?

ashsharma96 commented 2 years ago

Hey @JovanVeljanoski. Thanks for answering in such short time. Basically let me give you an example, If I do this count code with pandas function on that sample.csv then I will do this:

df = pd.read_csv('sample.csv')
df1 = df.groupby(['tm_cid','tm_sid'])['Freq_shop'].count().reset_index(name='Freq_shop')

and after applying this code I'll get the results which is shared as results.csv. results.csv If I check like this after groupby then:

df1.tm_cid.nunique()
9762

But when I want to do the same type of code in vaex like this :

df = vaex.read_csv('sample.csv')
df1 =  df.groupby(by=['tm_cid', 'tm_sid'], agg={'Freq_shop' : vaex.agg.count()})

then I'll get results which is shared as result_vaex.csv in which I'll get: vaex_result.csv

df1.tm_cid.nunique()
86

I want the results like pandas groupby did. But don't know why its not giving results. Combination of tm_cid and tm_sid is there for around 9762 unique tm_cid. I've shared the results of both of the groupby. Pandas is giving right results. @JovanVeljanoski Please look into this.

JovanVeljanoski commented 2 years ago

Hi,

Thanks for raising this. I need to run more tests to understand this better, not much time right now.

But if you want to reproduce the pandas result with vaex, the fastest thing you can do is use assume_sparse=False, i.e. in your original example do:

df = pd.read_csv('sample.csv')
df1 = df.groupby(['tm_cid','tm_sid'], assume_sparse=False)['Freq_shop'].count().reset_index(name='Freq_shop')

And I can confirm in the data you attached the pandas and vaex results match for the code above.

ashsharma96 commented 2 years ago

@JovanVeljanoski Thank you very much for the answer but did you get to know why groupby is showing this behavior.

maartenbreddels commented 2 years ago

@ashsharma96 thanks for the excellent report, making this easy to reproduce. @JovanVeljanoski thanks for digging into this is well.

https://github.com/vaexio/vaex/pull/2065 should fix this 🎉

ashsharma96 commented 2 years ago

@maartenbreddels @JovanVeljanoski Thank you for the acknowledgement. I'll be looking forward for the new update. I'll share more bugs in coming time.