has2k1 / plydata

A grammar for data manipulation in Python
https://plydata.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
275 stars, 11 forks

`arrange` on concat'd dataframes duplicates some rows #22

Closed. georgemarrows closed this issue 4 years ago

georgemarrows commented 4 years ago

Thank you for the speedy fixes for my previous reports.

This had me quite confused:

> import pandas as pd
> from plydata import arrange
>
> df1 = pd.DataFrame([("a", "b"), ("c", "d")], columns=['x', 'y'])
> len(df1)
2

> df2 = pd.DataFrame([("aa", "ba"), ("ca", "da"), ("ea", "fa")], columns=['x', 'y'])
> len(df2)
3

> len(pd.concat([df1, df2]))
5

> len(pd.concat([df1, df2]) >> arrange('x'))
9

Probably because `pd.concat([df1, df2])` has duplicate values in the index:

    x   y
0   a   b
1   c   d
0   aa  ba
1   ca  da
2   ea  fa
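
To show how a duplicated index could produce 9 rows, here is a minimal pandas-only sketch. It assumes, without having checked the plydata source, that the sort computes the ordered index labels and then selects rows by label. With labels 0 and 1 each appearing twice, a label-based lookup returns every matching row for every requested label. Passing `ignore_index=True` to `pd.concat` gives a fresh unique index and sidesteps the problem.

```python
import pandas as pd

df1 = pd.DataFrame([("a", "b"), ("c", "d")], columns=["x", "y"])
df2 = pd.DataFrame([("aa", "ba"), ("ca", "da"), ("ea", "fa")], columns=["x", "y"])

combined = pd.concat([df1, df2])

# Labels 0 and 1 each appear twice, so the index is not unique.
has_unique_index = combined.index.is_unique

# Assumed mechanism (not confirmed against plydata): sort one column,
# then select rows using the sorted index labels.
sorted_labels = combined["x"].sort_values().index.tolist()
by_label = combined.loc[sorted_labels]

# Workaround: ask concat for a fresh 0..n-1 index.
fresh = pd.concat([df1, df2], ignore_index=True)
by_label_fresh = fresh.loc[fresh["x"].sort_values().index.tolist()]

print(has_unique_index)
print(sorted_labels)
print(len(by_label), len(by_label_fresh))
```

Each of the labels 0 and 1 is requested twice and matches two rows each time, which accounts for 8 rows, plus 1 row for label 2.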

Perhaps it's worth adding a permanent `assert len(df_in) == len(df_out)` to `arrange`?
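
A rough sketch of what I mean by that check, written as a decorator so it could wrap any verb that should preserve row count. The names `preserves_row_count`, `sort_by_label` and `sort_safely` are made up for illustration and are not part of plydata; `sort_by_label` only imitates the suspected label-based lookup.

```python
import functools

import pandas as pd


def preserves_row_count(func):
    """Hypothetical guard: fail loudly if a verb changes the number of rows."""
    @functools.wraps(func)
    def wrapper(df, *args, **kwargs):
        result = func(df, *args, **kwargs)
        assert len(result) == len(df), (
            f"{func.__name__} changed row count: {len(df)} -> {len(result)}"
        )
        return result
    return wrapper


@preserves_row_count
def sort_by_label(df, column):
    # Label-based selection: repeats rows when index labels are duplicated.
    return df.loc[df[column].sort_values().index.tolist()]


@preserves_row_count
def sort_safely(df, column):
    # DataFrame.sort_values reorders rows without a label lookup.
    return df.sort_values(column)


df = pd.concat([
    pd.DataFrame({"x": ["a", "c"]}),
    pd.DataFrame({"x": ["aa", "ca", "ea"]}),
])

safe = sort_safely(df, "x")

try:
    sort_by_label(df, "x")
    caught = None
except AssertionError as exc:
    caught = str(exc)
```

Note that plain `assert` statements are skipped when Python runs with `-O`, so a check meant to be truly permanent might be better written as an explicit `if` that raises.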

georgemarrows commented 4 years ago

Thanks for the fix!