kieferk / dfply

dplyr-style piping operations for pandas dataframes
GNU General Public License v3.0

Issue with 'arrange' when df has an index #47

Closed omri374 closed 6 years ago

omri374 commented 6 years ago

Hi, please take a look at the following example:

from dfply import *
utime = pd.DataFrame({"u":1,"eventTime":["01-01-1971 01:04:00","01-01-1971 02:07:00","01-01-1971 01:09:00","01-01-1971 01:10:00"]})
print(utime >> arrange(X.eventTime))

utime = utime.set_index("u")
print(utime >> arrange(X.eventTime))

In the first case, the result is as expected. After introducing an index, the result is incorrect and contains four times as many rows as before.

I'm not sure if it's a bug or expected behavior, as I'm a newbie to pandas and to data frame indices.

Output for the code:

             eventTime  u
0  01-01-1971 01:04:00  1
2  01-01-1971 01:09:00  1
3  01-01-1971 01:10:00  1
1  01-01-1971 02:07:00  1

             eventTime
u
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00

kieferk commented 6 years ago

Good catch! This is in fact a bug. It was happening because I was using the original dataframe's index to sort, then re-indexing with the sorted indices. When there were duplicate indices it would duplicate the rows.

Should be fixed now. I just changed to indexing using .iloc instead.
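For anyone curious about the mechanism, here is a minimal sketch in plain pandas. It is not the actual dfply source, only an illustration of the difference described above, and the variable names (`sorted_labels`, `by_label`, `order`, `by_position`) are made up for the example. Looking rows up by label with `.loc` returns every row matching each label, so a non-unique index multiplies the rows, while positional lookup with `.iloc` does not depend on the index labels at all.

```python
import pandas as pd

# Same data as in the report, with "u" as a non-unique index (every label is 1).
utime = pd.DataFrame({
    "u": 1,
    "eventTime": [
        "01-01-1971 01:04:00",
        "01-01-1971 02:07:00",
        "01-01-1971 01:09:00",
        "01-01-1971 01:10:00",
    ],
}).set_index("u")

# Label-based approach: sort, take the sorted index labels, then look them up.
# Each label 1 matches all four rows, so four lookups return 4 * 4 = 16 rows.
sorted_labels = utime.sort_values("eventTime").index
by_label = utime.loc[sorted_labels]

# Position-based approach: argsort yields integer positions, which are unique
# even when the index labels are not.
order = utime["eventTime"].argsort().to_numpy()
by_position = utime.iloc[order]

print(len(by_label))     # 16
print(len(by_position))  # 4
print(by_position)
```

The positional version keeps the original index labels on the sorted rows, which matches the output shown further down in this thread.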

I tried the same on my machine with the new master branch:

from dfply import *
utime = pd.DataFrame({"u":1,"eventTime":["01-01-1971 01:04:00","01-01-1971 02:07:00","01-01-1971 01:09:00","01-01-1971 01:10:00"]})

print(utime >> arrange(X.eventTime))
             eventTime  u
0  01-01-1971 01:04:00  1
2  01-01-1971 01:09:00  1
3  01-01-1971 01:10:00  1
1  01-01-1971 02:07:00  1

utime = utime.set_index("u")

print(utime >> arrange(X.eventTime))
             eventTime
u                     
1  01-01-1971 01:04:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 02:07:00

This is the behavior you expected. If you pull the master branch and reinstall, it should work.