Closed omri374 closed 6 years ago
Good catch! This is in fact a bug. It happened because I was sorting on the original dataframe's index and then re-indexing with the sorted index labels. When the index contained duplicate labels, each label matched every row that shared it, so the rows were duplicated.
Should be fixed now. I just changed to indexing using .iloc instead.
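To illustrate the mechanism, here is a minimal sketch in plain pandas. It is not dfply's actual code: the variable names are made up for illustration, and the label-based branch is only assumed to resemble the old behavior described above.

```python
import pandas as pd

# Four rows that all share the same index label, as after set_index("u").
df = pd.DataFrame(
    {
        "eventTime": [
            "01-01-1971 01:04:00",
            "01-01-1971 02:07:00",
            "01-01-1971 01:09:00",
            "01-01-1971 01:10:00",
        ]
    },
    index=[1, 1, 1, 1],
)

# Label-based selection (assumed to resemble the old approach):
# every label 1 matches all four rows, so 4 labels x 4 rows are returned.
sorted_labels = df["eventTime"].sort_values().index
by_label = df.loc[sorted_labels]

# Position-based selection: argsort yields integer positions,
# and .iloc picks each row exactly once regardless of its label.
positions = df["eventTime"].argsort().to_numpy()
by_position = df.iloc[positions]

print(len(by_label))     # 16
print(len(by_position))  # 4
```

Positions are unique even when labels are not, which is why selecting with .iloc avoids the duplication.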
I tried the same on my machine with the new master branch:
import pandas as pd
from dfply import *

# One scalar column "u" and four timestamps stored as strings
utime = pd.DataFrame({"u": 1, "eventTime": ["01-01-1971 01:04:00", "01-01-1971 02:07:00", "01-01-1971 01:09:00", "01-01-1971 01:10:00"]})
print(utime >> arrange(X.eventTime))
             eventTime  u
0  01-01-1971 01:04:00  1
2  01-01-1971 01:09:00  1
3  01-01-1971 01:10:00  1
1  01-01-1971 02:07:00  1
utime = utime.set_index("u")
print(utime >> arrange(X.eventTime))
             eventTime
u
1  01-01-1971 01:04:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 02:07:00
This is the behavior you expected. If you pull the master branch and reinstall, it should work.
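As an independent cross-check that does not rely on dfply, the same ordering can be produced with plain pandas. This is only a sketch using `DataFrame.sort_values`; the name `expected` is made up for illustration.

```python
import pandas as pd

# Same data as in the report, with the duplicate-valued column "u" as the index.
utime = pd.DataFrame(
    {
        "u": 1,
        "eventTime": [
            "01-01-1971 01:04:00",
            "01-01-1971 02:07:00",
            "01-01-1971 01:09:00",
            "01-01-1971 01:10:00",
        ],
    }
).set_index("u")

# sort_values reorders rows by value and is unaffected by duplicate index labels.
expected = utime.sort_values("eventTime")
print(expected)
```

The result should have four rows, in the same order as the arrange output above.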
Hi, please take a look at the following example:
from dfply import *
utime = pd.DataFrame({"u": 1, "eventTime": ["01-01-1971 01:04:00", "01-01-1971 02:07:00", "01-01-1971 01:09:00", "01-01-1971 01:10:00"]})
print(utime >> arrange(X.eventTime))
utime = utime.set_index("u")
print(utime >> arrange(X.eventTime))
In the first case, the result is as expected. After setting an index, the result is incorrect: it contains 4 times as many rows as before (16 instead of 4).
I'm not sure whether this is a bug or expected behavior, as I'm new to pandas and to data frame indices.
Output for the code:
             eventTime  u
0  01-01-1971 01:04:00  1
2  01-01-1971 01:09:00  1
3  01-01-1971 01:10:00  1
1  01-01-1971 02:07:00  1
             eventTime
u
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00
1  01-01-1971 01:04:00
1  01-01-1971 02:07:00
1  01-01-1971 01:09:00
1  01-01-1971 01:10:00