machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License

strange behaviour when using filter + if_else #474

Open danielspringt opened 1 year ago

danielspringt commented 1 year ago

Hi - the following example produces strange results:

import siuba as sb
from siuba import _, mutate, count, if_else
from siuba.data import penguins

print(f'initial rows:{penguins.shape[0]}')
dat = penguins >> sb.filter(_.island != "Torgersen") 
print(f'rows after filtering:{dat.shape[0]}')

dat = dat >> mutate(
    binary_col = if_else(_.island == 'Biscoe', 1, 0)
)

dat_count = dat >> count(_.binary_col)
print(dat_count)

I use a filter to drop some of the rows. When I then use mutate on the filtered dataframe, the previously dropped rows somehow still seem to appear in the result.

I would expect a count output like:

   binary_col    n
0         0.0  110
1         1.0  130

but instead the dropped observations appear to get labeled with NaN:

   binary_col    n
0         0.0  110
1         1.0  130
2         NaN   52

What am I doing wrong?

jonesworks commented 1 year ago

Note that after filtering, the index is not reset, so the filtered dataframe keeps its original row labels (with gaps where rows were dropped).

Instead, try this:

dat = (penguins >> sb.filter(_.island != "Torgersen")).reset_index(drop=True)
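
To illustrate why the stale index could matter, here is a minimal pandas-only sketch. It assumes (this is not confirmed in this thread) that the if_else result comes back as a Series with a fresh 0..n-1 index, which pandas then aligns by label against the filtered rows. The toy frame and the names `df`, `filtered` and `flags` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for penguins (hypothetical, not the real dataset)
df = pd.DataFrame({"island": ["Torgersen", "Biscoe", "Dream", "Biscoe"]})

# Boolean filtering keeps the original labels: the index is now [1, 2, 3]
filtered = df[df["island"] != "Torgersen"]

# Assumption: the new column arrives as a Series with a fresh index [0, 1, 2]
flags = pd.Series(np.where(filtered["island"] == "Biscoe", 1, 0))

# Assignment aligns on index labels, not on position:
# label 3 has no match in flags, so it becomes NaN, and the
# matched labels 1 and 2 pick up values meant for other rows
misaligned = filtered.assign(binary_col=flags)
print(misaligned)

# Workaround 1: reset the index first so labels and positions agree
fixed = filtered.reset_index(drop=True).assign(binary_col=flags)
print(fixed)

# Workaround 2: build the column from the filtered Series itself,
# which keeps the original index and needs no reset
direct = filtered.assign(binary_col=(filtered["island"] == "Biscoe").astype(int))
print(direct)
```

If this is what is happening, the NaN rows are not the dropped observations coming back. They are kept rows whose index labels are larger than the number of remaining rows, and the non-NaN counts may be shifted as well.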

I've encountered similar issues in R, where the fix was droplevels().

Also, running this may shed a bit more light on the discrepancy between the output above and the expected output:

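from siuba import _, group_by, count, arrange
from siuba.data import penguins

# Count penguins per island, largest group first
(
    penguins
    >> group_by(_.island)
    >> count()
    >> arrange(-_.n)
)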