strange behaviour when using filter + if_else

machow / siuba

Python library for using dplyr like syntax with pandas and SQL

MIT License

1.14k stars, 48 forks

Hi - the following example produces strange results:

import siuba as sb
from siuba import _, mutate, count, if_else
from siuba.data import penguins

print(f'initial rows:{penguins.shape[0]}')
# drop every penguin observed on Torgersen
dat = penguins >> sb.filter(_.island != "Torgersen")
print(f'rows after filtering:{dat.shape[0]}')

# flag Biscoe rows with 1 and all other remaining rows with 0
dat = dat >> mutate(
    binary_col = if_else(_.island == 'Biscoe', 1, 0)
)

dat_count = dat >> count(_.binary_col)
print(dat_count)

I use a filter to drop some of the rows. When I then use mutate on the filtered dataframe, the previously dropped rows somehow still seem to appear in the result.

I would expect a count output like:

   binary_col    n
0         0.0  110
1         1.0  130

Instead, the dropped observations seem to get labeled with NaN:

   binary_col    n
0         0.0  110
1         1.0  130
2         NaN   52

What am I doing wrong?
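
One possible explanation, which nothing in this report confirms, is index alignment in pandas rather than dropped rows coming back. Filtering keeps the original row labels, so the filtered frame has gaps in its index. If the new column is built with a fresh 0..n-1 index (an assumption about what happens inside if_else here), pandas aligns it on labels during assignment and fills every unmatched label with NaN. The sketch below reproduces that pattern with plain pandas on a made-up six-row frame. The names df, fresh, misaligned and fixed are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the penguins data (made-up rows, not the real dataset).
df = pd.DataFrame(
    {"island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Torgersen", "Dream"]}
)

# Boolean filtering keeps the original row labels: the index is now 1, 2, 3, 5.
filtered = df[df["island"] != "Torgersen"]

# Assumption: the new column arrives as a Series with a fresh 0..3 index.
fresh = pd.Series(np.where(filtered["island"] == "Biscoe", 1, 0))

# Assignment aligns on index labels. Labels 1, 2, 3 find a partner in `fresh`
# (not necessarily the right one); label 5 finds none and becomes NaN.
misaligned = filtered.assign(binary_col=fresh)
print(misaligned)

# Possible workaround: reset the index after filtering so labels are 0..n-1 again.
fixed = filtered.reset_index(drop=True)
fixed = fixed.assign(binary_col=np.where(fixed["island"] == "Biscoe", 1, 0))
print(fixed)
```

In this sketch the row count never changes. Only the values in the new column are misaligned or missing, which would also fit the count output above, where the three groups together cover exactly the rows left after filtering. If the same thing is happening in the siuba pipeline, calling dat.reset_index(drop=True) before the mutate step may avoid the NaN group.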

strange behaviour when using filter + if_else #474