machow / siuba

Python library for using dplyr-like syntax with pandas and SQL
https://siuba.org
MIT License
1.14k stars 48 forks

set operations fail with filter #436

Open DSLituiev opened 2 years ago

DSLituiev commented 2 years ago

I am trying to use siuba for filtering on a set, and it seems to fail badly:

mtcars >> filter(_.cyl in {4, 5}) returns nothing, while mtcars >> filter(_.cyl == 4) works

machow commented 2 years ago

Hey, thanks for raising this. Are you trying to do the equivalent of R's %in%? You can do this using the pandas .isin() method.

from siuba.data import mtcars

mtcars.cyl.isin([4, 5])

So for siuba verbs it would be this:

from siuba import _, filter
from siuba.data import mtcars

mtcars >> filter(_.cyl.isin([4, 5]))

Sorry for the weird R -> Python situation. I'm actively working on pushing out new siuba docs that walk through situations like these numeric Python quirks.

DSLituiev commented 2 years ago

Hi Michael, I'm looking for an equivalent of Python's x in set(...), e.g.:

mtcars.cyl.map(lambda x: x in [4, 5])

This typically works with lambda functions, so I assumed that your package would vectorize it.

machow commented 2 years ago

You could use .map with siuba, but AFAICT that code in pandas will be a slower version of .isin()

mtcars >> filter(_.cyl.map(lambda x: x in [4, 5]))

Does that do what you're looking for? If there's a case where .isin() won't solve your problem, that might help me get a feel for the issue.
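For anyone who wants to confirm that the two spellings select the same rows, here is a minimal sketch in plain pandas. The small `cars` frame is made-up data standing in for mtcars (an assumption, so the snippet runs without siuba installed), and it checks only that the results agree, not how fast each one is.

```python
import pandas as pd

# Made-up data standing in for mtcars; only the cyl column matters here.
cars = pd.DataFrame(
    {"cyl": [4, 6, 8, 4, 6], "mpg": [22.8, 21.0, 18.7, 24.4, 19.2]}
)

# Vectorized membership test.
mask_isin = cars["cyl"].isin([4, 5])

# Element-by-element membership test with a Python lambda.
mask_map = cars["cyl"].map(lambda x: x in [4, 5])

# Both produce a boolean Series with the same values.
same = mask_isin.equals(mask_map)

# Either mask can be used to keep the matching rows.
filtered = cars[mask_isin]
```

Either mask could be passed to a siuba filter in the same way as the `.isin()` and `.map()` calls shown above.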

DSLituiev commented 2 years ago

isin would definitely solve it. I would just naïvely assume that mtcars >> filter(_.cyl in {4, 5}) would do the job. Not that I am unhappy with isin results or performance. It is just less intuitive.
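For context on why the `in` spelling behaves differently from `==`: Python lets an object overload `==` and return anything it likes, but `x in container` is handled by the container, and the result is always coerced to a plain bool. The sketch below illustrates this with a hypothetical `LazyColumn` class. It is a stand-in written for this example, not siuba's actual implementation, and its tuple return values are only placeholders for a deferred expression.

```python
# Hypothetical stand-in for a lazy column expression (not siuba's real class).
class LazyColumn:
    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # `==` may return any object, so it can describe a deferred comparison.
        return ("eq", self.name, other)

    # Defining __eq__ disables hashing by default; restore it so set lookup works.
    __hash__ = object.__hash__

    def isin(self, values):
        # An ordinary method can likewise return a deferred expression.
        return ("isin", self.name, list(values))


col = LazyColumn("cyl")

eq_result = col == 4            # a deferred expression (here, a tuple)
isin_result = col.isin([4, 5])  # also a deferred expression
in_result = col in {4, 5}       # evaluated immediately by the set; always a bool

print(type(eq_result).__name__, type(isin_result).__name__, type(in_result).__name__)
```

Under these assumptions, a verb such as filter would receive a single bool from the `in` form rather than an expression it can evaluate per row. This is presumably why a method like `.isin()` is needed instead.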