pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pd.DataFrame.and_ is not exposed #15070

Open Erotemic opened 7 years ago

Erotemic commented 7 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
data = pd.DataFrame.from_dict({'a': [1, 0, 0, 1], 'b': [0, 0, 1, 1]}, dtype=bool)
flags = pd.Series([1, 0, 0, 1], dtype=bool)

# This raises a warning
data.mul(flags, axis=0)
# /home/joncrall/venv2/local/lib/python2.7/site-packages/pandas/computation/expressions.py:182:
# UserWarning: evaluating in Python space because the '*' operator is not supported by numexpr
# for the bool dtype, use '&' instead

# But we can't do data & flags because we need to specify axis=0; instead we have to do
data.__and__(flags, axis=0)
# Why is there no data.and_(flags, axis=0)?

Problem description

I have a use case where I have a data frame whose rows are samples and whose columns are sample properties. I also have a Series of associated flags that indicate which rows should have all values set to False.

At first I thought I could just use the & operator, but I ran into a broadcasting problem: ValueError: operands could not be broadcast together with shapes (16,) (4,). So I started to look for another way. I don't want to do this in place, so data[flags] = False won't work unless I use copy (but that's ugly).

Instead I wanted to use an and function, but I couldn't find one (I'm still a bit new to the pandas API), so I settled for multiplication. However, I got a warning when I used data.mul. It told me to use &, which tipped me off that there was some way to call an and function with the keyword argument axis=0. I found that data.__and__(flags, axis=0) works.

Why is data.mul exposed as a public version of data.__mul__, while there is no data.and_ counterpart for data.__and__?
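For reference, here is a minimal sketch of one non-inplace way to get the row-wise AND without calling a dunder method. It drops to NumPy and broadcasts the flags as a column vector. It assumes data and flags are in the same row order (no index alignment is done), and the name masked is only for illustration.

```python
import pandas as pd

data = pd.DataFrame({'a': [True, False, False, True],
                     'b': [False, False, True, True]})
flags = pd.Series([True, False, False, True])

# Reshape flags to shape (4, 1) so NumPy broadcasts it across every column.
# Assumption: data and flags share the same row order; indexes are not aligned here.
masked = pd.DataFrame(
    data.to_numpy() & flags.to_numpy()[:, None],
    index=data.index,
    columns=data.columns,
)

print(masked)
```

The original data is left untouched, since the result is built from a new array.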

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-106-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 25.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.0
statsmodels: 0.8.0rc1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.7.1
boto: 2.42.0
pandas_datareader: None

bkandel commented 7 years ago

Seems to me simplest is just:

data.loc[flags, :] = False
bkandel commented 7 years ago

Sorry, I missed that you didn't want to do this inplace. In that case the way to do it is with an apply:

data.apply(lambda x: x & flags, axis=0)
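As a rough sketch, the apply approach can be wrapped in a small helper. The name and_rows is hypothetical and not part of pandas. Each column is ANDed with the flags Series, and pandas aligns the two on the row index.

```python
import pandas as pd

def and_rows(frame, row_flags):
    """Hypothetical helper: AND every column of frame with row_flags.

    Returns a new DataFrame; frame itself is not modified.
    """
    # apply defaults to axis=0, so the lambda receives one column at a time
    return frame.apply(lambda col: col & row_flags)

data = pd.DataFrame({'a': [True, False, False, True],
                     'b': [False, False, True, True]})
flags = pd.Series([True, False, False, True])

result = and_rows(data, flags)
print(result)
```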
Erotemic commented 7 years ago

That makes sense, I like apply as a solution to this problem.


Is there any design-driven reason why methods like and_, or_, and xor are not exposed as public methods of DataFrame? Even though apply is a nice way to do this, I think public versions of these binary operations would be a useful improvement to pandas. Would a PR adding them be welcome?

mroeschke commented 5 years ago

Looks like these work now:

In [25]: pd.__version__
Out[25]: '0.24.0rc1+1.g33f91d8f9'

In [26]: data.mul(flags, axis=0)
Out[26]:
       a      b
0   True  False
1  False  False
2  False  False
3   True   True

In [27]: data.__and__(flags, axis=0)
Out[27]:
       a      b
0   True  False
1  False  False
2  False  False
3   True   True
mroeschke commented 5 years ago

Oh sorry, the request was for DataFrame.and_, DataFrame.or_, etc.