pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

API: disallow div/floordiv/pow operators for BooleanArray ? #41165

Open jorisvandenbossche opened 3 years ago

jorisvandenbossche commented 3 years ago

Currently, for the plain bool dtype we explicitly check for some operations and raise an error, while those actually work in numpy. For example:

>>> arr = np.array([True, False, True])
>>> arr / True
array([1., 0., 1.])

>>> pd.Series(arr) / True
...
NotImplementedError: operator '/' not implemented for bool dtypes

This is done for the division and power operations (not_allowed={"/", "//", "**"}):

https://github.com/pandas-dev/pandas/blob/934cad6ab61b867c6ae54941c5cd87340d44b80a/pandas/core/computation/expressions.py#L215-L218

For the nullable BooleanArray, for now we simply rely on the operations as defined by the underlying numpy bool array:

>>> pd.array(arr) / True
<FloatingArray>
[1.0, 0.0, 1.0]
Length: 3, dtype: Float64

That's for the BooleanArray itself. The check is currently done at the "array_op" level, but because it lives in expressions.py, it is not run for EAs (xref https://github.com/pandas-dev/pandas/pull/41161).

So questions:

dsaxton commented 3 years ago

@jorisvandenbossche What's the argument for disallowing at the moment? To me it seems more natural / Pythonic to allow this operation since booleans are essentially ints.
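For reference, the "booleans are essentially ints" point can be checked in plain Python, where `bool` is a subclass of `int`. This is a small illustration of the Python behavior only (the variable names are arbitrary) and says nothing about what pandas should do.

```python
# bool is a subclass of int in Python, so arithmetic falls through to int.
is_int_subclass = issubclass(bool, int)

quotient = True / True   # true division yields a float
floor_q = True // True   # floor division yields an int
power = True ** False    # anything to the power 0 is 1
total = True + True      # integer addition, not logical OR
```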

jorisvandenbossche commented 3 years ago

I am not sure what the historic reasons are. Maybe because those operations were regarded as not that useful (although I would say it's up to the user to decide that), or as potentially confusing because they don't have a "boolean" interpretation, but only a numerical one (e.g. + and * are still boolean operations resulting in booleans, and - raises an error in numpy about not being supported for booleans). It's only for division and power that the booleans are interpreted as numeric values (and currently in pandas the user needs to be explicit about such casting).
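The numpy behavior described here can be reproduced with a short script. This is only a sketch using two arbitrary example arrays, `a` and `b`; it shows which operators keep a boolean result and which ones treat the operands as numbers.

```python
import numpy as np

a = np.array([True, False, True])
b = np.array([True, True, False])

added = a + b        # boolean result (behaves like logical OR)
multiplied = a * b   # boolean result (behaves like logical AND)

try:
    a - b            # numpy rejects subtraction of bool arrays
    subtraction_error = None
except TypeError as exc:
    subtraction_error = exc

divided = a / np.array([True, True, True])  # operands treated as numbers
powered = a ** b                            # operands treated as numbers
```

Only the last two results leave the boolean domain, which matches the set of operators that pandas currently singles out for the plain bool dtype.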

dsaxton commented 3 years ago

@jorisvandenbossche Yeah, I can see the argument of "why would anyone do this?" but if it's easy to allow and makes for greater consistency with numpy and Python more generally, I personally like the idea of allowing this rather than making a special case here.

jbrockmendel commented 3 years ago

Conditional on the Series behavior (which I wouldn't object to deprecating), I lean towards having BooleanArray behave like Series, i.e. raising here.

jbrockmendel commented 2 years ago

Current thought here: BooleanArray (and IntegerArray and FloatingArray) ops should be wrappers around their core.ops.array_ops counterparts. This will allow for pushing more logic down from BooleanArray/NumericArray into BaseMaskedArray, which in turn will make it easier to extend BaseMaskedArray to wrap arbitrary dtypes.
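A rough sketch of that layering, under heavy assumptions: `TinyMaskedArray` and `plain_array_op` are hypothetical names invented for this example, and `plain_array_op` merely stands in for a shared array-level op by deferring to numpy. It is not how BaseMaskedArray is implemented; it only illustrates "compute on the raw values through one shared function, then propagate the mask".

```python
import operator

import numpy as np


def plain_array_op(left, right, op):
    """Stand-in for a shared array-level op; here it just defers to numpy."""
    return op(left, right)


class TinyMaskedArray:
    """Hypothetical masked container: values plus a boolean 'missing' mask."""

    def __init__(self, data, mask):
        self._data = np.asarray(data)
        self._mask = np.asarray(mask, dtype=bool)

    def _arith(self, other, op):
        if isinstance(other, TinyMaskedArray):
            other_data, other_mask = other._data, other._mask
        else:
            other_data, other_mask = other, False
        # Compute on the raw values, then propagate missingness.
        result = plain_array_op(self._data, other_data, op)
        return TinyMaskedArray(result, self._mask | other_mask)

    def __truediv__(self, other):
        return self._arith(other, operator.truediv)


arr = TinyMaskedArray([True, False, True], [False, False, True])
out = arr / True  # values come from the shared op, mask is carried over
```

With this shape, any check placed inside the shared function (such as rejecting `/` for bool operands) would apply to the masked wrapper automatically, so the two code paths could not drift apart.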

wence- commented 1 year ago

> or as potentially confusing because they don't have a "boolean" interpretation

FWIW, booleans are a canonical identification for GF(2) so I would argue that these operations are all well-defined and have a unique interpretation.

jbrockmendel commented 1 year ago

> GF(2)

We don't have an official policy on this, but in general pandas is more averse to overflows than numpy, which corresponds to not treating arithmetic as modular.
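To illustrate what "modular" means in this exchange: in GF(2), addition is XOR (so 1 + 1 wraps around to 0) and multiplication is AND. The sketch below uses hypothetical helpers `gf2_add` and `gf2_mul` to contrast that with numpy's `+` on bool arrays, which behaves like logical OR and therefore does not wrap.

```python
import numpy as np


def gf2_add(x, y):
    """Addition in GF(2): 1 + 1 wraps around to 0, i.e. XOR."""
    return np.logical_xor(x, y)


def gf2_mul(x, y):
    """Multiplication in GF(2): logical AND."""
    return np.logical_and(x, y)


a = np.array([True, True, False])
b = np.array([True, False, False])

gf2_sum = gf2_add(a, b)   # modular: True + True gives False
numpy_sum = a + b         # numpy's bool '+' is logical OR: True + True gives True
gf2_prod = gf2_mul(a, b)  # agrees with numpy's bool '*'
```

The two sums differ exactly where both inputs are True, which is the wrap-around case that a non-modular reading of arithmetic would avoid.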