pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

cannot do arithmetic with series of dtype: Boolean and argument of type: int #12235

Open zfaee opened 1 year ago

zfaee commented 1 year ago

Reproducible example

import polars as pl

s = pl.Series([True])
print(s.to_frame())
op_names = (
    '__add__', '__radd__', '__pos__',
    '__sub__', '__rsub__', '__neg__',
    '__mul__', '__rmul__',
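    '__truediv__', '__rtruediv__',
    '__floordiv__', '__rfloordiv__')
for op_name in op_names:
    try:
        # Note: the unary __pos__ and __neg__ take no argument, so passing 1
        # raises TypeError for them regardless of dtype.
        getattr(s, op_name)(1)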
    except TypeError:
        print('TypeError: ', op_name)

Log output

shape: (1, 1)
┌──────┐
│      │
│ ---  │
│ bool │
╞══════╡
│ true │
└──────┘
TypeError:  __add__
TypeError:  __radd__
TypeError:  __pos__
TypeError:  __sub__
TypeError:  __rsub__
TypeError:  __neg__
TypeError:  __mul__
TypeError:  __rmul__
TypeError:  __rfloordiv__

Issue description

Arithmetic between a Series of dtype Boolean and an argument of type int raises a TypeError, e.g. s + 1 or 1 * s.

Expected behavior

These operations are expected to be supported.

Installed versions

```
--------Version info---------
Polars: 0.19.12
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul 5 2023, 13:38:37) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib:
numpy: 1.25.2
openpyxl:
pandas: 2.0.3
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

alexander-beedie commented 1 year ago

Why would you expect to be able to? True / 2, False - 8 , or 10 // -True don't mean anything, and if you explicitly want to treat True/False as 1/0 integers then you can just cast to make the intent clear. Otherwise automatic casting here seems much more likely to allow for the introduction of bugs (given that most math ops on bool are invalid) ;)

zfaee commented 1 year ago

> Why would you expect to be able to? True / 2, False - 8, or 10 // -True don't mean anything, and if you explicitly want to treat True/False as 1/0 integers then you can just cast to make the intent clear. Otherwise automatic casting here seems much more likely to allow for the introduction of bugs (given that most math ops on bool are invalid) ;)

Just run:

import polars as pl

print(True / 2, False - 8, 10 // -True)

df = pl.DataFrame({'s': [True]})
s = pl.col('s')
print(df.with_columns(s_true_divide=s / 2, s_sub=s - 8, s_floordiv=s // 10))

Output:

0.5 -8 -10
shape: (1, 4)
┌──────┬───────────────┬───────┬────────────┐
│ s    ┆ s_true_divide ┆ s_sub ┆ s_floordiv │
│ ---  ┆ ---           ┆ ---   ┆ ---        │
│ bool ┆ f64           ┆ i32   ┆ i32        │
╞══════╪═══════════════╪═══════╪════════════╡
│ true ┆ 0.5           ┆ -7    ┆ 0          │
└──────┴───────────────┴───────┴────────────┘

so we can see that these are effective!

alexander-beedie commented 1 year ago

> so we can see that these are effective!

That is because (in Python) booleans are a subtype of integer. However... none of those results appear to mean anything ;)
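
The subtype relationship can be checked directly in plain Python. This is a small illustrative sketch; the `results` dictionary is a name introduced here, and the expressions are the ones discussed in this thread.

```python
# bool is a subclass of int, so True and False behave as 1 and 0 in arithmetic.
is_subtype = issubclass(bool, int)

results = {
    "True / 2": True / 2,        # true division gives a float
    "False - 8": False - 8,      # 0 - 8
    "10 // -True": 10 // -True,  # 10 // -1
    "True + 1": True + 1,        # 1 + 1
}

print(is_subtype)
for expr, value in results.items():
    print(f"{expr} = {value!r}")
```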

zfaee commented 1 year ago

> > so we can see that these are effective!
>
> That is because in Python booleans are a subtype of integer. However... none of those results appear to mean anything ;)

This behavior of Series is inconsistent with DataFrame, native Python, and NumPy, which will confuse developers.

Moreover, a Series of dtype Boolean supports arithmetic with a float, so why not with an int?
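
For comparison, a short sketch of how NumPy treats a Boolean array combined with Python scalars. The array values are illustrative assumptions, and the exact integer dtype of the results can vary by platform and NumPy version, so only the values are printed.

```python
import numpy as np

# Illustrative Boolean array; the values are assumptions.
flags = np.array([True, False])

added = flags + 1       # Boolean array combined with an int scalar
subtracted = flags - 8  # True -> 1, False -> 0 before subtracting
divided = flags / 2     # true division yields floats

print(added.tolist(), subtracted.tolist(), divided.tolist())
```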

stinodego commented 1 year ago

~Supertypes have not been defined for Boolean and signed integers. I believe that's an oversight, and they have been added for the reverse, as well as for unsigned integers.~

The issue seems to be the implementation of arithmetic on Series. There is actually a test for this behavior that mentions "do we want this to work?": https://github.com/pola-rs/polars/blob/228c89a3b998a27d873110cd9462b45175a905d0/py-polars/tests/unit/series/test_series.py#L1603-L1637

I would say yes. The following also works just fine:

s1 = pl.Series([True, False])
s2 = pl.Series([5])

print(s1 + s2)
shape: (2,)
Series: '' [i64]
[
        6
        5
]

So Series.__add__ and the other arithmetic methods need a fix, as does Series._arithmetic.

alexander-beedie commented 1 year ago

I'd have said "no" and we should error elsewhere (because True + 8 or False * 0.12345 don't make any sense) but... perhaps this is the path of least resistance if it's already working in other contexts 😜