pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

DataFrame and Series addition coerces non-String columns to String #14088

Open Wainberg opened 9 months ago

Wainberg commented 9 months ago

Reproducible example

>>> import polars as pl
>>> pl.Series([1, 2, 3]) + pl.Series(['1', '2', '3'])
shape: (3,)
Series: '' [str]
[
        "11"
        "22"
        "33"
]
>>> pl.DataFrame([1, 2, 3]) + pl.DataFrame(['1', '2', '3'])
shape: (3, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ str      │
╞══════════╡
│ 11       │
│ 22       │
│ 33       │
└──────────┘
>>> pl.Series([False, True, False]) + pl.Series(['1', '2', '3'])
shape: (3,)
Series: '' [str]
[
        "false1"
        "true2"
        "false3"
]
>>> pl.DataFrame([False, True, False]) + pl.DataFrame(['1', '2', '3'])
shape: (3, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ str      │
╞══════════╡
│ false1   │
│ true2    │
│ false3   │
└──────────┘
>>> pl.Series([1., 2., 3.]) + pl.Series(['1', '2', '3'])
shape: (3,)
Series: '' [str]
[
        "1.01"
        "2.02"
        "3.03"
]
>>> pl.DataFrame([1., 2., 3.]) + pl.DataFrame(['1', '2', '3'])
shape: (3, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ str      │
╞══════════╡
│ 1.01     │
│ 2.02     │
│ 3.03     │
└──────────┘

Issue description

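Adding a String Series or DataFrame to a non-String one (integer, boolean, or float in the examples above) silently casts the non-String operand to String and concatenates the values, instead of raising an error.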

Expected behavior

All of these should give an error.

Installed versions

```
--------Version info---------
Polars:               0.20.5
Index type:           UInt32
Platform:             Linux-4.4.0-22621-Microsoft-x86_64-with-glibc2.35
Python:               3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             3.1.2
pandas:               2.2.0
pyarrow:              14.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9
```

Wainberg commented 9 months ago

Coercion to string causes issues beyond just addition: for instance, this error message incorrectly says the types are str and str rather than int and str (or should it say Int64 and String?):

>>> pl.DataFrame({'a': [1, 2, 3]}) * pl.DataFrame({'a': ['1', '2', '3']})
thread '<unnamed>' panicked at crates/polars-core/src/series/arithmetic/borrowed.rs:470:44:
data types don't match: InvalidOperation(ErrString("mul operation not supported for dtypes `str` and `str`"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/dataframe/frame.py", line 1534, in __mul__
    return self._from_pydf(self._df.mul_df(other._df))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: data types don't match: InvalidOperation(ErrString("mul operation not supported for dtypes `str` and `str`"))