pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Fully support NumPy arrays in DataFrame and Series operators #14077

Open Wainberg opened 9 months ago

Wainberg commented 9 months ago

Reproducible example

  1. DataFrame + 1D array:
>>> pl.DataFrame([1, 2, 3]) + np.array([1, 2, 3])
thread '<unnamed>' panicked at crates/polars-core/src/series/series_trait.rs:147:13:
`add` operation not supported for dtype `list[i64]`
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/py-polars/polars/dataframe/frame.py", line 1550, in __add__
    return self._from_pydf(self._df.add(other._s))
                           ^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: `add` operation not supported for dtype `list[i64]`
  2. DataFrame + 2D array:
>>> pl.DataFrame([1, 2, 3]) + np.array([[1, 2, 3]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/py-polars/polars/dataframe/frame.py", line 1549, in __add__
    other = _prepare_other_arg(other)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/dataframe/frame.py", line 10959, in _prepare_other_arg
    other = pl.Series("", [other])
            ^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/series/series.py", line 298, in __init__
    self._s = sequence_to_pyseries(
              ^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/utils/_construction.py", line 588, in sequence_to_pyseries
    dtype = numpy_char_code_to_dtype(np.dtype(python_dtype).char)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/datatypes/convert.py", line 478, in numpy_char_code_to_dtype
    raise ValueError(msg) from None
ValueError: cannot parse numpy data type dtype('O') into Polars data type
  3. DataFrame + 0D array:
>>> pl.DataFrame([1, 2, 3]) + np.array(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/py-polars/polars/dataframe/frame.py", line 1549, in __add__
    other = _prepare_other_arg(other)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/dataframe/frame.py", line 10959, in _prepare_other_arg
    other = pl.Series("", [other])
            ^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/series/series.py", line 298, in __init__
    self._s = sequence_to_pyseries(
              ^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/utils/_construction.py", line 588, in sequence_to_pyseries
    dtype = numpy_char_code_to_dtype(np.dtype(python_dtype).char)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/datatypes/convert.py", line 478, in numpy_char_code_to_dtype
    raise ValueError(msg) from None
ValueError: cannot parse numpy data type dtype('O') into Polars data type
  4. Series + 2D array:
>>> pl.Series([1, 2, 3]) + np.array([[1, 2, 3]])
thread '<unnamed>' panicked at crates/polars-core/src/series/series_trait.rs:147:13:
`add` operation not supported for dtype `list[i64]`
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/py-polars/polars/series/series.py", line 1028, in __add__
    return self._arithmetic(other, "add", "add_<>")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/series/series.py", line 968, in _arithmetic
    return self._from_pyseries(getattr(self._s, op_s)(Series(other)._s))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: `add` operation not supported for dtype `list[i64]`
  5. Series + 0D array:
>>> pl.Series([1, 2, 3]) + np.array(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/py-polars/polars/series/series.py", line 1028, in __add__
    return self._arithmetic(other, "add", "add_<>")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/series/series.py", line 968, in _arithmetic
    return self._from_pyseries(getattr(self._s, op_s)(Series(other)._s))
                                                      ^^^^^^^^^^^^^
  File "polars/py-polars/polars/series/series.py", line 308, in __init__
    self._s = numpy_to_pyseries(
              ^^^^^^^^^^^^^^^^^^
  File "polars/py-polars/polars/utils/_construction.py", line 256, in numpy_to_pyseries
    return PySeries.new_object(name, values, strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument 'val': iteration over a 0-d array

Log output

No response

Issue description

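Adding a NumPy array to a pl.DataFrame or pl.Series fails in five cases: DataFrame + 0D, 1D or 2D array, and Series + 0D or 2D array. Depending on the case, the failure is a Rust panic, a ValueError or a TypeError.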

Expected behavior

All of these should be allowed.

Installed versions

```
--------Version info---------
Polars: 0.20.5
Index type: UInt32
Platform: Linux-4.4.0-22621-Microsoft-x86_64-with-glibc2.35
Python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:03:24) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx:
deltalake: 0.14.0
fsspec: 2023.12.2
gevent: 23.9.1
hvplot: 0.9.1
matplotlib: 3.8.2
numpy: 1.26.2
openpyxl:
pandas: 2.1.4
pyarrow: 14.0.2
pydantic: 2.5.3
pyiceberg: 0.5.1
pyxlsb:
sqlalchemy: 2.0.23
xlsx2csv:
xlsxwriter: 3.1.9
```

alexander-beedie commented 9 months ago

This appears to be a feature request for additional numpy interop rather than a bug ;)

Wainberg commented 9 months ago

Touché ;)

But honestly, I'd still classify it as a bug because this works:

>>> pl.Series([1, 2, 3]) + np.array([1, 2, 3])
shape: (3,)
Series: '' [i64]
[
        2
        4
        6
]

and all operations with NumPy array + Series/DataFrame work (by auto-converting the RHS to a NumPy array). So it's really just these 5 cases that give an error when they shouldn't.

Wainberg commented 9 months ago

The way I implemented this in https://github.com/pola-rs/polars/pull/12426 is by editing _prepare_other_arg, as discussed here: https://github.com/pola-rs/polars/issues/14080.
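For readers who have not opened the PR, here is a rough sketch of the kind of operand normalization that a function like _prepare_other_arg could perform before wrapping the value. It is not the PR's code: the helper name normalize_numpy_operand is hypothetical, and splitting a 2D array into columns is only an assumption about the desired semantics.

```python
import numpy as np


def normalize_numpy_operand(other):
    """Hypothetical helper: simplify a NumPy operand by dimensionality."""
    if not isinstance(other, np.ndarray):
        # Not a NumPy array: leave it for the existing code path.
        return other
    if other.ndim == 0:
        # 0D array: unwrap to a plain Python scalar.
        return other.item()
    if other.ndim == 1:
        # 1D array: already maps onto a single column.
        return other
    if other.ndim == 2:
        # 2D array: split into one 1D array per column (assumed semantics).
        return [other[:, j] for j in range(other.shape[1])]
    raise ValueError(f"cannot use a {other.ndim}D array as an operand")
```

The point is only that the array's ndim is inspected before anything is wrapped in a list, which is the step that produces the dtype('O') errors in the tracebacks above.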

Wainberg commented 9 months ago

Note that operations involving NumPy arrays could be made commutative (so that np.array + pl.DataFrame gives a pl.DataFrame rather than an np.array, to match pl.DataFrame + np.array) by overriding __array_ufunc__, as implemented in https://github.com/pola-rs/polars/pull/12426:

For Series:

_operator_ufuncs: ClassVar[dict[np.ufunc, tuple[str, str]]] = {
    np.equal: ("__eq__", "__eq__"),
    np.not_equal: ("__ne__", "__ne__"),
    np.greater: ("__gt__", "__lt__"),
    np.greater_equal: ("__ge__", "__le__"),
    np.less: ("__lt__", "__gt__"),
    np.less_equal: ("__le__", "__ge__"),
    np.add: ("__add__", "__radd__"),
    np.subtract: ("__sub__", "__rsub__"),
    np.multiply: ("__mul__", "__rmul__"),
    np.divide: ("__truediv__", "__rtruediv__"),
    np.true_divide: ("__truediv__", "__rtruediv__"),
    np.floor_divide: ("__floordiv__", "__rfloordiv__"),
    np.power: ("__pow__", "__rpow__"),
    np.remainder: ("__mod__", "__rmod__"),
    np.mod: ("__mod__", "__rmod__"),
    np.bitwise_and: ("__and__", "__rand__"),
    np.bitwise_or: ("__or__", "__ror__"),
    np.bitwise_xor: ("__xor__", "__rxor__"),
    np.matmul: ("__matmul__", "__rmatmul__"),
}

def __array_ufunc__(
    self,
    ufunc: np.ufunc,
    method: Literal[
        "__call__", "reduce", "reduceat", "accumulate", "outer", "inner"
    ],
    *inputs: Any,
    **kwargs: Any,
) -> Series:
    """Numpy universal functions."""
    if self._s.n_chunks() > 1:
        self._s.rechunk(in_place=True)

    s = self._s

    if method == "__call__":
        # For ufuncs that correspond to operators, delegate to the polars
        # implementation of those operators. This ensures operators are
        # commutative, i.e. that they have the same behavior regardless of
        # whether the NumPy array is the left-hand or the right-hand
        # operand. It also ensures correct broadcasting of 2D NumPy arrays
        # with polars Series.
        if ufunc in self._operator_ufuncs:
            if self is inputs[0]:
                # self is left-hand argument
                return getattr(self, self._operator_ufuncs[ufunc][0])(inputs[1])
            else:
                # self is right-hand argument
                return getattr(self, self._operator_ufuncs[ufunc][1])(inputs[0])
    ...

and for DataFrame:

    def __array_ufunc__(
        self,
        ufunc: np.ufunc,
        method: Literal[
            "__call__", "reduce", "reduceat", "accumulate", "outer", "inner"
        ],
        *inputs: Any,
        **kwargs: Any,
    ) -> Self:
        """Numpy universal functions."""
        if method == "__call__" and ufunc in self._operator_ufuncs:
            # For ufuncs that correspond to operators, delegate to the polars
            # implementation of those operators. This ensures operators are
            # commutative, i.e. that they have the same behavior regardless of
            # whether the NumPy array is the left-hand or the right-hand
            # operand.
            if self is inputs[0]:
                # self is left-hand argument
                return getattr(self, self._operator_ufuncs[ufunc][0])(inputs[1])
            else:
                # self is right-hand argument
                return getattr(self, self._operator_ufuncs[ufunc][1])(inputs[0])
        else:
            # Just do the default thing: call the ufunc. Call __array__() on
            # each argument first to avoid infinite recursion - see
            # github.com/numpy/numpy/issues/9079#issuecomment-300279535.
            return getattr(ufunc, method)(
                *(arr.__array__() for arr in inputs), **kwargs
            )
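
To see the dispatch in isolation, here is a minimal NumPy-only sketch of the same idea. ToySeries is a hypothetical stand-in class, not polars code, and its operator table is cut down to three entries for brevity.

```python
import numpy as np


class ToySeries:
    """Hypothetical stand-in for a Series; wraps a 1D NumPy array."""

    # ufunc -> (method when self is on the left, method when self is on the right)
    _operator_ufuncs = {
        np.add: ("__add__", "__radd__"),
        np.subtract: ("__sub__", "__rsub__"),
        np.greater: ("__gt__", "__lt__"),
    }

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        return self.values if dtype is None else self.values.astype(dtype)

    def __add__(self, other):
        return ToySeries(self.values + np.asarray(other))

    __radd__ = __add__

    def __sub__(self, other):
        return ToySeries(self.values - np.asarray(other))

    def __rsub__(self, other):
        return ToySeries(np.asarray(other) - self.values)

    def __gt__(self, other):
        return ToySeries(self.values > np.asarray(other))

    def __lt__(self, other):
        return ToySeries(self.values < np.asarray(other))

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method == "__call__" and ufunc in self._operator_ufuncs:
            forward, reflected = self._operator_ufuncs[ufunc]
            if self is inputs[0]:
                # self is the left-hand operand
                return getattr(self, forward)(inputs[1])
            # self is the right-hand operand
            return getattr(self, reflected)(inputs[0])
        # Any other ufunc: run it on the unwrapped arrays to avoid recursion.
        arrays = [x.values if isinstance(x, ToySeries) else x for x in inputs]
        return getattr(ufunc, method)(*arrays, **kwargs)


left = ToySeries([1, 2, 3]) + np.array([10, 20, 30])
right = np.array([10, 20, 30]) + ToySeries([1, 2, 3])
diff = np.array([10, 20, 30]) - ToySeries([1, 2, 3])
```

With this override in place, both left and right are ToySeries objects, and diff goes through __rsub__, so operand order is preserved for non-commutative operators.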