pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cannot `dot` two `Array(float64, N)` columns #17456

Open rben01 opened 3 months ago

rben01 commented 3 months ago

Reproducible example

import polars as pl
import numpy as np

df = pl.DataFrame(
    {"a": [np.arange(3) for _ in range(2)], "b": [np.arange(3) + 10 for _ in range(2)]},
    schema={"a": pl.Array(pl.Float64, 3), "b": pl.Array(pl.Float64, 3)},
)

df.with_columns(pl.col("a").dot(pl.col("b")))

Log output

Traceback (most recent call last):
  File "/Users/rben01/file.py", line 24, in <module>
    df.with_columns(pl.col("a").dot(pl.col("b")))
  File "/Users/rben01/Library/Caches/pypoetry/virtualenvs/base-wTklQcO3-py3.12/lib/python3.12/site-packages/polars/dataframe/frame.py", line 8763, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rben01/Library/Caches/pypoetry/virtualenvs/base-wTklQcO3-py3.12/lib/python3.12/site-packages/polars/lazyframe/frame.py", line 1942, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: `sum` operation not supported for dtype `array[f64, 3]`

Issue description

I guess that after computing the elementwise product, Polars fails to take the sum of the resulting array?

Expected behavior

0 * 10 + 1 * 11 + 2 * 12 = 35
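As a quick sanity check of the arithmetic above, here is a plain-Python sketch of the per-row inner product. It does not use Polars, and the names `a`, `b`, and `expected` are purely illustrative.

```python
# One row of each column from the reproducible example
a = [0.0, 1.0, 2.0]
b = [10.0, 11.0, 12.0]

# Row-wise inner product: multiply elementwise, then sum the products
expected = sum(x * y for x, y in zip(a, b))
print(expected)  # 35.0
```

Since both rows of the example frame are identical, the expected result is 35 for every row.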

Installed versions

```
--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                2.0.0
openpyxl:             3.1.5
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:
pyiceberg:
sqlalchemy:           2.0.31
torch:
xlsx2csv:
xlsxwriter:
```

ritchie46 commented 3 months ago

More of a feature request. We don't support that yet, the array type is rather new.

rben01 commented 2 months ago

Thanks. Is there any way to compute a row-wise inner product on List or Array columns? I've tried manually zipping by using indices as shown below, but I get ComputeError: named columns are not allowed in `list.eval`

df.with_columns(
    pl.int_ranges(pl.col("a").list.len()).list.eval(
        (pl.col("a").list.get(pl.element()) * pl.col("b").list.get(pl.element())).sum()
    )
)