pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Arithmetic with nested arrays gives wrong results #17820

Open itamarst opened 1 month ago

itamarst commented 1 month ago

Checks

Reproducible example

import polars as pl
import numpy as np

print("This is correct result:")
nested = pl.Series("a", np.array([[1, 2], [3, 4]]))
print(nested + nested)

print("This is wrong result:")
nested2 = pl.Series("a", np.array([[[1, 2]], [[3, 4]]]))
print(nested2 + nested2)

The output:

This is correct result:
shape: (2,)
Series: 'a' [array[i64, 2]]
[
        [2, 4]
        [6, 8]
]
This is wrong result:
shape: (4,)
Series: 'a' [array[i64, 1]]
[
        [2]
        [4]
        [6]
        [8]
]

Log output

No response

Issue description

Array addition, which was added recently, works correctly for a single-level array such as pl.Array(pl.Int64, 2). With nested arrays, however, it gives the wrong result.

The cause is that array arithmetic relies on get_leaf_array() and then doesn't restore the result to the correct shape.
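
The shape problem can be modeled in plain NumPy. This is only a sketch of the mechanism, not the Polars implementation: it assumes the leaf values form a flat 1-D buffer, and that the wrong output comes from regrouping that buffer with only the outer width (1) instead of the full per-row shape (1, 2). The variable names are illustrative.

```python
import numpy as np

# Same data as nested2 in the reproducible example: 2 rows, each of shape (1, 2).
values = np.array([[[1, 2]], [[3, 4]]])

# Flatten to the innermost ("leaf") values and do the arithmetic there.
leaf = values.reshape(-1)
summed = leaf + leaf

# Assumed failure mode: regroup using only the outer width (1).
# This yields 4 rows of width 1, matching the wrong output above.
wrong = summed.reshape(-1, values.shape[1])

# Restoring the full original shape keeps 2 rows of shape (1, 2).
right = summed.reshape(values.shape)

print(wrong.shape)     # (4, 1)
print(right.tolist())  # [[[2, 4]], [[6, 8]]]
```

The arithmetic on the leaf values is correct in both cases; only the regrouping step differs.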

Expected behavior

The expected result for nested2 + nested2 is:

shape: (2,)
Series: 'a' [array[i64, (1, 2)]]
[
        [[2, 4]]
        [[6, 8]]
]

Installed versions

```
--------Version info---------
Polars: 1.2.1
Index type: UInt32
Platform: Linux-6.9.3-76060903-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager: 1.0.0
cloudpickle: 3.0.0
connectorx: 0.3.3
deltalake: 0.17.4
fastexcel: 0.10.4
fsspec: 2024.5.0
gevent: 24.2.1
great_tables:
hvplot: 0.10.0
matplotlib: 3.9.0
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl: 3.1.2
pandas: 2.2.2
pyarrow: 16.1.0
pydantic: 2.7.1
pyiceberg: 0.6.1
sqlalchemy: 2.0.30
torch: 2.3.0+cpu
xlsx2csv: 0.8.2
xlsxwriter: 3.2.0
```

ritchie46 commented 1 month ago

Ai, can you take this one, @itamarst? It needs some more love, as it also doesn't broadcast yet.

itamarst commented 1 month ago

I'm definitely going to look into broadcasting next, and at minimum I can make this give a nice error message instead of wrong results.