pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`to_numpy()` weird behaviour with `map_elements` and `pl.List` polars type #16897

Open bellati opened 1 month ago

bellati commented 1 month ago


Reproducible example

import polars as pl
import numpy as np

input_df = pl.DataFrame({
        "index": [0, 0, 0, 1, 1],
        "vector": [
            [1, 3],
            [5, 2],
            [10, 7],
            [9, 3],
            [3, 2]
        ]
    },
    schema={
        "index": pl.Int32,
        "vector": pl.List(pl.Float64)
    })

def mean_array(x):
    return pl.Series(np.mean(x.to_numpy(), axis=0))

def mean_array_loop(colname):
    return pl.col(colname).map_elements(mean_array, return_dtype=pl.List(pl.Float64))

def get_output_df():
    return input_df.group_by(
        "index"
    ).agg(
        vectors=pl.col("vector")
    ).with_columns(
        mean=mean_array_loop("vectors")
    )

get_output_df()
# ┌───────┬─────────────────────────────────┬─────────────────┐
# │ index ┆ vectors                         ┆ mean            │
# │ ---   ┆ ---                             ┆ ---             │
# │ i32   ┆ list[list[f64]]                 ┆ list[f64]       │
# ╞═══════╪═════════════════════════════════╪═════════════════╡
# │ 0     ┆ [[1.0, 3.0], [5.0, 2.0], [10.0… ┆ [5.333333, 4.0] │
# │ 1     ┆ [[9.0, 3.0], [3.0, 2.0]]        ┆ [3.0, 2.5]      │ <---- NOTE 3.0
# └───────┴─────────────────────────────────┴─────────────────┘

get_output_df()
# ┌───────┬─────────────────────────────────┬──────────────────────┐
# │ index ┆ vectors                         ┆ mean                 │
# │ ---   ┆ ---                             ┆ ---                  │
# │ i32   ┆ list[list[f64]]                 ┆ list[f64]            │
# ╞═══════╪═════════════════════════════════╪══════════════════════╡
# │ 1     ┆ [[9.0, 3.0], [3.0, 2.0]]        ┆ [6.0, 2.5]           │ <---- NOTE 6.0
# │ 0     ┆ [[1.0, 3.0], [5.0, 2.0], [10.0… ┆ [4.333333, 2.666667] │
# └───────┴─────────────────────────────────┴──────────────────────┘

Log output

No response

Issue description

What was found out?

There seems to be a problem in version 0.20.31 (but not in version 0.20.2) where calls to to_numpy() return incorrect data from the original Series, so the aggregation I need produces wrong results. This only happens when the input_df vector schema is a pl.List; the issue goes away if a pl.Array(pl.Float64, 2) is used instead of a list.

The behaviour is also non-deterministic, as the two runs above show.
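For reference, here is a minimal sketch (my addition, not part of the original report) of the workaround path: cast the vector column to pl.Array(pl.Float64, 2) before aggregating, reusing input_df and mean_array_loop from the reproducible example above.

import polars as pl

# Cast the List column to a fixed-width Array; the issue reportedly does
# not reproduce with pl.Array.
array_df = input_df.with_columns(pl.col("vector").cast(pl.Array(pl.Float64, 2)))

out = (
    array_df.group_by("index")
    .agg(vectors=pl.col("vector"))
    .with_columns(mean=mean_array_loop("vectors"))
)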

Background

I came across this when trying to calculate the average vector per group.

For example, say I have the following data:

| user_id | vector  |
|---------|---------|
| 0       | [1, 3]  |
| 0       | [5, 2]  |
| 0       | [10, 7] |
| 1       | [9, 3]  |
| 1       | [3, 2]  |

I wanted to group by user_id and take the element-wise average of the elements in the vector.

The result should be:

| user_id | vector    |
|---------|-----------|
| 0       | [5.33, 4] |
| 1       | [6, 2.5]  |
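As an aside (my addition, not from the original report), the same element-wise mean can be computed with pure Polars expressions, avoiding map_elements entirely, under the assumption that all vectors in a group have the same length. Using the index column from the reproducible example above:

import polars as pl

# Explode each vector together with its element position, average per
# (index, position) pair, then re-assemble the list in position order.
elementwise_mean = (
    input_df.with_columns(pos=pl.int_ranges(0, pl.col("vector").list.len()))
    .explode("vector", "pos")
    .group_by("index", "pos")
    .agg(pl.col("vector").mean())
    .sort("index", "pos")
    .group_by("index", maintain_order=True)
    .agg(mean=pl.col("vector"))
)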

Expected behavior

The expected behavior is that to_numpy() returns a correct numpy representation of each row in the Series, so that the aggregation above gives the same correct result on every run.
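To make that concrete, a small check (my addition, using get_output_df from the reproducible example) that should pass once the behaviour is correct:

# Repeated runs should produce identical means once sorted by the group key.
a = get_output_df().sort("index")
b = get_output_df().sort("index")
assert a.select("mean").equals(b.select("mean"))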

Installed versions

```
----Optional dependencies----
adbc_driver_manager:  0.8.0
cloudpickle:          2.2.1
connectorx:           0.3.2
deltalake:            0.14.0
fastexcel:
fsspec:               2023.12.2
gevent:               23.9.1
hvplot:               0.9.2
matplotlib:           3.6.0
nest_asyncio:         1.5.5
numpy:                1.23.3
openpyxl:
pandas:               1.5.0
pyarrow:              9.0.0
pydantic:             2.5.2
pyiceberg:            0.5.1
pyxlsb:
sqlalchemy:           1.4.41
torch:                2.3.0
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9
```
bellati commented 1 month ago

A workaround I am using for now is casting the lists to arrays:

def mean_array_loop(colname, dims):
    return (
        pl.col(colname)
        .cast(pl.List(pl.Array(pl.Float64, dims)))
        .map_elements(mean_array, return_dtype=pl.List(pl.Float64))
    )
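As far as I can tell (my guess, not confirmed in the thread), the cast helps because a fixed-width Array Series converts through a different to_numpy() path, yielding a regular 2-D float array instead of an object array of sub-arrays:

import polars as pl

# Illustrative only: a List series cast to Array converts to a plain 2-D ndarray.
s = pl.Series("v", [[1.0, 3.0], [5.0, 2.0]]).cast(pl.Array(pl.Float64, 2))
s.to_numpy()  # expected: a (2, 2) float64 ndarray, not an object array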
cjackal commented 1 month ago

It seems the culprit is .to_numpy(), not the .map_elements call. Selecting a single cell (i.e. one value of the DataFrame) from the list-of-lists column already produces a broken Series of dtype list:

>>> _df = input_df.group_by(
...     "index"
... ).agg(
...     vectors=pl.col("vector")
... )
>>> _df
shape: (2, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ vectors                         │
│ ---   ┆ ---                             │
│ i32   ┆ list[list[f64]]                 │
╞═══════╪═════════════════════════════════╡
│ 0     ┆ [[1.0, 3.0], [5.0, 2.0], [10.0… │
│ 1     ┆ [[9.0, 3.0], [3.0, 2.0]]        │
└───────┴─────────────────────────────────┘

>>> _df[0, "vectors"].to_numpy()
array([array([1., 3.]), array([5., 2.]), array([10.,  7.])], dtype=object)  # Correct
>>> _df[1, "vectors"].to_numpy()
array([array([1., 3.]), array([5., 2.])], dtype=object)  # ???? the array is slicing the first 2 rows, not the last 2 rows

Going through pyarrow's .to_numpy() bypasses this issue:

>>> _df[0, "vectors"].to_arrow().to_numpy(zero_copy_only=False)
array([array([1., 3.]), array([5., 2.]), array([10.,  7.])], dtype=object)
>>> _df[1, "vectors"].to_arrow().to_numpy(zero_copy_only=False)
array([array([9., 3.]), array([3., 2.])], dtype=object)
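Building on that, a possible interim fix for the original mean_array would be to route the conversion through pyarrow. This is my sketch, not verified across polars versions:

import numpy as np
import polars as pl

def mean_array_via_arrow(x: pl.Series) -> pl.Series:
    # Convert via pyarrow to sidestep the incorrectly sliced Series.to_numpy().
    arr = x.to_arrow().to_numpy(zero_copy_only=False)
    return pl.Series(np.mean(arr, axis=0))

It is identical to the original mean_array apart from the conversion path, so it should drop into mean_array_loop unchanged.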