Open bellati opened 1 month ago
A workaround I am using for now is casting the lists to arrays:
def mean_array_loop(colname, dims):
    return pl.col(colname).cast(pl.List(pl.Array(pl.Float64, dims))).map_elements(mean_array, return_dtype=pl.List(pl.Float64))
It seems the culprit is the .to_numpy() call, not .map_elements. Selecting a single cell (that is, one value of the dataframe) from the list-of-lists column already yields a broken Series of dtype list:
>>> _df = input_df.group_by(
... "index"
... ).agg(
... vectors=pl.col("vector")
... )
>>> _df
shape: (2, 2)
┌───────┬─────────────────────────────────┐
│ index ┆ vectors                         │
│ ---   ┆ ---                             │
│ i32   ┆ list[list[f64]]                 │
╞═══════╪═════════════════════════════════╡
│ 0     ┆ [[1.0, 3.0], [5.0, 2.0], [10.0… │
│ 1     ┆ [[9.0, 3.0], [3.0, 2.0]]        │
└───────┴─────────────────────────────────┘
>>> _df[0, "vectors"].to_numpy()
array([array([1., 3.]), array([5., 2.]), array([10., 7.])], dtype=object) # Correct
>>> _df[1, "vectors"].to_numpy()
array([array([1., 3.]), array([5., 2.])], dtype=object) # Wrong: this returns the first 2 vectors of the column, not the last 2
Going through pyarrow's .to_numpy bypasses this issue:
>>> _df[0, "vectors"].to_arrow().to_numpy(zero_copy_only=False)
array([array([1., 3.]), array([5., 2.]), array([10., 7.])], dtype=object)
>>> _df[1, "vectors"].to_arrow().to_numpy(zero_copy_only=False)
array([array([9., 3.]), array([3., 2.])], dtype=object)
Checks
Reproducible example
Log output
No response
Issue description
What was found out?
There seems to be a problem in version 0.20.31, but not in version 0.20.2, where calls to to_numpy() return wrong versions of the original Series, so the aggregation I need returns wrong results. This seems to happen only when the input_df vector column is a pl.List; the issue seems to be resolved if one uses a pl.Array(pl.Float64, 2) instead of a list. The behaviour is also non-deterministic.
Background
I came across this when trying to calculate the average vector.
For example, say I have the following schema:
I wanted to group by user_id and take the element-wise average of the elements in the vector. The result should be:
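As a library-independent baseline, the group-wise element-wise average described here can be sketched in plain Python. The function name and the sample rows are illustrative assumptions (the values are borrowed from the output shown earlier in the thread); this is not the polars implementation, only a reference to compare results against.

```python
from collections import defaultdict


def elementwise_mean_by_group(rows):
    """Average vectors component by component within each group key.

    `rows` is an iterable of (key, vector) pairs; all vectors that share a
    key are assumed to have the same length.
    """
    groups = defaultdict(list)
    for key, vector in rows:
        groups[key].append(vector)

    result = {}
    for key, vectors in groups.items():
        count = len(vectors)
        # zip(*vectors) yields one tuple per component position.
        result[key] = [sum(component) / count for component in zip(*vectors)]
    return result


sample_rows = [
    (0, [1.0, 3.0]),
    (0, [5.0, 2.0]),
    (0, [10.0, 7.0]),
    (1, [9.0, 3.0]),
    (1, [3.0, 2.0]),
]
means = elementwise_mean_by_group(sample_rows)
print(means[1])  # group 1 -> [6.0, 2.5]
```

Comparing this baseline with the polars result for each group is one way to spot the incorrect rows.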
Expected behavior
The expected behavior is that to_numpy() returns a correct numpy representation of the row in the series.
Installed versions