pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

fix(rust,python): Harden `Series.reshape` against invalid parameters #16281

Closed datenzauberai closed 2 weeks ago

datenzauberai commented 2 weeks ago

This is a fix for https://github.com/pola-rs/polars/issues/15543

Improvements:

codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 80.81%. Comparing base (11fe9d8) to head (a6e343f).

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##             main   #16281   +/-   ##
==========================================
+ Coverage   80.80%   80.81%   +0.01%
==========================================
  Files        1393     1393
  Lines      179406   179406
  Branches     2921     2921
==========================================
+ Hits       144971   144989      +18
+ Misses      33932    33914      -18
  Partials      503      503
```


datenzauberai commented 2 weeks ago

pl.List is first exploded and then reshaped according to the spec:

print(pl.DataFrame(pl.Series(name="a", values=[[1, 2, 3, 4]], dtype=pl.List(pl.Int64))).select(pl.col("a").reshape((2,2))))
print(pl.DataFrame(pl.Series(name="a", values=[[1], [2], [3], [4]], dtype=pl.List(pl.Int64))).select(pl.col("a").reshape((2,2))))
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
│ [3, 4]    │
└───────────┘
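To make the "explode, then reshape" reading concrete, here is a minimal pure-Python sketch of it. It is a toy model only, not the Polars implementation: the helper name `explode_then_reshape` is hypothetical, and it assumes a two-dimensional target shape with no inferred dimensions.

```python
from itertools import chain


def explode_then_reshape(rows, shape):
    """Toy model: flatten one level of nesting, then cut into rows of shape[1]."""
    n_rows, width = shape
    # "Explode": concatenate all inner lists into one flat sequence.
    flat = list(chain.from_iterable(rows))
    # The flat length must match the requested shape exactly.
    if n_rows * width != len(flat):
        raise ValueError(f"cannot reshape len {len(flat)} into shape {list(shape)}")
    # "Reshape": slice the flat sequence into n_rows chunks of `width` items.
    return [flat[i * width:(i + 1) * width] for i in range(n_rows)]


# Both inputs flatten to the same four values before reshaping.
print(explode_then_reshape([[1, 2, 3, 4]], (2, 2)))
print(explode_then_reshape([[1], [2], [3], [4]], (2, 2)))
```

Under this model the original nesting of the list column does not matter; only the flattened values and the target shape do.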

pl.Array is currently treated differently, more like a scalar datatype:

print(pl.DataFrame(pl.Series(name="a", values=[[1, 2, 3, 4]], dtype=pl.Array(pl.Int64, width=4))).select(pl.col("a").reshape((2, 2))))
ComputeError: cannot reshape len 1 into shape [2, 2]
print(pl.DataFrame(pl.Series(name="a", values=[[1], [2], [3], [4]], dtype=pl.Array(pl.Int64, width=1))).select(pl.col("a").reshape((2, 2))))
┌─────────────────────┐
│ a                   │
│ ---                 │
│ list[array[i64, 1]] │
╞═════════════════════╡
│ [[1], [2]]          │
│ [[3], [4]]          │
└─────────────────────┘
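The scalar-like treatment of pl.Array shown above can be modelled the same way. Again this is only a pure-Python sketch with a hypothetical helper name (`reshape_without_exploding`), assuming a two-dimensional target shape: each inner array counts as one element, so only the outer length is compared against the shape.

```python
def reshape_without_exploding(rows, shape):
    """Toy model: treat each inner array as one opaque element and group them."""
    n_rows, width = shape
    # Only the outer length counts; inner arrays are never flattened.
    if n_rows * width != len(rows):
        raise ValueError(f"cannot reshape len {len(rows)} into shape {list(shape)}")
    return [rows[i * width:(i + 1) * width] for i in range(n_rows)]


# One array of width 4 is a single element, so a (2, 2) reshape fails.
try:
    reshape_without_exploding([[1, 2, 3, 4]], (2, 2))
except ValueError as exc:
    print(exc)

# Four arrays of width 1 are four elements, grouped two per row.
print(reshape_without_exploding([[1], [2], [3], [4]], (2, 2)))
```

The failing case mirrors the `cannot reshape len 1 into shape [2, 2]` error, and the second call keeps the width-1 arrays intact inside each row.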

We could explode pl.Array(inner_dtype) as well and then return pl.List(inner_dtype), but one could also argue that not exploding is the correct behavior, which makes sense if pl.Array is used for struct-like values.

pl.Array would even be a much more natural output type than pl.List for this operation, but it could not handle an empty series (there is no pl.Array(width=0)), and I guess that would be too much of a change...