pola-rs / pyo3-polars

Plugins/extension for Polars
MIT License
238 stars 39 forks source link

Forced rechunking #32

Open aldanor opened 11 months ago

aldanor commented 11 months ago

This took me a while to figure out (since this was the last place I'd expect a forced rechunk to happen) - while passing huge frames from Python to Rust and back, noticed that they end up arriving in one chunk even if they were multi-chunked originally.

Is there any reason to not leave rechunking to the end-user? (since in some cases it may end up being very detrimental)

https://github.com/pola-rs/pyo3-polars/blob/0165cb4f6795bf9a08fe04473aba408a02119b91/pyo3-polars/src/lib.rs#L121

... and also this: https://github.com/pola-rs/pyo3-polars/blob/0165cb4f6795bf9a08fe04473aba408a02119b91/pyo3-polars/src/lib.rs#L163

aldanor commented 11 months ago

Ok, edit, after a bit of reading...

IntoPy: basically, creates a pa.Array via pa.Array._import_from_c. In this case, if we have multiple chunks, can we simply do that for each chunk and then call pa.chunked_array(chunks)?

FromPyObject: more important but a bit more obscure:

ritchie46 commented 11 months ago

We should return ChunkedArrays's to arrow. That we don't probably was a bit of lazyness when I implemented this.

aldanor commented 11 months ago

We should return ChunkedArrays's to arrow.

To arrow or from arrow? 🙂 (i.e. IntoPy or FromPyObject?)

If I'm reading it right btw, chunked-array API is not part of arrow's stable C API, is that part of the problem here?

// Yea, in some cases, this kind of rechunking may be catastrophic, e.g. if your dataframes are 50-100 GB, rechunk is the last thing you want to happen behind the scenes...

ritchie46 commented 11 months ago

But we could return a list of arrow arrays. 🤔 And then even use that to create a pyarrow ChunkedArray.

aldanor commented 11 months ago

Yea, I think that should work.

Also a question then whether a single-chunk case should be special-cased or not (should it yield a list of one and produce a chunked array with a single batch, or a plain array)

ritchie46 commented 11 months ago

I think we can add a rechunk parameter and return an array if rechunk=True and otherwise always a list of arrays.

aldanor commented 10 months ago

That sounds reasonable. The default being no rechunking?

ritchie46 commented 10 months ago

That sounds reasonable. The default being no rechunking?

Yes. Default to not exploding your memory. 🙈