Open aldanor opened 11 months ago
Ok, edit, after a bit of reading...
IntoPy:
basically, creates a pa.Array
via pa.Array._import_from_c
. In this case, if we have multiple chunks, can we simply do that for each chunk and then call pa.chunked_array(chunks)
?
pl.from_arrow()
seems to squash pa.ChunkedArray
somewhere along the way anyways...FromPyObject
: more important but a bit more obscure:
Series.to_arrow()
which always returns contiguous result
PySeries::to_arrow()
calls self.rechunk(true)
DataFrame.to_arrow()
, however, returns a chunked Table
(!)
FromPyObject
for PyDataFrame
doesn't/can't use it and instead does it series-by-series so you end up with rechunked columns anywaysWe should return ChunkedArrays's
to arrow. That we don't probably was a bit of lazyness when I implemented this.
We should return ChunkedArrays's to arrow.
To arrow or from arrow? 🙂 (i.e. IntoPy or FromPyObject?)
If I'm reading it right btw, chunked-array API is not part of arrow's stable C API, is that part of the problem here?
// Yea, in some cases, this kind of rechunking may be catastrophic, e.g. if your dataframes are 50-100 GB, rechunk is the last thing you want to happen behind the scenes...
But we could return a list of arrow arrays. 🤔 And then even use that to create a pyarrow ChunkedArray.
Yea, I think that should work.
Also a question then whether a single-chunk case should be special-cased or not (should it yield a list of one and produce a chunked array with a single batch, or a plain array)
I think we can add a rechunk
parameter and return an array if rechunk=True
and otherwise always a list of arrays.
That sounds reasonable. The default being no rechunking?
That sounds reasonable. The default being no rechunking?
Yes. Default to not exploding your memory. 🙈
This took me a while to figure out (since this was the last place I'd expect a forced rechunk to happen) - while passing huge frames from Python to Rust and back, noticed that they end up arriving in one chunk even if they were multi-chunked originally.
Is there any reason to not leave rechunking to the end-user? (since in some cases it may end up being very detrimental)
https://github.com/pola-rs/pyo3-polars/blob/0165cb4f6795bf9a08fe04473aba408a02119b91/pyo3-polars/src/lib.rs#L121
... and also this: https://github.com/pola-rs/pyo3-polars/blob/0165cb4f6795bf9a08fe04473aba408a02119b91/pyo3-polars/src/lib.rs#L163