Closed: prrao87 closed this pull request 7 months ago.
@mewim what do you think about moving the `get_as_pl` adaptive chunk size estimation logic for Polars over to the `get_as_arrow` method? That way Polars could just benefit from the Arrow logic directly and not have to specify it independently.
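To make the idea concrete, here is a minimal sketch of what that delegation could look like. It is not kuzu's actual implementation: the cell-budget heuristic, the `TARGET_CELLS` constant, and the standalone functions standing in for the `QueryResult` methods are all assumptions for illustration.

```python
import math
from typing import Optional

import pyarrow as pa
import polars as pl

TARGET_CELLS = 1_000_000  # hypothetical budget of values per record batch


def adaptive_chunk_size(num_columns: int) -> int:
    """Estimate rows per batch by spreading a fixed cell budget over the columns."""
    return max(1, math.ceil(TARGET_CELLS / max(1, num_columns)))


def get_as_arrow(table: pa.Table, chunk_size: Optional[int] = None) -> pa.Table:
    """Stand-in for QueryResult.get_as_arrow: rebatch with the chosen chunk size."""
    if chunk_size is None:
        chunk_size = adaptive_chunk_size(table.num_columns)
    return pa.Table.from_batches(table.to_batches(max_chunksize=chunk_size))


def get_as_pl(table: pa.Table) -> pl.DataFrame:
    """Stand-in for QueryResult.get_as_pl: reuse the Arrow path, no separate estimate."""
    return pl.from_arrow(get_as_arrow(table))
```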
All modified and coverable lines are covered by tests :white_check_mark: Project coverage is 92.71%, comparing base (f8fe205) to head (638763b).
> moving the `get_as_pl` adaptive chunk size estimation logic for Polars over to the `get_as_arrow` method

I just made those changes and built & tested locally, works well 👌🏽.
I also fixed the pyarrow tests to not specify the `chunk_size` argument as low integer values.
FYI: I believe they were specified this way in order to validate result acquisition across chunk boundaries.
Also the meaning of `chunk_size` changed here - the docs say it refers to the number of rows, but by moving the logic from inside the Polars function into the Arrow one and exposing it on the existing parameter, the actual row count is now only going to match `chunk_size` if you have a single column 😅

As per https://github.com/kuzudb/kuzu/issues/2998#issuecomment-1986014465, you could instead allow opt-in to the adaptive behaviour by supporting a `chunk_size=None` option, otherwise maintaining the existing behaviour (where `chunk_size=n_rows`). This would allow for adaptive behaviour without changing the meaning of an integer `chunk_size` (so not a breaking change).

Indeed, this could even be the new default - as the parameter currently has to be specified, you could change the default to `chunk_size=None` (eg: adaptive by default) without any existing code being impacted 👍
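To illustrate the point above about the changed meaning, here is a small numeric example of how a cell-budget heuristic behaves when it is exposed on the existing `chunk_size` parameter (the integer-division heuristic is an assumption used only for illustration):

```python
# With the heuristic exposed directly on chunk_size, the requested value acts as
# a cell budget rather than a row count, so rows per batch shrink as columns grow.
chunk_size = 2048  # what the caller asks for

for num_columns in (1, 2, 8):
    rows_per_batch = max(1, chunk_size // num_columns)
    print(f"{num_columns} column(s) -> {rows_per_batch} rows per batch")

# 1 column(s) -> 2048 rows per batch   (matches chunk_size only here)
# 2 column(s) -> 1024 rows per batch
# 8 column(s) -> 256 rows per batch
```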
Hi @alexander-beedie, great points. If we support `chunk_size=None` in `get_as_pl` as you mentioned, we would still perform the adaptive logic in the `get_as_arrow` method, correct? In any case, I think we're converging towards an agreeable and better solution, so thanks again :)
I would probably keep the chunk size as referring to number of rows and use adaptive chunk size only if chunk size is set to None / 0 / -1.
@mewim I reworked it according to your latest comment. What do you think?
> I would probably keep the chunk size as referring to number of rows and use adaptive chunk size only if chunk size is set to None / 0 / -1.
I'd reserve `chunk_size = 0/-1` for a new mode that guarantees only a single chunk (aka: no chunks?) is produced. This allows `None` to be adaptive, an integer to continue doing what it currently does, and if/when the functionality is added lower down to produce an unchunked Arrow result we could enable 0/-1 for that mode 🤔
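A minimal sketch of how that dispatch could look inside `get_as_arrow` (the function name is illustrative, and `adaptive_chunk_size` refers to the hypothetical helper sketched earlier, not kuzu's actual code):

```python
from typing import Optional


def resolve_chunk_size(chunk_size: Optional[int], num_tuples: int, num_columns: int) -> int:
    """Map the proposed chunk_size values onto a concrete rows-per-batch count."""
    if chunk_size is None:
        # Adaptive mode: estimate rows per batch from the column count
        # (adaptive_chunk_size is the hypothetical helper sketched earlier).
        return adaptive_chunk_size(num_columns)
    if chunk_size in (0, -1):
        # Single-chunk mode: one record batch spanning the whole result set.
        return max(1, num_tuples)
    # Positive integer: keep the existing meaning, i.e. rows per batch.
    return chunk_size
```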
I'm not sure what you meant by the last comment (the "new default") not being a breaking change.
Just that the way the PR was initially written changed the meaning of `chunk_size`, so that everyone currently using it would start getting back results that were not chunked the way that they expected (which would be a breaking change). If the new default was `None` then all current usage would continue to behave identically (so the change in default would not be breaking); only new usage that omits an integer `chunk_size` would get the new behaviour. Allows for a seamless/non-breaking transition.
@alexander-beedie how does -1 produce an "unchunked" result? To my understanding Arrow will always return record batches, i.e., chunks of records? The cases where it's `None` or a positive integer make sense. Could you maybe post a snippet of how you'd use it here?
> @alexander-beedie how does -1 produce an "unchunked" result? To my understanding Arrow will always return record batches, i.e., chunks of records?
It doesn't/can't at the moment as the current low-level code will always produce chunked results (the adaptive mode improves the situation, but isn't a guarantee). The idea would be to always produce a single chunk that represents the complete result set.
We'd use such a mode in Polars, as otherwise we typically rechunk Arrow data that arrives with n_chunks > 1 (it's more optimal for us to operate on contiguous data).
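For context, a small self-contained illustration of that rechunking step using public PyArrow/Polars APIs (the data is made up):

```python
import pyarrow as pa
import polars as pl

# An Arrow table that arrives split into several record batches (chunks).
tbl = pa.Table.from_batches(
    pa.table({"x": list(range(10))}).to_batches(max_chunksize=3)
)
print(tbl.column("x").num_chunks)  # 4

# Default: Polars rechunks multi-chunk input into contiguous memory.
df = pl.from_arrow(tbl)

# rechunk=False keeps the incoming chunk layout as-is.
df_as_is = pl.from_arrow(tbl, rechunk=False)
```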
Hmm, so it's a Polars-specific setting where we'd document that setting `chunk_size` to -1 is an option. But as per the Polars docs for the `from_arrow` method, this would require that we specify `rechunk=False` when the user specifies -1 to `get_as_pl`, correct?
I think c26fc74 addresses your comments @alexander-beedie. We can use the `get_num_tuples()` method to return the entire result set as a single chunk when the user specifies 0 or -1. This will have to be documented carefully, though!
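A rough sketch of that resolution step (the function name is illustrative, while `get_num_tuples()` is the existing QueryResult method referred to above):

```python
def resolve_single_chunk(query_result, chunk_size: int) -> int:
    """Sketch: map 0 / -1 onto the full result size so one record batch is produced."""
    if chunk_size in (0, -1):
        return query_result.get_num_tuples()
    return chunk_size
```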
> as per the Polars docs for the `from_arrow` method, this would require that we specify `rechunk=False` when the user specifies -1 to `get_as_pl`, correct?
It's a no-op unless n_chunks > 1, so no need to set explicitly; can leave as-is.
@prrao87 Note that when merging to master, the commits need to be squashed into one.
Shall I go ahead and squash-merge?
Yeah I think it is ready to merge
> We can use the `get_num_tuples()` method to return the entire result set as a single chunk

@prrao87: Somehow I completely missed this method; great solution ✌️😎
Closes #2998.

As per @ray6080's comment, setting the Arrow record batch size (which we call `chunk_size`) to 1M is a reasonable default because DuckDB does the same. In the majority of cases, a larger record batch size is favourable, and the user can always bring down the `chunk_size` if necessary.

I also fixed the pyarrow tests to not specify the `chunk_size` argument as low integer values.
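As a usage sketch of the merged behaviour described above (the database path, schema, and query are illustrative; the 1M-row default and the 0/-1 single-chunk mode are as discussed in this thread):

```python
import kuzu

db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)
res = conn.execute("MATCH (p:Person) RETURN p.name, p.age;")

# Default: record batches of up to 1M rows each.
tbl = res.get_as_arrow()

# Other values discussed above (each call would need a fresh QueryResult):
#   res.get_as_arrow(chunk_size=10_000)  # explicit rows per batch
#   res.get_as_arrow(chunk_size=-1)      # whole result as a single chunk
```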