kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

Arrow chunk_size as keyword argument #3084

Closed · prrao87 closed this 7 months ago

prrao87 commented 7 months ago

Closes #2998.

As per @ray6080's comment, setting the Arrow record batch size (which we call chunk_size) to 1M rows is a reasonable default, since DuckDB does the same. In most cases a larger record batch size is preferable, and the user can always lower chunk_size if necessary.

I also fixed the pyarrow tests so they no longer specify the chunk_size argument as low integer values.
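For reference, a minimal usage sketch of the parameter under discussion (the database path, schema, and query below are illustrative, not from this PR):

```python
import kuzu

db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)
result = conn.execute("MATCH (p:Person) RETURN p.name, p.age")

# chunk_size controls the Arrow record batch size; with this PR it
# defaults to 1M rows, matching DuckDB. Pass a smaller value to bound
# the memory used per batch.
table = result.get_as_arrow(chunk_size=1_000_000)
print(table.num_rows)
```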

prrao87 commented 7 months ago

@mewim what do you think about moving the get_as_pl adaptive chunk size estimation logic for Polars over to the get_as_arrow method? That way, Polars could benefit from the Arrow logic directly and would not have to specify it independently.

codecov[bot] commented 7 months ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.71%. Comparing base (f8fe205) to head (638763b).

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master    #3084   +/-   ##
=======================================
  Coverage   92.71%   92.71%
=======================================
  Files        1162     1162
  Lines       43140    43140
=======================================
+ Hits        39997    39999       +2
+ Misses       3143     3141       -2
```


prrao87 commented 7 months ago

> moving the get_as_pl adaptive chunk size estimation logic for Polars over to the get_as_arrow method

I just made those changes and built and tested locally; it works well 👌🏽.

alexander-beedie commented 7 months ago

> fixed the pyarrow tests so they no longer specify the chunk_size argument as low integer values.

FYI: I believe they were specified this way in order to validate result acquisition across chunk boundaries.

Also, the meaning of chunk_size has changed here: the docs say it refers to the number of rows, but by moving the logic from inside the Polars function into the Arrow one and exposing it on the existing parameter, the actual row count will now only match chunk_size when the result has a single column 😅

As per https://github.com/kuzudb/kuzu/issues/2998#issuecomment-1986014465, you could instead allow opt-in to the adaptive behaviour by supporting a chunk_size=None option, otherwise maintaining the existing behaviour (where chunk_size=n_rows).

This would allow for adaptive behaviour without changing the meaning of integer chunk_size (so not a breaking change). Indeed, this could even be the new default: as the parameter currently has to be specified, you could change the default to chunk_size=None (i.e., adaptive by default) without any existing code being impacted 👍
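To make the column-count point concrete, here is a plausible sketch of the kind of adaptive estimation being described; the exact heuristic inside get_as_pl is not shown in this thread, so the cell-budget formula below is an assumption:

```python
def adaptive_chunk_size(target_cells: int, num_columns: int) -> int:
    # Hypothetical heuristic: treat the target as a budget of cells
    # (rows x columns), so rows per batch shrink as the column count
    # grows. This is why the row count would only match the target for
    # a single-column result.
    return max(target_cells // max(num_columns, 1), 1)

assert adaptive_chunk_size(1_000_000, 1) == 1_000_000  # one column: rows == target
assert adaptive_chunk_size(1_000_000, 4) == 250_000    # wider result: fewer rows
```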

prrao87 commented 7 months ago

Hi @alexander-beedie, great points.

  1. It makes sense to revert to the small chunk size numbers in the tests so as to validate result acquisition across chunk boundaries, as you mentioned.
  2. For the opt-in behaviour for get_as_pl as you mentioned (where chunk_size=None), we would still perform the adaptive logic in the get_as_arrow method, correct?
  3. I'm not sure what you meant by the last comment (the "new default") not being a breaking change. Do you mean a breaking change w.r.t. the old approach that you committed, or w.r.t. this new way where we do the adaptive logic inside get_as_arrow?

In any case, I think we're converging towards an agreeable and better solution, so thanks again :)

mewim commented 7 months ago

I would probably keep the chunk size as referring to the number of rows and use an adaptive chunk size only if the chunk size is set to None / 0 / -1.

prrao87 commented 7 months ago

@mewim I reworked it according to your latest comment. What do you think?

alexander-beedie commented 7 months ago

> I would probably keep the chunk size as referring to the number of rows and use an adaptive chunk size only if the chunk size is set to None / 0 / -1.

I'd reserve chunk_size = 0/-1 for a new mode that guarantees only a single chunk (aka no chunks?) is produced. This allows None to be adaptive, an integer to continue doing what it currently does, and if/when the functionality is added lower down to produce an unchunked Arrow result, we could enable 0/-1 for that mode 🤔

> I'm not sure what you meant by the last comment (the "new default") not being a breaking change.

Just that the way the PR was initially written changed the meaning of chunk_size, so that everyone currently using it would start getting back results that were not chunked the way they expected (which would be a breaking change). If the new default were None then all current usage would continue to behave identically (so the change in default would not be breaking); only new usage that omits an integer chunk_size would get the new behaviour. This allows for a seamless, non-breaking transition.
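Putting the proposal together, here is a sketch of the parameter semantics described above; the function name is hypothetical, and the single-chunk branch is a placeholder since the lower-level support didn't exist yet at this point in the thread:

```python
from typing import Optional

def resolve_chunk_size(chunk_size: Optional[int], num_columns: int) -> int:
    # Sketch of the semantics proposed in this comment (names hypothetical).
    if chunk_size is None:
        # Adaptive by default: safe as a new default, since the parameter
        # previously had to be passed explicitly by callers.
        return max(1_000_000 // max(num_columns, 1), 1)
    if chunk_size in (0, -1):
        # Reserved for a future single-chunk mode; not yet supported by
        # the lower-level code at the time of this comment.
        raise NotImplementedError("reserved for a single-chunk mode")
    # Any other integer keeps its existing rows-per-batch meaning.
    return chunk_size
```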

prrao87 commented 7 months ago

@alexander-beedie how does -1 produce an "unchunked" result? To my understanding, Arrow will always return record batches, i.e., chunks of records? The cases where it's None or a positive integer make sense.

Could you maybe post a snippet of how you'd use it here?

alexander-beedie commented 7 months ago

> @alexander-beedie how does -1 produce an "unchunked" result? To my understanding, Arrow will always return record batches, i.e., chunks of records?

It doesn't/can't at the moment, as the current low-level code will always produce chunked results (the adaptive mode improves the situation, but isn't a guarantee). The idea would be to always produce a single chunk that represents the complete result set.

We'd use such a mode in Polars, as otherwise we typically rechunk Arrow data that arrives with n_chunks > 1 (it's more efficient for us to operate on contiguous data).
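For context, pyarrow can already flatten a multi-chunk table after the fact; a guaranteed single-chunk mode upstream would simply avoid this extra copy (the toy table below is illustrative):

```python
import pyarrow as pa

t = pa.table({"a": [1, 2, 3]})
chunked = pa.concat_tables([t, t])   # each column now holds 2 chunks
flat = chunked.combine_chunks()      # copy into one contiguous chunk
assert chunked.column("a").num_chunks == 2
assert flat.column("a").num_chunks == 1
```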

prrao87 commented 7 months ago

Hmm, so it's a Polars-specific setting, where we'd document that setting chunk_size to -1 is an option. But as per the Polars docs for the from_arrow method, this would require that we specify rechunk=False when the user passes -1 to get_as_pl, correct?

prrao87 commented 7 months ago

I think c26fc74 addresses your comments @alexander-beedie. We can use the get_num_tuples() method to return the entire result set as a single chunk when the user specifies 0 or -1.

This will have to be documented carefully, though!
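A sketch of the resolution described above; the wrapper name is hypothetical, while get_num_tuples() is the QueryResult method mentioned in the comment:

```python
def effective_chunk_size(chunk_size: int, query_result) -> int:
    # Map the 0 / -1 sentinel onto the full row count, so the whole
    # result set comes back as a single record batch.
    if chunk_size in (0, -1):
        return query_result.get_num_tuples()
    return chunk_size
```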

alexander-beedie commented 7 months ago

> as per the Polars docs for the from_arrow method, this would require that we specify rechunk=False when the user passes -1 to get_as_pl, correct?

It's a no-op unless n_chunks > 1, so no need to set explicitly; can leave as-is.
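That is, with Polars' from_arrow, rechunk=True (the default) only copies when there is more than one chunk, so a single-chunk result passes through untouched (the toy table below is illustrative):

```python
import polars as pl
import pyarrow as pa

single = pa.table({"a": [1, 2, 3]})  # one chunk per column
df = pl.from_arrow(single)           # rechunk=True by default: a no-op here
print(df.shape)                      # (3, 1)
```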

mewim commented 7 months ago

@prrao87 Note that when merging to master, the commits need to be squashed into one.

prrao87 commented 7 months ago

Shall I go ahead and squash-merge?

mewim commented 7 months ago

Yeah, I think it's ready to merge.

alexander-beedie commented 7 months ago

> We can use the get_num_tuples() method to return the entire result set as a single chunk

@prrao87: Somehow I completely missed this method; great solution ✌️😎