kuzudb / kuzu

Embeddable property graph database management system built for query speed and scalability. Implements Cypher.
https://kuzudb.com/
MIT License

Make `chunk_size` in `get_as_arrow` an optional keyword argument #2998

Closed prrao87 closed 7 months ago

prrao87 commented 7 months ago

Currently, `chunk_size` is a required positional argument in Python, so a new user only discovers this by hitting a runtime error. Pandas has no such requirement (`get_as_df` takes no arguments). We could instead make `chunk_size` an optional keyword argument with a default of 10000.

In fact, this is what Polars has done as per this discussion. What do you think about this @ray6080? It won't break any existing functionality and 10000 seems like a reasonable default.
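For illustration, here is a minimal sketch of what the optional keyword argument could look like on the Python side (the default value and the internal helper are assumptions, not the actual implementation):

```python
import pyarrow as pa

# Sketch only: chunk_size becomes an optional keyword argument with a default,
# so get_as_arrow() with no arguments works, just like get_as_df().
def get_as_arrow(self, chunk_size: int = 10000) -> pa.Table:
    # _fetch_record_batches is a hypothetical helper standing in for the
    # internal code that pulls results into record batches of chunk_size rows.
    batches = self._fetch_record_batches(rows_per_batch=chunk_size)
    return pa.Table.from_batches(batches)
```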

prrao87 commented 7 months ago

Alternate thought: is the `chunk_size` argument truly necessary? This is worth thinking about so that the Arrow, Pandas, and Polars methods are all equally simple from a user's perspective.

ray6080 commented 7 months ago

Should've given that a better name. It is intended to set `rows_per_record_batch`, as we internally pull data into record batches, which then form the table. I think it is still reasonable to give users the flexibility to set the size of record batches, but I'm happy to hear more feedback on this. (At the very least we should provide a default value; that was a careless decision when I first worked on it 😅)

Just for reference, I took a look at the similar interface in DuckDB:

```
fetch_arrow_table(self: duckdb.duckdb.DuckDBPyConnection, rows_per_batch: int = 1000000) → pyarrow.lib.Table
```

prrao87 commented 7 months ago

No worries, I think the `getAsArrow` parameter from C++ spilled over into Python too :). Renaming the parameter would introduce breaking changes, so do you think it makes sense to leave it as is, but make the Python version a keyword argument with a default value of 1M, just like DuckDB?

alexander-beedie commented 7 months ago

> In fact, this is what Polars has done as per this discussion. What do you think about this @ray6080? It won't break any existing functionality and 10000 seems like a reasonable default.

FYI: I only did this because there is currently no option to get back a single chunk (which we usually prefer, otherwise we need to rechunk the returned data when n_chunks > 1).

Also, in #3009 (a PR I made a few minutes ago) I have actually made the Polars chunk size adaptive, targeting ~10m elements per chunk to improve the chances that we don't have to rechunk. As a general rule it's better to target the total number of elements being loaded rather than the number of rows, because you could have 1 column or 1000 columns; that's a three-orders-of-magnitude difference in how much you're loading (and allocating) if you only target `n_rows` without reference to `n_cols` ;)
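A rough sketch of the element-targeting idea (the target value and function name are illustrative; the referenced PR may differ in detail):

```python
# Adaptive rows-per-batch that targets a total element count (rows * columns)
# rather than a fixed number of rows, so per-batch memory stays roughly
# constant whether the result has 1 column or 1000 columns.
TARGET_ELEMENTS = 10_000_000  # assumed target, in line with the ~10m mentioned above

def adaptive_rows_per_batch(n_cols: int, target_elements: int = TARGET_ELEMENTS) -> int:
    return max(1, target_elements // max(1, n_cols))
```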

alexander-beedie commented 7 months ago

I'd suggest a default of `None`, triggering an adaptive `chunk_size` (similar to the PR referenced above, but perhaps targeting a slightly less aggressive 1m elements), and additionally allowing `-1` for no chunking (which I'd immediately switch to for the Polars frame export). Any other positive integer would continue the current behaviour, so it would be a zero-breakage change.
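One way the suggested semantics could resolve to a concrete rows-per-batch value (a sketch under the assumptions above, not an implemented API):

```python
from typing import Optional

def resolve_chunk_size(chunk_size: Optional[int], n_cols: int, n_rows: int,
                       target_elements: int = 1_000_000) -> int:
    # None  -> adaptive: target a total element count per record batch.
    # -1    -> no chunking: everything in a single record batch.
    # int>0 -> current behaviour: use the value directly as rows per batch.
    if chunk_size is None:
        return max(1, target_elements // max(1, n_cols))
    if chunk_size == -1:
        return max(1, n_rows)
    if chunk_size > 0:
        return chunk_size
    raise ValueError("chunk_size must be None, -1, or a positive integer")
```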

prrao87 commented 7 months ago

Looks great overall! The adaptive chunk size makes a lot of sense from a Polars and Python perspective, but the underlying `get_as_arrow` method would also need to serve other use cases, like a future Arrow-only backend for Pandas, so the default chunk size of 1M (as DuckDB does) makes sense for now, I think? The C++ API requires this parameter, and it makes sense to be as consistent across the APIs for Arrow conversion as possible.

@ray6080 and @mewim will have more thoughts on this too, so will leave it to them.

alexander-beedie commented 7 months ago

> The C++ API requires this parameter, and it makes sense to be as consistent across the APIs for Arrow conversion as possible.

Yup; the parameter would stay the same, you'd just use the number of columns (which I believe would always be known when this is called) to dynamically influence the value you pass to it (if not already explicitly set).

I've long advocated against APIs that target `n_rows`, ever since seeing a real-life case back in my old job where the number of columns returned could vary between one and several thousand while the chunk size (given as `n_rows`) stayed the same, and people were surprised when some result sets crashed because sufficient memory could not be allocated :))

mewim commented 7 months ago

Yeah, I think targeting the number of values makes sense. We should have a way for the user to pass that in, and also a way to return everything as a single chunk.
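If something along these lines were adopted, usage from the Python API might look like this (hypothetical behaviour for `None` and `-1`; the schema and defaults are illustrative):

```python
import kuzu

db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)
result = conn.execute("MATCH (p:Person) RETURN p.name, p.age")  # illustrative schema

table = result.get_as_arrow()                   # hypothetical: adaptive chunk size by default
single = result.get_as_arrow(chunk_size=-1)     # hypothetical: a single chunk, no rechunking needed
batched = result.get_as_arrow(chunk_size=4096)  # explicit rows per batch (current behaviour)
```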