Closed · prrao87 closed this 7 months ago
Alternate thought: is the `chunk_size` argument truly necessary? This is worth thinking about so that the Arrow, Pandas and Polars methods are all just as simple from a user perspective.
Should've given that a better name. It is intended to set `rows_per_record_batch`, as we internally pull data into record batches, which then form the table. I think it is still reasonable to give users the flexibility to set the size of record batches, but happy to hear more feedback on this. (At the least we should provide a default value; that was a careless decision when I first worked on it 😅)
Just for reference, I took a look at the similar interface in DuckDB:

```
fetch_arrow_table(self: duckdb.duckdb.DuckDBPyConnection, rows_per_batch: int = 1000000) → pyarrow.lib.Table
```
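For a concrete sense of what that signature means in practice, here is a minimal usage sketch, relying only on the DuckDB call quoted above (nothing Kùzu-specific):

```python
import duckdb

con = duckdb.connect()
con.execute("SELECT * FROM range(10) AS t(i)")

# rows_per_batch is an optional keyword argument with a 1M default,
# so the common case needs no tuning at all.
tbl = con.fetch_arrow_table()

# Users who care about record-batch sizing can still override it.
tbl_small = con.execute("SELECT * FROM range(10) AS t(i)").fetch_arrow_table(rows_per_batch=4)
```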
No worries, I think the `getAsArrow` parameter from C++ spilled over into Python too :). Renaming the parameter would introduce breaking changes, so do you think it makes sense to leave it as is, but make the Python version a keyword argument with a default value of 1M, just like DuckDB?
FYI: I only did this because there is currently no option to get back a single chunk (which we usually prefer; otherwise we need to rechunk the returned data when `n_chunks > 1`).

Also, in #3009 (a PR I made a few minutes ago) I have actually made the Polars chunk size adaptive, targeting ~10m elements per chunk to improve the chances that we don't have to rechunk. As a general rule it's a better idea to target the total number of elements being loaded rather than the number of rows, because you could have 1 column or 1000 columns; that would be a three-orders-of-magnitude difference in how much you're loading (and allocating) if you only target `n_rows` without reference to `n_cols` ;)
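A minimal sketch of that element-based heuristic (the helper name and the ~10m target are illustrative only, not the actual #3009 code):

```python
def adaptive_chunk_size(n_cols: int, target_elements: int = 10_000_000) -> int:
    """Pick rows per chunk so each chunk holds roughly `target_elements`
    values in total, regardless of how wide the result set is."""
    return max(1, target_elements // max(1, n_cols))

adaptive_chunk_size(n_cols=1)     # 10_000_000 rows for a 1-column result
adaptive_chunk_size(n_cols=1000)  # 10_000 rows for a 1000-column result
```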
I'd suggest a default of `None`, triggering an adaptive `chunk_size` (similar to the PR referenced above, but targeting a slightly less aggressive 1m elements, perhaps?), and additionally allow `-1` for no chunking (which I'd immediately switch to for the polars frame export). Any other positive integer would continue the current behaviour, so it would be a zero-breakage change. To summarise (a small sketch of the dispatch follows the list):
- `chunk_size = None` (adaptive)
- `chunk_size = <n>` (current behaviour)
- `chunk_size = -1` (no chunking, eg: return a single chunk)
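A rough sketch of how those three cases could be resolved internally (hypothetical helper; the 1m element target is the one suggested above):

```python
def resolve_chunk_size(chunk_size: int | None, n_rows: int, n_cols: int) -> int:
    # Hypothetical helper illustrating the proposed semantics.
    if chunk_size is None:
        # Adaptive: target ~1m elements per chunk.
        return max(1, 1_000_000 // max(1, n_cols))
    if chunk_size == -1:
        # No chunking: a single chunk containing every row.
        return max(1, n_rows)
    # Any other positive integer keeps the current behaviour.
    return chunk_size
```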
Looks great overall! The adaptive chunk size makes a lot of sense from a polars and python perspective, but the underlying `get_as_arrow` method would also need to serve other use cases, like a future arrow-only backend for pandas, so the default chunk size of 1M (as DuckDB does) makes sense for now, I think? The C++ API requires this parameter, and it makes sense to be as consistent across the APIs for arrow conversion as possible.

@ray6080 and @mewim will have more thoughts on this too, so I'll leave it to them.
> The C++ API requires this parameter, and it makes sense to be as consistent across the APIs for arrow conversion as possible.
Yup; the parameter would stay the same, you'd just use the number of columns (which I believe would always be known when this is called) to dynamically influence the value you pass to it (if not already explicitly set).

I've long advocated against APIs that target `n_rows`, ever since seeing a real-life case back in my old job where the number of columns returned could vary between one and several thousand but the chunk size (given as `n_rows`) stayed the same, and people were surprised when some result sets crashed because sufficient memory could not be allocated :))
Yeah, I think targeting the number of values makes sense. We should have a way for the user to pass that in, and also a way of returning everything as a single chunk.
Currently, `chunk_size` is a positional argument in Python. This requires a new user to experience a runtime error before realizing this fact. Pandas has no such requirement (`get_as_df` has no arguments). We could instead make `chunk_size` an optional kwarg and set it to a default of 10000. In fact, this is what Polars has done as per this discussion. What do you think about this @ray6080? It won't break any existing functionality, and 10000 seems like a reasonable default.
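For illustration, a minimal sketch of the proposed signature change (hypothetical; the default shown is the proposed one, not the current behaviour):

```python
import pyarrow as pa

class QueryResult:
    # Sketch only: today chunk_size is a required positional argument;
    # the proposal makes it an optional kwarg with a sensible default.
    def get_as_arrow(self, chunk_size: int = 10000) -> pa.Table:
        ...

# result.get_as_arrow()                 # would just work, like get_as_df()
# result.get_as_arrow(chunk_size=2048)  # explicit override still possible
```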