ruiyang2015 opened this issue 1 month ago
It's possible that we're not encoding the chunk size in the right way. I recall there being some complexity around how paged results are related to chunks.
Maybe @tswast knows: is it possible to get back an exact chunk size (modulo the last chunk which will be <= the requested chunk size)?
This has to be done via the `page_size` parameter on `QueryJob.result` or `query_and_wait`.
As far as I can tell we're doing this correctly in Ibis.
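For reference, a minimal sketch of passing `page_size` through the google-cloud-bigquery client directly (the query and sizes here are illustrative, not from the original report):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

job = client.query(
    "SELECT name, number FROM `bigquery-public-data.usa_names.usa_1910_2013`"
)
rows = job.result(page_size=100_000)  # a request; the API treats it as a maximum

# The newer one-shot helper exposes the same knob:
# rows = client.query_and_wait("SELECT ...", page_size=100_000)

for page in rows.pages:
    print(page.num_items)  # actual rows per page, often capped below the request
```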
A few things to note: I believe setting a page size currently means one must use the BigQuery REST API (though I'm having trouble confirming this in https://github.com/googleapis/python-bigquery/blob/7372ad659fd3316a602e90f224e9a3304d4c1419/google/cloud/bigquery/table.py#L1699). The REST API also treats the page size as a maximum, not a target. In my experience, the BigQuery REST API caps pages at around 10 MB where possible, unless individual rows are so large that bigger responses are needed.
Even if we could use the BQ Storage Read API, it's not really possible to tune the message size with that either. A read session can configure the number of streams, but not the size of the messages in each stream.
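To illustrate, a sketch of the Storage Read API surface: `max_stream_count` is configurable when creating a session, but nothing controls the size of the messages each stream yields (the project and table paths below are placeholders):

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

session = client.create_read_session(
    parent="projects/my-project",  # placeholder
    read_session=types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/my_table",  # placeholder
        data_format=types.DataFormat.ARROW,
    ),
    max_stream_count=4,  # parallelism is tunable...
)

# ...but each stream decides its own message/batch sizes.
reader = client.read_rows(session.streams[0].name)
for page in reader.rows().pages:
    print(page.num_items)
```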
Would it make sense for Ibis to do some client-side grouping to respect this parameter? I've viewed page size / chunk size as more of a tuning parameter, so it might be misleading to return chunks larger than what the backend can provide.
It might be useful to do that, but it seems like it could be counterproductive to incur the overhead of re-batching (though I'm not sure how expensive that is).
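For concreteness, client-side re-batching could look something like this sketch (the helper name and target size are made up; the `combine_chunks` call is where the copying overhead shows up):

```python
import pyarrow as pa

def rebatch(batches, target_rows):
    """Regroup a stream of small RecordBatches into ~target_rows batches."""
    buffer, buffered = [], 0
    for batch in batches:
        buffer.append(batch)
        buffered += batch.num_rows
        while buffered >= target_rows:
            table = pa.Table.from_batches(buffer)
            # combine_chunks copies the buffered data into contiguous arrays
            head = table.slice(0, target_rows).combine_chunks()
            yield head.to_batches(max_chunksize=target_rows)[0]
            rest = table.slice(target_rows)
            buffer, buffered = rest.to_batches(), rest.num_rows
    if buffered:
        yield pa.Table.from_batches(buffer).combine_chunks().to_batches()[0]
```

Whether that copy is cheaper than simply living with the backend's ~4k-row batches probably depends on the workload.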
What happened?
For the following code against BigQuery:

In our case, each record batch returned is always ~4k rows instead of the larger batches we expect. We tried the same code with DuckDB and Snowflake; both return properly sized PyArrow tables.
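A minimal sketch of the kind of code described, assuming a hypothetical project, dataset, and table:

```python
import ibis

con = ibis.bigquery.connect(project_id="my-project", dataset_id="my_dataset")  # hypothetical
t = con.table("my_table")  # hypothetical

# Request ~1M-row batches; on BigQuery each batch still arrives with ~4k rows.
for batch in t.to_pyarrow_batches(chunk_size=1_000_000):
    print(batch.num_rows)
```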
What version of ibis are you using?
9.0.0
What backend(s) are you using, if any?
BigQuery
Relevant log output
No response
Code of Conduct
- [x] I agree to follow this project's Code of Conduct