ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org

bug: BigQuery backend table.to_pyarrow_batches is not honoring the chunk_size parameter #10257

ruiyang2015 commented 1 month ago

What happened?

For the following code against the BigQuery backend:

import ibis

c = ibis.bigquery.connect(...)
t = c.table('some table')
for y in t.to_pyarrow_batches(chunk_size=1_000_000):  # <- changing this parameter has no effect
    print(y.num_rows)

In our case, the batches returned are always about 4k rows instead of the larger size we expect. The same code against DuckDB and Snowflake returns properly sized PyArrow batches.

What version of ibis are you using?

9.0.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

No response

cpcloud commented 1 month ago

It's possible that we're not encoding the chunk size in the right way. I recall there being some complexity around how paged results are related to chunks.

Maybe @tswast knows: is it possible to get back an exact chunk size (modulo the last chunk which will be <= the requested chunk size)?

tswast commented 1 month ago

This has to be done via the page_size parameter on QueryJob.result or query_and_wait.
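For reference, a minimal sketch of that path in google-cloud-bigquery (the query string is a placeholder, and piping the result through RowIterator.to_arrow_iterable is my assumption about roughly what Ibis does, not its actual code):

from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT * FROM `project.dataset.table`")  # placeholder query
# page_size bounds how many rows each REST page may carry;
# it is a cap, not a guaranteed batch size
rows = job.result(page_size=1_000_000)
for batch in rows.to_arrow_iterable():  # yields pyarrow.RecordBatch
    print(batch.num_rows)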

As far as I can tell we're doing this correctly in Ibis.

https://github.com/ibis-project/ibis/blob/9f565a9ad98a089fcb25959a88136a6e7bc1c506/ibis/backends/bigquery/__init__.py#L802

https://github.com/ibis-project/ibis/blob/9f565a9ad98a089fcb25959a88136a6e7bc1c506/ibis/backends/bigquery/__init__.py#L680

A few things to note: I believe setting a page size currently means one must use the BigQuery REST API. (Though I'm having trouble confirming this in https://github.com/googleapis/python-bigquery/blob/7372ad659fd3316a602e90f224e9a3304d4c1419/google/cloud/bigquery/table.py#L1699.) The REST API treats the page size as a maximum, not a target. In my experience, where possible the BigQuery REST API caps pages at around 10 MB, unless individual rows are large enough that bigger responses are needed.

Even if we could use the BQ Storage Read API, it's not really possible to tune the message size with that either. A read session can configure the number of streams, but not the size of the messages in each stream.
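To illustrate (a sketch with placeholder project and table names): the Storage Read API lets you cap the number of streams via max_stream_count, but exposes no knob for the size of the messages within a stream.

from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()
session = client.create_read_session(
    parent="projects/my-project",  # placeholder project
    read_session=types.ReadSession(
        table="projects/my-project/datasets/d/tables/t",  # placeholder table
        data_format=types.DataFormat.ARROW,
    ),
    max_stream_count=1,  # the stream count is configurable...
)
# ...but the size of each message within a stream is chosen by the service
for message in client.read_rows(session.streams[0].name):
    print(message.row_count)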

tswast commented 1 month ago

Would it make sense for Ibis to do some client-side grouping to respect this parameter? I've viewed page size / chunk size as more of a tuning parameter, so it might be misleading to return chunks larger than what the backend can provide.
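As a sketch of what that client-side grouping could look like (a hypothetical helper, not part of Ibis), one could buffer the backend's batches and re-slice them with PyArrow:

import pyarrow as pa

def rebatch(batches, chunk_size):
    # Regroup a stream of RecordBatches into batches of exactly
    # chunk_size rows; only the final batch may be smaller.
    pending = []
    pending_rows = 0
    for batch in batches:
        pending.append(batch)
        pending_rows += batch.num_rows
        while pending_rows >= chunk_size:
            table = pa.Table.from_batches(pending)
            # combine_chunks() leaves one chunk per column, so
            # to_batches() returns a single RecordBatch here
            yield table.slice(0, chunk_size).combine_chunks().to_batches()[0]
            rest = table.slice(chunk_size)
            pending = rest.to_batches()
            pending_rows = rest.num_rows
    if pending_rows:
        yield pa.Table.from_batches(pending).combine_chunks().to_batches()[0]

# usage: for y in rebatch(t.to_pyarrow_batches(), 1_000_000): print(y.num_rows)

Note this holds up to one full chunk plus one backend batch in memory at a time, on top of whatever the backend itself buffers.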

cpcloud commented 2 weeks ago

It might be useful to do that, but it seems like it could be counterproductive to incur the overhead of re-batching (though I'm not sure how expensive that is).