kien-truong opened this issue 1 month ago
Using the default setting, in the worst-case scenario, for n download streams, we would have to store 2n result pages in memory:
- 1 result page inside each download thread, times n threads
- n result pages in the transfer queue between the download threads and the main thread
I am putting together a PR for this.
It is not as simple as just adding a max_stream_count argument, because historically the preserve_order argument has been used to make some choices regarding the number of streams to use. We have some logic to allow those two to interact in a way that makes sense and hopefully doesn't break backward compatibility. We are also adding a docstring to explain the method as a whole (this feels necessary since we have a new argument that may interact with another argument).
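To make the intended interaction concrete, here is a rough sketch of one plausible way the two arguments could be reconciled. The helper name and exact rules are illustrative only, not the final implementation:

```python
from typing import Optional


def determine_requested_streams(preserve_order: bool, max_stream_count: Optional[int]) -> int:
    """Sketch: pick the stream count to request from the BQ Storage API.

    preserve_order requires reading from a single stream, so it takes
    precedence over any user-supplied cap. Returning 0 keeps the historical
    behavior of letting the server recommend the number of streams.
    """
    if preserve_order:
        # Ordered results can only be read correctly from one stream.
        return 1
    if max_stream_count is not None:
        if max_stream_count < 1:
            raise ValueError("max_stream_count must be a positive integer")
        return max_stream_count
    # No cap requested: defer to the server's recommendation (0 = server decides).
    return 0
```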
More to come, soon.
Right, you can't use more than one stream if you need to preserve row order when using ORDER BY; the concurrency would screw up the output.
Alright. Got a draft PR. Still working on the unit tests. But I need to focus on something else right now. Will come back to this with a clear head.
@kien-truong let me know if the updates meet the need.
The code is available in main and will be included in the next release. (I don't have a date yet for the next release.)
Thanks @chalmerlowe, can you also add the max_stream_count argument to the two methods to_dataframe_iterable and to_arrow_iterable?
Those two methods are where this is most useful for supporting incremental fetching use cases.
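For context, the usage this would enable might look roughly like the snippet below. The max_stream_count keyword on to_dataframe_iterable is exactly what is being requested here, so treat it as hypothetical rather than the current API; the table name and process() call are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `my-project.my_dataset.large_table`").result()

# Hypothetical: cap the number of BQ Storage read streams so only a bounded
# number of result pages are buffered while the caller consumes them lazily.
for frame in rows.to_dataframe_iterable(max_stream_count=2):
    process(frame)  # placeholder for the caller's own incremental processing
```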
I can give it a try, but I won't be able to tackle it today. I have a few other things on my plate. Will keep you posted.
The original changes for this issue have been merged into python-bigquery version 3.27.0. Not sure when 3.27.0 will make it to PyPI, but as of this moment, it is available on GitHub.
I have not had the bandwidth to take on the additional requests here.
The original changes for this issue have been released to PyPI as 3.27.0. Will see about adding the other changes when my schedule opens up.
Cheers, I've already created the PR for that here: #2051
Currently, for APIs that can use the BQ Storage Client to fetch data, like to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server:
https://github.com/googleapis/python-bigquery/blob/ef8e92787941ed23b9b2b5ce7c956bcb3754b995/google/cloud/bigquery/_pandas_helpers.py#L840
https://github.com/googleapis/python-bigquery/blob/ef8e92787941ed23b9b2b5ce7c956bcb3754b995/google/cloud/bigquery/_pandas_helpers.py#L854-L858
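As a simplified illustration of what those links show (this is not the library's exact code): when no cap is applied, the read session is created with max_stream_count=0, which asks the server to pick the stream count itself. The project and table names below are placeholders.

```python
from google.cloud import bigquery_storage

client = bigquery_storage.BigQueryReadClient()
session = client.create_read_session(
    parent="projects/my-project",  # placeholder project
    read_session=bigquery_storage.types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/my_table",
        data_format=bigquery_storage.types.DataFormat.ARROW,
    ),
    max_stream_count=0,  # 0 = let the server decide; a positive value caps it
)
print(f"server opened {len(session.streams)} streams")
```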
This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.
The BQ Storage Client API also suggests capping max_stream_count when resources are constrained:
https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#createreadsessionrequest
This problem has been encountered by others before and can be worked around by monkey-patching create_read_session on the BQ Storage client object: https://github.com/googleapis/python-bigquery/issues/1292
However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
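For anyone hitting this before the new argument is available everywhere, a minimal sketch of that monkey-patch workaround might look like the following. The cap value, query, and table name are placeholders, and the patch wraps the client instance's method rather than changing the library itself.

```python
from google.cloud import bigquery, bigquery_storage

STREAM_CAP = 4  # illustrative cap; tune to your memory budget

bqstorage_client = bigquery_storage.BigQueryReadClient()
_original_create_read_session = bqstorage_client.create_read_session


def _capped_create_read_session(*args, **kwargs):
    # max_stream_count == 0 (or unset) lets the server pick the stream count,
    # which is what leads to hundreds of streams on large result sets.
    requested = kwargs.get("max_stream_count") or 0
    kwargs["max_stream_count"] = STREAM_CAP if requested == 0 else min(requested, STREAM_CAP)
    return _original_create_read_session(*args, **kwargs)


# Shadow the bound method on this client instance only.
bqstorage_client.create_read_session = _capped_create_read_session

client = bigquery.Client()
rows = client.query("SELECT * FROM `my-project.my_dataset.large_table`").result()
for frame in rows.to_dataframe_iterable(bqstorage_client=bqstorage_client):
    ...  # consume one DataFrame at a time with a bounded number of streams
```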