kien-truong opened this issue 1 month ago
Using the default setting, in the worst-case scenario, for n download streams, we would have to store 2n result pages in memory:
- 1 result page inside each download thread, times n threads
- n result pages in the transfer queue between the download threads and the main thread
I am putting together a PR for this.
It is not as simple as just adding a max_stream_count argument, because historically the preserve_order argument has been used to make some choices regarding the number of streams to use. We have some logic to allow those two to interact in a way that makes sense and hopefully doesn't break backward compatibility. We are also adding a docstring to explain the method as a whole (this feels necessary since we have a new argument that may interact with another argument).
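To make the intended interaction concrete, here is a rough sketch of one plausible way the two arguments could be reconciled. The helper name and exact rules are illustrative only, not the final implementation:

```python
from typing import Optional


def determine_requested_streams(preserve_order: bool, max_stream_count: Optional[int]) -> int:
    """Sketch: pick the stream count to request from the BQ Storage API.

    preserve_order requires reading from a single stream, so it takes
    precedence over any user-supplied cap. Returning 0 keeps the historical
    behavior of letting the server recommend the number of streams.
    """
    if preserve_order:
        # Ordered results can only be read correctly from one stream.
        return 1
    if max_stream_count is not None:
        if max_stream_count < 1:
            raise ValueError("max_stream_count must be a positive integer")
        return max_stream_count
    # No cap requested: defer to the server's recommendation (0 = server decides).
    return 0
```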
More to come, soon.
Right, you can't use more than one stream if you need to preserve row order when using ORDER BY; the concurrency would screw up the output.
Alright. Got a draft PR. Still working on the unit tests. But I need to focus on something else right now. Will come back to this with a clear head.
@kien-truong let me know if the updates meet the need.
The code is available in main and will be included in the next release. (I don't have a date yet for the next release.)
Thanks @chalmerlowe, can you also add the max_stream_count argument to the two methods to_dataframe_iterable and to_arrow_iterable?
Those two methods are where this is most useful for supporting incremental fetching use cases.
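For context, the usage this would enable might look roughly like the snippet below. The max_stream_count keyword on to_dataframe_iterable is exactly what is being requested here, so treat it as hypothetical rather than the current API; the table name and process() call are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query("SELECT * FROM `my-project.my_dataset.large_table`").result()

# Hypothetical: cap the number of BQ Storage read streams so only a bounded
# number of result pages are buffered while the caller consumes them lazily.
for frame in rows.to_dataframe_iterable(max_stream_count=2):
    process(frame)  # placeholder for the caller's own incremental processing
```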
I can give it a try, but I won't be able to tackle it today. I have a few other things on my plate. Will keep you posted.
The original changes for this issue have been merged into python-bigquery version 3.27.0. Not sure when 3.27.0 will make it to PyPI, but as of this moment, it is available on GitHub.
I have not had the bandwidth to take on the additional requests here.
The original changes for this issue have been released to PyPI as 3.27.0. Will see about adding the other changes when my schedule opens up.
Cheers, I've already created the PR for that here: #2051
Currently, for APIs that can use the BQ Storage Client to fetch data, like to_dataframe_iterable or to_arrow_iterable, the client library always uses the maximum number of read streams recommended by the BQ server:
https://github.com/googleapis/python-bigquery/blob/ef8e92787941ed23b9b2b5ce7c956bcb3754b995/google/cloud/bigquery/_pandas_helpers.py#L840
https://github.com/googleapis/python-bigquery/blob/ef8e92787941ed23b9b2b5ce7c956bcb3754b995/google/cloud/bigquery/_pandas_helpers.py#L854-L858
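As a simplified illustration of what those links show (this is not the library's exact code): when no cap is applied, the read session is created with max_stream_count=0, which asks the server to pick the stream count itself. The project and table names below are placeholders.

```python
from google.cloud import bigquery_storage

client = bigquery_storage.BigQueryReadClient()
session = client.create_read_session(
    parent="projects/my-project",  # placeholder project
    read_session=bigquery_storage.types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/my_table",
        data_format=bigquery_storage.types.DataFormat.ARROW,
    ),
    max_stream_count=0,  # 0 = let the server decide; a positive value caps it
)
print(f"server opened {len(session.streams)} streams")
```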
This behavior has the advantage of maximizing throughput, but it can lead to out-of-memory issues when too many streams are opened and results are not read fast enough: we've encountered queries that open hundreds of streams and consume GBs of memory.
The BQ Storage Client API also suggests capping max_stream_count when resources are constrained:
https://cloud.google.com/bigquery/docs/reference/storage/rpc/google.cloud.bigquery.storage.v1#createreadsessionrequest
This problem has been encountered by others before and can be worked around by monkey-patching create_read_session on the BQ Storage client object: https://github.com/googleapis/python-bigquery/issues/1292
However, it should really be fixed by allowing the max_stream_count parameter to be set through the public API.
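For anyone hitting this before the new argument is available everywhere, a minimal sketch of that monkey-patch workaround might look like the following. The cap value, query, and table name are placeholders, and the patch wraps the client instance's method rather than changing the library itself.

```python
from google.cloud import bigquery, bigquery_storage

STREAM_CAP = 4  # illustrative cap; tune to your memory budget

bqstorage_client = bigquery_storage.BigQueryReadClient()
_original_create_read_session = bqstorage_client.create_read_session


def _capped_create_read_session(*args, **kwargs):
    # max_stream_count == 0 (or unset) lets the server pick the stream count,
    # which is what leads to hundreds of streams on large result sets.
    requested = kwargs.get("max_stream_count") or 0
    kwargs["max_stream_count"] = STREAM_CAP if requested == 0 else min(requested, STREAM_CAP)
    return _original_create_read_session(*args, **kwargs)


# Shadow the bound method on this client instance only.
bqstorage_client.create_read_session = _capped_create_read_session

client = bigquery.Client()
rows = client.query("SELECT * FROM `my-project.my_dataset.large_table`").result()
for frame in rows.to_dataframe_iterable(bqstorage_client=bqstorage_client):
    ...  # consume one DataFrame at a time with a bounded number of streams
```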