googleapis / python-bigquery

Massive downloads (1B+ rows) cause read errors #1252

Open · jlynchMicron opened this issue 2 years ago

jlynchMicron commented 2 years ago

After troubleshooting for a long while to figure out why my pandas read_gbq() query (with bqstorage_api enabled) would see its download throughput drop like a rock early in the read, I think I eventually found my answer.

Looking at the GCP API monitor, I saw that my requests would eventually error out with a 499 response (client error).

After all my debugging, I found that this function was returning 1000 read streams/threads to download: https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/_pandas_helpers.py#L838

I believe that for massive query results, with max_stream_count (requested_streams) set to 0, the BQ server returns its maximum of 1000 streams to use. This most likely overwhelms the system and causes some of the threads to die from connection timeouts or something similar. I found that when I forced the stream count to something much more reasonable, like 48, my download worked fine.
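For reference, a rough sketch of that kind of workaround, creating the read session directly through the BigQuery Storage API with a capped max_stream_count (the project/table names and the cap of 48 are placeholders, not what the library does internally):

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Ask the server for at most 48 streams instead of letting it pick (up to 1000).
session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table="projects/my-project/datasets/my_dataset/tables/my_results_table",
        data_format=types.DataFormat.ARROW,
    ),
    max_stream_count=48,
)

print(f"server returned {len(session.streams)} streams")
for stream in session.streams:
    reader = client.read_rows(stream.name)
    for page in reader.rows(session).pages:
        ...  # consume page.to_dataframe() / page.to_arrow() here
```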

Environment details

tswast commented 2 years ago

I wonder if removing the default pool size would be a sufficient fix?

https://github.com/googleapis/python-bigquery/blob/e760d1bcb76561b4247adde2fd06ae0b686befb9/google/cloud/bigquery/_pandas_helpers.py#L873

Alternatively (preferably?) we could set requested_streams to max(requested_streams, some multiple of the number of available cores).

https://github.com/googleapis/python-bigquery/blob/e760d1bcb76561b4247adde2fd06ae0b686befb9/google/cloud/bigquery/_pandas_helpers.py#L824
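Very roughly, one literal reading of that suggestion (the helper name and the multiplier are illustrative only, not a concrete proposal):

```python
import os


def _bounded_stream_count(requested_streams: int, multiplier: int = 2) -> int:
    """Tie the stream count to local parallelism instead of passing 0.

    Passing 0 to the server currently means "server decides", which can come
    back as 1000 streams; this would substitute a small multiple of the
    available cores instead.
    """
    return max(requested_streams, multiplier * (os.cpu_count() or 1))
```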

shollyman commented 1 year ago

The other option here is something like a bag of tasks, where we bound concurrent work via a semaphore but still allow for a large number of streams. It requires more concurrency coordination, which admittedly isn't Python's strong suit, but it would prevent overwhelming the client.
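A minimal sketch of that pattern (the concurrency limit and the download helper are placeholders, not library code): accept every stream the server hands back, but gate how many are actively read at once with a semaphore.

```python
import concurrent.futures
import threading

MAX_ACTIVE_DOWNLOADS = 16  # placeholder bound on concurrent stream reads
_active = threading.BoundedSemaphore(MAX_ACTIVE_DOWNLOADS)


def _download_one(stream_name: str):
    with _active:  # only MAX_ACTIVE_DOWNLOADS streams are read at any moment
        ...  # placeholder: pull this stream's rows and return them


def download_all(stream_names):
    # Submit every stream up front ("bag of tasks"); the semaphore, not the
    # number of submitted tasks, bounds the concurrent work.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(_download_one, name) for name in stream_names]
        return [future.result() for future in futures]
```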