Open adonoho opened 11 months ago
Hi @adonoho, thank you for reporting this issue. I tried, but haven't been able to reproduce it. I suspect this has something to do with your network being unstable. Maybe you can add a timeout to the df.to_gbq() call and retry if it stalls? Also, you said "Run lots of jobs, write to GBQ 100,000+ times", but I hit a quota error after just 1,500 insertions. How did you bypass this?
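A minimal sketch of the retry-on-failure idea, assuming the call is wrapped in a plain callable (the `retry_with_backoff` helper is hypothetical, not part of the pandas-gbq API):

```python
import time


def retry_with_backoff(fn, max_attempts=3, base_delay_s=2.0):
    """Call fn(); on any exception, retry with exponential backoff.

    `fn` stands in for a zero-argument wrapper around df.to_gbq().
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller see the error
            time.sleep(base_delay_s * 2 ** (attempt - 1))


# Hypothetical usage (table and project names are placeholders):
# retry_with_backoff(
#     lambda: df.to_gbq("dataset.table", project_id="my-project",
#                       if_exists="append")
# )
```

This only helps when the failure surfaces as an exception; a silent stall still needs a timeout around the call, as discussed below.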
@Linchin This is a program that collects values from a compute cluster. Each function returns the single row of a data frame. They are concatenated and then written to GBQ. I've had jobs create 5M rows in 4K row chunks, i.e. every minute or 4k rows whichever occurs sooner. I will explore the timeout function to retry. I am happy to instrument my code however you might wish to help find this problem.
BTW, df.to_gbq() doesn't appear to support a timeout parameter. (I am new to Google APIs; please forgive me if it is documented somewhere non-obvious to me.)
Indeed, df.to_gbq() doesn't have a timeout option. I'm thinking more of using Python to do it, such as the examples here.
Presumably, the underlying Google API calls support timeouts? Would a better answer be to surface the timeout-related exceptions? (I followed the link you mentioned and, because the authors say that approach doesn't play well with threads, I will rule it out. FTR, this is a Dask app that gathers data via Tornado and presents it to single-threaded __main__ code.) I am really quite happy to implement timeout-catching code instead of making the loop potentially unstable.
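For the record, a thread-based timeout avoids the signal-handler limitation, since it makes no assumptions about the main thread. A minimal sketch (the `call_with_timeout` helper is hypothetical):

```python
import concurrent.futures


def call_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread, giving up after timeout_s seconds.

    Unlike signal-based alarms, this is usable from code already running
    under an event loop (e.g. a Dask/Tornado driver). Caveat: a timed-out
    worker thread cannot be killed; it keeps running in the background,
    so this only bounds how long the caller waits.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        # Raises concurrent.futures.TimeoutError if fn() stalls.
        return future.result(timeout=timeout_s)
    finally:
        # Don't block on a stalled worker; unblock the caller so it can retry.
        pool.shutdown(wait=False)


# Hypothetical usage:
# call_with_timeout(lambda: df.to_gbq("dataset.table",
#                                     project_id="my-project",
#                                     if_exists="append"),
#                   timeout_s=600)
```

Note that the deliberately avoided `with` statement matters here: `ThreadPoolExecutor.__exit__` waits for workers to finish, which would defeat the timeout.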
From the above trace, I found the following interesting # TODO in load_chunks() at line 252:
    if api_method == "load_parquet":
        load_parquet(
            client,
            dataframe,
            destination_table_ref,
            write_disposition,
            location,
            schema,
            billing_project=billing_project,
        )
        # TODO: yield progress depending on result() with timeout
        return [0]
Clearly, the new load_parquet() path is not yet complete. What can I do to help fix this code? (Bear in mind that, due to inexperience with Google APIs, I am uncertain how the maintenance team manages timeout issues in pandas_gbq.)
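For what it's worth, the underlying google-cloud-bigquery job object does accept a timeout: `LoadJob.result(timeout=...)` raises `concurrent.futures.TimeoutError` if the job hasn't finished in time. One way the TODO could be filled in (a sketch under that assumption, not the maintainers' plan) is a bounded wait with a few retries of the wait itself:

```python
import concurrent.futures


def wait_for_job(job, timeout_s=600.0, max_attempts=3):
    """Wait for a BigQuery job, retrying the bounded wait a few times.

    `job` is anything exposing .result(timeout=...), e.g. the LoadJob
    returned by client.load_table_from_dataframe(). How this would be
    wired into pandas-gbq's load_parquet() is hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            if attempt == max_attempts:
                raise  # give up; surface the timeout to the caller
            # The job keeps running server-side; we just wait again.
```

Retrying the wait (rather than resubmitting the load) is safer, since resubmission could duplicate rows if the first job eventually succeeds.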
Environment details

python --version: Python 3.10.12
pip --version: pip 23.2.1 from /Users/awd/mambaforge/envs/AMPMatrixRecovery/lib/python3.10/site-packages/pip (python 3.10)
pip show pandas-gbq:
    Name: pandas-gbq
    Version: 0.19.2
    Summary: Google BigQuery connector for pandas
    Home-page: https://github.com/googleapis/python-bigquery-pandas
    Author: pandas-gbq authors
    Author-email: googleapis-packages@google.com
    License: BSD-3-Clause
    Location: /Users/awd/mambaforge/envs/AMPMatrixRecovery/lib/python3.10/site-packages
    Requires: db-dtypes, google-api-core, google-auth, google-auth-oauthlib, google-cloud-bigquery, google-cloud-bigquery-storage, numpy, pandas, pyarrow, pydata-google-auth, setuptools
    Required-by: EMS

Steps to reproduce
Code example

The DB is already set up in this method and the credentials are not None. The stall happens in the df.to_gbq() call. No exception is thrown to be caught.

Stack trace
Making sure to follow these steps will guarantee the quickest resolution possible.
Thanks!