googleapis / python-bigquery

feat: add `job_retry` argument to `load_table_from_uri` #969

Open tswast opened 3 years ago

tswast commented 3 years ago

In internal issue 195911158, a customer is struggling to retry jobs that fail with "403 Exceeded rate limits: too many table update operations for this table". One can encounter this exception by running hundreds of load jobs in parallel against the same table (a sketch follows).
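A minimal sketch of that pattern (all project, bucket, and table names here are hypothetical), mirroring the multiprocessing approach in the stacktrace below:

```python
import multiprocessing

from google.cloud import bigquery

# Hypothetical destination table and source files, for illustration only.
TABLE_ID = "my-project.my_dataset.my_table"
URIS = [f"gs://my-bucket/part-{i:04d}.csv" for i in range(500)]

def load_data(uri):
    # Each worker process needs its own client; clients are not picklable.
    client = bigquery.Client()
    load_job = client.load_table_from_uri(uri, TABLE_ID)
    return load_job.result()  # The 403 surfaces here, while polling the job.

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        pool.map(load_data, URIS)
```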

Thoughts:

  1. Try to reproduce. Does the exception happen at result() or at load_table_from_uri()? If result(), continue with job_retry; otherwise, see if we can modify the default retry predicate for load_table_from_uri() to recognize this rate-limiting reason and retry (see the predicate sketch after this list).
  2. Assuming the exception does happen at result(), modify load jobs (or, more likely, the base class) to retry when job_retry is set, similar to what we already do for query jobs.
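For option 1, a minimal sketch of what such a predicate could look like, assuming the structured errors carry the rateLimitExceeded reason for this quota (the helper name is illustrative, not the library's final API):

```python
from google.api_core import exceptions
from google.api_core.retry import Retry

def _is_rate_limited(exc):
    # Both quota blips and genuine permission failures arrive as 403
    # Forbidden; the structured "reason" field tells them apart.
    return isinstance(exc, exceptions.Forbidden) and any(
        error.get("reason") == "rateLimitExceeded"
        for error in (exc.errors or [])
    )

# A job_retry in the spirit of the one query() already accepts.
job_retry = Retry(
    predicate=_is_rate_limited,
    initial=1.0,      # first delay, in seconds
    maximum=60.0,     # cap on any single delay
    multiplier=2.0,   # exponential growth factor
    deadline=600.0,   # total time budget before giving up
)
```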

tswast commented 3 years ago

Here's a stacktrace from a Googler who tried to reproduce this on their own project.

---------------------------------------------------------------------------

RemoteTraceback                           Traceback (most recent call last)

RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-75-e46c7b68e71a>", line 12, in load_data
    job = load_job.result()  # Waits for the job to complete.
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/bigquery/job/base.py", line 679, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google/api_core/future/polling.py", line 134, in result
    raise self._exception
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too many table update operations for this table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas
"""

The above exception was the direct cause of the following exception:

Forbidden                                 Traceback (most recent call last)

<ipython-input-77-bef363ce70e2> in <module>
      1 with multiprocessing.Pool() as pool:
----> 2     pool.map(load_data, args)

/opt/conda/lib/python3.7/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    269 
    270     def starmap(self, func, iterable, chunksize=None):

/opt/conda/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

Forbidden: 403 Exceeded rate limits: too many table update operations for this table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

Indeed, the exception is raised from result(). It would be nice to see the structured error data to help with our retry predicate, though.
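For reference, the structured error data is available on the exception's errors attribute; a quick way to inspect it (load_job as in the snippet above):

```python
from google.api_core.exceptions import Forbidden

try:
    load_job.result()
except Forbidden as exc:
    # exc.errors is the list of structured errors returned by the API,
    # e.g. [{'reason': 'rateLimitExceeded', 'message': '...'}].
    for error in exc.errors or []:
        print(error.get("reason"), "-", error.get("message"))
    raise
```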

urwa commented 2 years ago

I'm having this exact problem in a Cloud Function that is triggered when data is uploaded to a Cloud Storage bucket. Having a job_retry argument on load_table_from_uri will definitely be very useful.

Right now I'm considering the Cloud Functions retry option, but I plan to add monitoring on top of the function and want to keep the logs clean even when a retry eventually succeeds.

So for now I'm implementing exponential backoff when the exception is raised; a sketch follows.
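A sketch of that workaround: wrapping the whole submit-and-wait in google.api_core.retry.Retry, so the job is re-issued on each attempt (calling result() again on a job that already failed would just re-raise the same error). Retrying on every Forbidden is broader than the reason-based predicate sketched earlier; all names below are illustrative:

```python
from google.api_core import exceptions
from google.api_core.retry import Retry, if_exception_type
from google.cloud import bigquery

@Retry(
    predicate=if_exception_type(exceptions.Forbidden),
    # Exponential backoff: 1s, 2s, 4s, ... capped at 64s per delay,
    # giving up after 10 minutes total.
    initial=1.0,
    maximum=64.0,
    multiplier=2.0,
    deadline=600.0,
)
def load_with_backoff(client, uri, table_id):
    # Re-issue the load job on every attempt; a failed job cannot be
    # restarted, so re-polling it would not help.
    job = client.load_table_from_uri(uri, table_id)
    return job.result()

client = bigquery.Client()
load_with_backoff(client, "gs://my-bucket/file.csv", "my-project.my_dataset.my_table")
```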