googleapis / python-firestore

grpc._channel._MultiThreadedRendezvous fails; StatusCode.UNAVAILABLE; Query timed out #939

Open brent-ridian opened 1 month ago

brent-ridian commented 1 month ago

Environment details

My guess is that the bug reported below is independent of OS, Python version, and pip version, but may well depend on the firestore client library version.

Steps to reproduce

Non-trivial. As described in "The Bug" section below, the bug only manifests when I try to read every document in my company's biggest collection; none of our other collections triggers it. If someone from Google reaches out to me, I would be glad to share precise details of that collection.

Code example

Included in "The Bug" section below

Stack trace

Included in "The Bug" section below

The Bug

I am seeing exactly the same issue that was reported in this Stack Overflow question.

In that thread, the Google engineer dconeybe recommended that I file a bug report in this GitHub project, since he suspects that the bug is in the Firestore Python client library.

So, the text below is my adaptation of that Stack Overflow post:

I have a Python function that takes a Cloud Firestore collection name as an arg and streams through every document in that collection to check for errors.

In simplified form, it essentially looks something like this:

from firebase_admin import firestore

def find_all_issues(collection: str) -> None:
    # assumes firebase_admin.initialize_app() has already been called
    fs_client = firestore.client()
    coll_ref = fs_client.collection(collection)
    doc_snapshot_stream = coll_ref.stream()
    for doc_snap in doc_snapshot_stream:
        d = doc_snap.to_dict()
        # perform error checks on d

The key point here is that I create a stream for the collection and use that to read and process every document in the collection.

The code always works perfectly on all but one of my collections.

Unfortunately, on the biggest collection, I fairly often get an error like this:

Traceback (most recent call last):
  File "...\virtualenvs\backend-venv\Lib\site-packages\google\api_core\grpc_helpers.py", line 116, in __next__
    return next(self._wrapped)
           ^^^^^^^^^^^^^^^^^^^
  File "...\virtualenvs\backend-venv\Lib\site-packages\grpc\_channel.py", line 543, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "...\virtualenvs\backend-venv\Lib\site-packages\grpc\_channel.py", line 969, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration."
    debug_error_string = "UNKNOWN:Error received from peer ipv4:142.250.9.95:443 {created_time:"2024-07-18T15:21:04.3215985+00:00", grpc_status:14, grpc_message:"Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration."}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "...\virtualenvs\backend-venv\Lib\site-packages\google\cloud\firestore_v1\query.py", line 361, in stream
    response = next(response_iterator, None)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\virtualenvs\backend-venv\Lib\site-packages\google\api_core\grpc_helpers.py", line 119, in __next__
    raise exceptions.from_grpc_error(exc) from exc
google.api_core.exceptions.ServiceUnavailable: 503 Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "...\firestore_data_investigation.py", line 258, in <module>
    check_collections(stop_if_issue = False)
  File "...\firestore_data_investigation.py", line 133, in check_collections
    num_docs, issues = all_issues(XXX.COLLECTION)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\XXX_integration_test.py", line 390, in all_issues
    for doc_snap in doc_snapshots:
  File "...\virtualenvs\backend-venv\Lib\site-packages\google\cloud\firestore_v1\query.py", line 363, in stream
    if self._retry_query_after_exception(exc, retry, transaction):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\virtualenvs\backend-venv\Lib\site-packages\google\cloud\firestore_v1\query.py", line 240, in _retry_query_after_exception
    retry = gapic_callable._retry
            ^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_UnaryStreamMultiCallable' object has no attribute '_retry'
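
Incidentally, judging from that last frame, the failure happens inside the library's own retry handling: query.py's _retry_query_after_exception assumes the gapic callable has a _retry attribute, and that lookup appears to be reached only when stream() is called without an explicit retry. So passing an explicit retry might sidestep the AttributeError; a minimal, untested sketch assuming google.api_core's standard retry machinery (the explicit_retry name is mine):

from google.api_core import exceptions as gexc
from google.api_core import retry as retries

# Untested: retry only on UNAVAILABLE (the 503 above); passing an explicit
# retry should bypass the default-retry lookup that raises the AttributeError.
explicit_retry = retries.Retry(
    predicate=retries.if_exception_type(gexc.ServiceUnavailable)
)
doc_snapshot_stream = coll_ref.stream(retry=explicit_retry)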

This is super frustrating. I can't believe that Firestore cannot create and maintain a rock-solid database stream!

The error message offers this very useful-sounding advice: Please try either limiting the entities scanned, or run with an updated index configuration.

The problem is that I have no idea how to execute on it:

  1. Limiting the entities scanned: how can I possibly do that? I need to process every document in the collection! Furthermore, I really hope that Google is not expecting Firestore users to somehow manually break up large reads (though see the pagination sketch at the end of this comment).
  2. Run with an updated index configuration: I have no idea what index would solve this problem. In other Firestore contexts, Google's error messages very helpfully give you the exact index-creation command to execute. Unfortunately, here Google tells me nothing.
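
For what it's worth, the only way I can see to "limit the entities scanned" per query is to paginate manually with limit() and start_after() cursors. A minimal sketch of that workaround (the page_size of 10,000 is an arbitrary guess at what the backend can scan before timing out):

def find_all_issues_paginated(collection: str, page_size: int = 10_000) -> None:
    # assumes firebase_admin.initialize_app() has already been called
    fs_client = firestore.client()
    query = fs_client.collection(collection).order_by("__name__")

    last_doc = None
    while True:
        page = query.limit(page_size)
        if last_doc is not None:
            page = page.start_after(last_doc)  # resume after the previous page

        doc_count = 0
        for doc_snap in page.stream():
            doc_count += 1
            last_doc = doc_snap
            d = doc_snap.to_dict()
            # perform error checks on d

        if doc_count < page_size:
            break  # final (partial) page reached
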
Linchin commented 1 month ago

Assigning to @daniel-sanche who has more knowledge about the Firestore SDK.

A possible clue from a quick Google search: this Stack Overflow thread suggests the error may be caused by a large number of docs that are still being updated. I wonder if this has any similarity to your case.

brent-ridian commented 1 month ago

Linchin: thanks for the update.

My guess is that the Stack Overflow thread you linked does not apply to me: I only see the error reported here when I am trying to read every document in my largest collection, and the vast majority of the documents in that collection are not being updated. At most, maybe 200 of them theoretically could be, and in practice I bet far fewer are.

brent-ridian commented 1 month ago

Google: are you here?

ozayr commented 3 weeks ago

I had the same issue; this worked for me:

from tqdm import tqdm

# db_prod is an initialized firestore client; SomeFilter stands in for a real filter
def update_collection_paginated(page_size=20000):
    query = db_prod.collection('someCollection').where(filter=SomeFilter)

    # Get the first page
    docs = query.limit(page_size).stream()

    batches = [db_prod.batch()]
    current_batch = 0
    operation_count = 0

    last_doc = None

    while True:
        doc_count = 0
        for doc in tqdm(docs, total=page_size):
            doc_count += 1
            last_doc = doc

            doc_dict = doc.to_dict()
            try:
                pass  # some update logic here
            except KeyError:
                pass

            batches[current_batch].set(doc.reference, doc_dict)
            operation_count += 1

            # A batch tops out at 500 writes, so move to the next batch
            if operation_count == 500:
                operation_count = 0
                current_batch += 1
                batches.append(db_prod.batch())

        # If we've processed fewer documents than the page size, we're done
        if doc_count < page_size:
            break

        # Construct a new query starting after the last document
        docs = query.start_after(last_doc).limit(page_size).stream()

    return batches

# Execute the paginated update
batches = update_collection_paginated()
for batch in tqdm(batches):
    batch.commit()
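
Note that this buffers every write and commits only after the full scan, so nothing is written if the script dies partway through. If you want incremental progress instead, commit each batch as soon as it reaches 500 operations (Firestore's per-batch write limit) rather than collecting them all.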