Open brent-ridian opened 1 month ago
Assigning to @daniel-sanche who has more knowledge about the Firestore SDK.
A possible clue from a quick Google search: this StackOverflow thread suggests the error may be caused by a large number of docs that are still being updated. I wonder if this has any similarity to your case.
Linchin: thanks for the update.
My guess is that your SO thread does not apply to me: I only see the error reported here when I try to read every document in my largest collection, and the vast majority of the documents in that collection are not being updated. At most about 200 could theoretically be, and in practice I bet the number is much lower.
Google: are you here?
Had the same issue, this worked for me:

```python
from tqdm import tqdm  # progress bars; not essential to the workaround

def update_collection_paginated(page_size=20000):
    query = db_prod.collection('someCollection').where(filter=SomeFilter)
    # Get the first page
    docs = query.limit(page_size).stream()
    batches = [db_prod.batch()]
    current_batch = 0
    operation_count = 0
    last_doc = None
    while True:
        doc_count = 0
        for doc in tqdm(docs, total=page_size):
            doc_count += 1
            last_doc = doc
            doc_dict = doc.to_dict()
            try:
                pass  # some update logic here
            except KeyError:
                pass
            batches[current_batch].set(doc.reference, doc_dict)
            operation_count += 1
            # A batch holds at most 500 operations, so when the count
            # reaches 500, move to the next batch
            if operation_count == 500:
                operation_count = 0
                current_batch += 1
                batches.append(db_prod.batch())
        # If we've processed fewer documents than the page size, we're done
        if doc_count < page_size:
            break
        # Construct a new query starting after the last document
        docs = query.start_after(last_doc).limit(page_size).stream()
    return batches

# Execute the paginated update
batches = update_collection_paginated()
for batch in tqdm(batches):
    batch.commit()
```
Environment details
google-cloud-firestore version: 2.16.1

My guess is that the bug I report below is independent of OS, Python version, and pip version, but may well depend on the google-cloud-firestore version.
Steps to reproduce
Non-trivial. As described in "The Bug" section below, the bug only manifests when I try to read every document in my company's biggest collection; none of our other collections trigger it. If someone from Google reaches out to me, I would be glad to share precise details of that collection.
Code example
Included in "The Bug" section below
Stack trace
Included in "The Bug" section below
The Bug
I am seeing the exact same issue as was reported in this StackOverflow link.
In that thread, the Google engineer dconeybe recommended that I file a bug report in this GitHub project, since he suspects that the bug is in the Firestore Python client library.
So the text below is my adaptation of that StackOverflow post:
I have a Python function that takes a Cloud Firestore collection name as an arg and streams through every document in that collection to check for errors.
In simplified form, it essentially looks something like this:
The key point here is that I create a stream for the collection and use that to read and process every document in the collection.
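To make that shape concrete, here is a minimal sketch of such a function. The collection name, the placeholder validation check, and the `db` argument (assumed to be a `google.cloud.firestore.Client`) are all illustrative assumptions, not the actual code from the original post:

```python
def check_collection(db, collection_name):
    """Stream every document in the collection and count validation errors."""
    error_count = 0
    # db.collection(...).stream() lazily iterates over every document snapshot
    for doc in db.collection(collection_name).stream():
        data = doc.to_dict()
        # Placeholder check: the real validation logic is not in the post
        if "required_field" not in data:
            error_count += 1
    return error_count
```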
The code always works perfectly on all but one of my collections.
Unfortunately, on the biggest collection, I not infrequently get an error like this:
This is super frustrating. I can't believe that Firestore cannot create and maintain a rock-solid database stream!
The error message offers this very useful sounding advice:
Please try either limiting the entities scanned, or run with an updated index configuration.
The problem is that I have no idea how to act on either suggestion:

- *Limiting the entities scanned*: how can I possibly do that? I need to process every document in the collection! Furthermore, I really hope that Google is not expecting Firestore users to somehow manually break up large reads.
- *Run with an updated index configuration*: I have no idea what index would solve this problem. In other Firestore contexts, I have seen Google's error message very helpfully give you the exact index command you should execute to solve the problem. Unfortunately, here Google tells me nothing.
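The closest thing I can imagine to "limiting the entities scanned" without skipping documents is cursor pagination, much like the workaround posted above: page through the collection so that each individual query stays small while still visiting every document. A rough sketch, where the helper name and page size are mine and `db` is assumed to be a `google.cloud.firestore.Client`:

```python
def stream_in_pages(db, collection_name, page_size=1000):
    """Yield every document, but in bounded pages so no single query scans too much."""
    # Explicit ordering by document ID keeps the start_after cursor stable
    base = db.collection(collection_name).order_by("__name__")
    last_doc = None
    while True:
        query = base.limit(page_size)
        if last_doc is not None:
            # Resume the next page just past the last document we saw
            query = query.start_after(last_doc)
        count = 0
        for doc in query.stream():
            count += 1
            last_doc = doc
            yield doc
        # A short page means the collection is exhausted
        if count < page_size:
            break
```

Whether this actually avoids the error, I can't say; it just bounds how much any one query has to scan.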