Closed: themantalope closed this issue 1 year ago
This error is actually due to `Message jina.DataRequestProto exceeds maximum protobuf size of 2GB`, so it only occurs with certain PDF files. I get this error message with `protobuf==3.20.3`. Maybe this error is somehow hidden by `grpcio` with the latest version of `protobuf`.
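To see why a single PDF can trip that limit, here is a rough back-of-the-envelope sketch (illustrative, not from the thread; the page shape matches the one reported later in this discussion, and 2 GB is protobuf's hard serialization cap):

```python
# Back-of-the-envelope: each extracted page image has shape (4350, 3300, 3)
# with uint8 pixels, so only a few dozen raw pages fit under protobuf's
# hard 2 GB per-message serialization cap.
PAGE_BYTES = 4350 * 3300 * 3      # raw bytes for one page tensor (~41 MiB)
PROTOBUF_LIMIT = 2**31            # protobuf refuses messages of 2 GiB or more

max_pages = PROTOBUF_LIMIT // PAGE_BYTES
print(f'{PAGE_BYTES / 2**20:.0f} MiB per page, '
      f'at most {max_pages} pages per message')
```

So a PDF with more than a few dozen pages of full-resolution images cannot fit in one request, regardless of gRPC settings.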
For example, if you change

```python
doc.chunks.extend([Document(tensor=img, mime_type='image/*') for img in images])
```

to

```python
doc.chunks.extend([Document(tensor=img, mime_type='image/*') for img in images[:1]])
```

and similarly change the text to

```python
doc.chunks.extend([Document(text=t, mime_type='text/plain') for t in texts[:1]])
```

you will notice the error is gone.
@AnneYang720 thank you for your help.
Yes, I also downgraded the installed protobuf to 3.18 and got an error regarding the message size exceeding 2GB. Setting the gRPC server message size settings to greater than 2GB in the Flow setup also throws an error.
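For reference, gRPC's own per-message cap (4 MB by default) can be lifted with channel/server options, but protobuf's 2 GB serialization limit is hard-coded, which matches the failure described above. A minimal plain-grpcio sketch of the options involved (the option names are standard grpcio, not Jina-specific):

```python
# Standard grpcio channel/server options that lift gRPC's own message cap.
# -1 means "unlimited" at the gRPC layer; protobuf itself still refuses to
# serialize any single message of 2 GB or more, so this alone cannot work
# around the DataRequestProto error above.
GRPC_OPTIONS = [
    ('grpc.max_send_message_length', -1),
    ('grpc.max_receive_message_length', -1),
]

# Typical usage (sketch):
#   channel = grpc.insecure_channel('localhost:51000', options=GRPC_OPTIONS)
#   server = grpc.server(thread_pool, options=GRPC_OPTIONS)
```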
The issue with your suggested approach is that it keeps only the data associated with the first page of the document. For example, the output of `print(len(texts[0]), sum([len(t) for t in texts]))` for the document in question is `2538 47318`, so I'd just be removing data.
Is there a guide for handling large files with Jina? This seems like it could be an issue, especially for videos, images, and large PDF files. Or is this happening because all the data is loaded into memory as a `DocumentArray`?
For example, in the example-video-search-app I see that documents are indexed one at a time. Does that affect message size?
My example was just another attempt to verify that the error is caused by the size limit.
After extracting the images from the PDF file, the images, with shape `(4350, 3300, 3)`, are much larger than the file itself. You can simply store them in your executor and change the return to `return DocumentArray()` or `return None` in `def craft`. By default, it returns the original `docs` object (see the doc).
A more general approach we suggest is to extract the images first and store them elsewhere (such as on the local file system). Instead of

```python
doc.chunks.extend([Document(tensor=img, mime_type='image/*') for img in images])
```

you can do

```python
doc.chunks.extend([Document(uri=f'data/image_{i}.png', mime_type='image/*') for i, img in enumerate(images)])
```

and call `load_uri_to_image_tensor()` when you need to work with the images.
@AnneYang720 thank you for your input.
Describe the bug

Using the `PDFSegmenter`, a gateway runtime error occurs. I can confirm that the issue is not with the text or image extraction of the `PDFSegmenter`, as that runs without error. After the code runs, it seems to hang for about 10 additional seconds, and then I get an error message.

Additional details:

`test.py`:

`pdf_segmenter.py`:

Here is the output when running `JINA_LOG_LEVEL=DEBUG python test.py`:

I also note that this does not occur with every PDF. The PDF which is causing problems is attached: rg.25si055505.pdf
Describe how you solve it
Environment