pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

'Quad' object has no attribute 'tl' #90

Closed IronK77 closed 2 weeks ago

IronK77 commented 1 month ago

Hi.

I am trying to use pymupdf4llm on Google Colab. The error 'Quad' object has no attribute 'tl' occurs on one specific PDF, although the same file converts without problems offline. I am not sure which part of the code is responsible. It may be related to the problem reported for is_significant, and possibly it is the same issue.

I checked the versions: 0.0.6 through 0.0.9 work, and the error appears in the recent 0.0.10.


AttributeError                            Traceback (most recent call last)
in <cell line: 5>()
      3 file_name=path.split("/")[-1]
      4 path_full="/content/drive//MyDrive/eval/esg/"+file_name
----> 5 md_text = pymupdf4llm.to_markdown(path_full,write_images=False)
      6 path_o="/content/drive//MyDrive/eval/esg-md/"+regex.sub("pdf","md",file_name)
      7 pathlib.Path(path_o).write_bytes(md_text.encode())

2 frames
/usr/local/lib/python3.10/dist-packages/pymupdf4llm/helpers/pymupdf_rag.py in is_significant(box, paths)
    204 points.extend([r.tl, r.bl, r.br, r.tr, r.tl])
    205 else: # clockwise: area counts as negative
--> 206 points.extend([r.tl, r.tr, r.br, r.bl, r.tl])
    207 area = poly_area(points) # compute area of polygon
    208 if area < box_area: # less than threshold: graphic is significant
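Editorial note: the traceback shows `is_significant` reading `r.tl` from an object that turned out to be a `Quad`. In PyMuPDF, `Rect` names its corners `tl`, `tr`, `bl`, `br`, whereas `Quad` names them `ul`, `ur`, `ll`, `lr`, which is consistent with the error message. The sketch below illustrates that naming difference with a hypothetical helper `corners` and plain stand-in objects. It is not the change shipped in v0.0.11, which this thread does not describe.

```python
from types import SimpleNamespace


def corners(shape):
    """Return (top-left, top-right, bottom-right, bottom-left) corners.

    Hypothetical helper: accepts an object with Rect-style corner names
    (tl, tr, br, bl) or Quad-style corner names (ul, ur, lr, ll).
    """
    if hasattr(shape, "tl"):  # Rect-style naming
        return shape.tl, shape.tr, shape.br, shape.bl
    # Quad-style naming
    return shape.ul, shape.ur, shape.lr, shape.ll


# Stand-ins for illustration only; real code would pass pymupdf.Rect / pymupdf.Quad.
rect_like = SimpleNamespace(tl=(0, 0), tr=(2, 0), br=(2, 1), bl=(0, 1))
quad_like = SimpleNamespace(ul=(0, 0), ur=(2, 0), lr=(2, 1), ll=(0, 1))

print(corners(rect_like))
print(corners(quad_like))
```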

Stephen-S-H commented 1 month ago

I'm having the same issue on the occasional PDF. I'm doing a bit of testing and will let you know what I find.

Stephen-S-H commented 1 month ago

The same issue is here: https://github.com/pymupdf/RAG/issues/88

Buckler89 commented 1 month ago

I'm also having the same problem, but only with some PDFs. I haven't figured out the difference between a working PDF and one that doesn't work yet.

itsomar278 commented 2 weeks ago

Facing the same issue

JorjMcKie commented 2 weeks ago

Fixed in v0.0.11.

itsomar278 commented 1 week ago

The same PDF file (40 MB) that triggered this exception now keeps running for hours without extracting any text. I can provide the file if needed.

JorjMcKie commented 1 week ago

@itsomar278 - A number of reasons could lead to this behavior. We can certainly look at the file itself. But as a first action, you can use PyMuPDF to track down on which pages of that file the most time is spent:

import pymupdf4llm
import pymupdf
doc = pymupdf.open("input.pdf")
md = ""  # store markdown result here
for page in doc:
    md += pymupdf4llm.to_markdown(doc,
            pages=[page.number],  # deal with one page at a time
            hdr_info=False,  # ignore header tagging for now
            graphics_limit=2000,  # ignore pages with too many vector graphics
            )
    print(f"processed {page.number=}")  # show how far we are

# process created Markdown string
...

This can help you find out on which page a lot of time is spent. There is also the parameter graphics_limit, which lets you deal with timing problems caused by too many vector graphics. A typical reasonable limit is 5000 or even 2000 as shown above.
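Editorial note: the loop above prints progress but does not measure anything. A minimal sketch of how per-page timing could be recorded is shown below. The names `time_pages` and `slowest` are hypothetical helpers, not part of PyMuPDF or pymupdf4llm, and the converter is passed in as a callable so the helpers do not depend on either library.

```python
import time
from typing import Callable, List, Tuple

# One row per page: (page number, elapsed seconds, characters produced)
Row = Tuple[int, float, int]


def time_pages(convert_page: Callable[[int], str], page_count: int) -> List[Row]:
    """Call convert_page(pno) for every page and record how long each call took."""
    rows = []
    for pno in range(page_count):
        start = time.perf_counter()
        text = convert_page(pno)
        elapsed = time.perf_counter() - start
        rows.append((pno, elapsed, len(text)))
    return rows


def slowest(rows: List[Row], top: int = 5) -> List[Row]:
    """Return the `top` rows with the largest elapsed time, slowest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]
```

With the snippet from this comment, `convert_page` could be `lambda pno: pymupdf4llm.to_markdown(doc, pages=[pno], hdr_info=False, graphics_limit=2000)` and `page_count` could be `doc.page_count`. Pages that take long and return few characters are likely candidates for a closer look or a lower graphics_limit.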

itsomar278 commented 1 week ago

I implemented the get_pdf_text function and tested it successfully. The function processed all pages from 0 to 46, printing each page number as it was processed. Below is the code I used:

def get_pdf_text(pdf_docs):
    extracted = ""
    uploaded_docs_dir = Path("uploaded_docs")

    for pdf in pdf_docs:
        pdf_path = save_uploaded_file(pdf, uploaded_docs_dir)
        doc = pymupdf.open(pdf_path)
        for page in doc:
            oldLength = len(extracted)
            extracted += pymupdf4llm.to_markdown(doc,
                    pages=[page.number],  # deal with one page at a time
                    hdr_info=False,  # ignore header tagging for now
                    graphics_limit=2000,  # ignore pages with too many vector graphics
                    )
            if oldLength != len(extracted):
                print(f"processed {page.number=}")  # show how far we are

    return extracted

JorjMcKie commented 1 week ago

Glad it worked for you!

itsomar278 commented 1 week ago

I really appreciate your help! Thanks.