pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Fixed quad abbreviation #93

Closed: rca-umb closed this 2 months ago

rca-umb commented 2 months ago

The first element of a Quad path item should be "qu", not "q". Currently, processing a PDF that contains Quad objects fails with:

AttributeError: 'Quad' object has no attribute 'tl'. Did you mean: 'll'?

This happens because the Quad check `elif itm[0] == "q":` never evaluates to true, so the item falls through and `itm[1]` gets processed like a Rect. Changing the condition to `elif itm[0] == "qu":` makes Quads process correctly.
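The fall-through can be reproduced without PyMuPDF. The following is only a minimal sketch, not the actual pymupdf4llm code: `FakeRect`, `FakeQuad`, `item_points`, and the `quad_tag` parameter are hypothetical stand-ins invented for illustration. It assumes only what the traceback shows, namely that a Rect exposes `tl` while a Quad exposes `ul`, `ur`, `ll`, and `lr`.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a PyMuPDF Point: a plain (x, y) tuple.
Point = tuple[float, float]


@dataclass
class FakeRect:
    """Illustrative stand-in for a Rect: corners are reachable as tl / br."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def tl(self) -> Point:
        return (self.x0, self.y0)

    @property
    def br(self) -> Point:
        return (self.x1, self.y1)


@dataclass
class FakeQuad:
    """Illustrative stand-in for a Quad: corners are ul, ur, ll, lr (no tl)."""
    ul: Point
    ur: Point
    ll: Point
    lr: Point


def item_points(itm, quad_tag="qu"):
    """Return the points of one path item, dispatching on its tag itm[0].

    quad_tag exists only so the sketch can replay the old behaviour
    (quad_tag="q") next to the corrected one (quad_tag="qu").
    """
    if itm[0] == "l":            # ("l", p1, p2)
        return [itm[1], itm[2]]
    elif itm[0] == "c":          # ("c", p1, p2, p3, p4)
        return [itm[1], itm[2], itm[3], itm[4]]
    elif itm[0] == quad_tag:     # ("qu", quad)
        quad = itm[1]
        return [quad.ul, quad.ur, quad.lr, quad.ll]
    else:                        # anything else is treated as ("re", rect, ...)
        rect = itm[1]
        return [rect.tl, rect.br]


quad_item = ("qu", FakeQuad((0, 0), (1, 0), (0, 1), (1, 1)))

# Corrected tag: the Quad branch is taken.
print(item_points(quad_item))

# Old tag: "qu" != "q", so the Quad reaches the Rect branch and fails.
try:
    item_points(quad_item, quad_tag="q")
except AttributeError as exc:
    print("old check fails:", exc)
```

With the old tag, the Quad item reaches the Rect branch and the attribute lookup for `tl` raises `AttributeError`, the same failure mode as the traceback above.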

From the PyMuPDF docs (the relevant item is the last one, "qu"):

Each item in path["items"] is one of the following:

- ("l", p1, p2) - a line from p1 to p2 (Point objects).
- ("c", p1, p2, p3, p4) - cubic Bézier curve from p1 to p4 (p2 and p3 are the control points). All objects are of type Point.
- ("re", rect, orientation) - a Rect. Multiple rectangles within the same path are now detected (changed in v1.18.17). Integer orientation is 1 resp. -1 indicating whether the enclosed area is rotated left (1 = anti-clockwise), or resp. right (changed in v1.19.2).
- ("qu", quad) - a Quad. 3 or 4 consecutive lines are detected to actually represent a Quad (changed in v1.19.2). (New in v1.18.17)

JorjMcKie commented 2 months ago

Thanks for submitting this. It has already been noted and corrected. So I will accept the PR but not merge it. Thanks again.