pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Poor Markdown Generation for Particular PDF #75

Closed marty-sullivan closed 1 month ago

marty-sullivan commented 1 month ago

Here is a PDF file I'm testing with that produces poor Markdown: the output is missing text that pymupdf itself can extract. I believe this file was exported from PowerPoint to PDF.

test.pdf

Using pymupdf by itself, the text generated is fine:

import pymupdf

doc = pymupdf.open('test.pdf')
text = '\n'.join([page.get_text() for page in doc])

All text is accounted for in the output:

out.txt

Using pymupdf4llm:

import pymupdf4llm

md = pymupdf4llm.to_markdown('test.pdf')

Much of the text is missing from the output. The titles seem to be there, but much of the slide content is missing:

out.md

Expected Behavior:

I would expect that, even if formatting cannot be preserved, any text in the PDF would still be included in the Markdown.
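One way to quantify the gap is to compare the words found by plain-text extraction with the words that appear in the Markdown. The following is a minimal sketch under stated assumptions: the helper `missing_words` is hypothetical (it is not part of pymupdf or pymupdf4llm), and comparing lowercase word sets is a rough heuristic that ignores word order and repetition.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercase word set of a string; punctuation and Markdown syntax are dropped."""
    return set(re.findall(r"\w+", text.lower()))


def missing_words(plain_text: str, markdown: str) -> list[str]:
    """Return the words that occur in plain_text but nowhere in markdown.

    Hypothetical helper for illustration only.
    """
    return sorted(_words(plain_text) - _words(markdown))


# Tiny stand-in strings. With real files, plain_text would be the joined
# page.get_text() output and markdown the result of pymupdf4llm.to_markdown().
plain = "Roadmap\nPhase one: discovery\nPhase two: delivery"
md = "# Roadmap\n"
print(missing_words(plain, md))
```

An empty result would suggest that no words were dropped; a long list points at the pages worth inspecting by hand.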

hewliyang commented 1 month ago

Try .to_markdown('test.pdf', write_images=True). You will see that much of the content, especially the vector graphics, is detected and output as images instead.

JorjMcKie commented 1 month ago

@hewliyang - thanks for your comment! You are quite right.

There are limits to what the algorithm is capable of telling apart. It tries to detect (and separate from each other) images, drawings and text. @marty-sullivan - your example contains a lot of text surrounded by rectangular drawings with large rounded corners: we are unable to tell these apart from significant vector graphics. So these areas are not extracted as text but are made available as images.

Do not forget: this is not an AI!

marty-sullivan commented 1 month ago

@hewliyang While this may work for some, it is not going to be compatible with the fundamentals of what I understand this module is supposed to do when text is available.

@JorjMcKie While I understand this project is early in development, I was very excited by how well it worked for most PDFs and other documents. And, while I'm sure there are fundamental complications to what I am bringing up, I am not asking for an AI, so I see that as an odd interpretation of what I outlined as my expected behavior.

Given that the underlying pymupdf is able to trivially extract text from these elements, I would expect the same behavior from this module, even if it simply attempts to extract the text and put it in a Markdown blockquote (perhaps detecting binary vs. Unicode content as part of the decision whether to do that).

I hope I don't come off as too critical here, but if the underlying module can trivially extract text from vector graphics, and this module cannot and your viewpoint is "we are not going to fix that" -> Well, that essentially makes this module completely unviable for any real use.

JorjMcKie commented 1 month ago

I understand - don't take my comments too far. What I am saying is this: we cannot and will not do a full-blown analysis of vector graphics in this module that reliably differentiates between graphics made for cosmetic / outlining purposes (as they occur in your example) and "significant" vector graphics like Gantt charts, bar charts, and the like.

What I am doing currently is determining whether the inner area (3 points away from the rectangle's border) only consists of fill color (or is empty). Your boxes have rounded corners which are not contained in stripes of at most 3 points breadth, so I regard them as "significant" vector graphics ... causing the text inside to be ignored. That's the situation at hand.
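The check described above might be pictured as follows. This is only a sketch of the idea, not the module's actual code: the names `in_border_strip` and `is_decoration_only` are hypothetical, and reducing a drawing to a handful of outline points is a simplifying assumption.

```python
import math

Point = tuple[float, float]
Rect = tuple[float, float, float, float]  # x0, y0, x1, y1


def in_border_strip(point: Point, rect: Rect, margin: float = 3.0) -> bool:
    """True if point lies inside rect but within `margin` points of its border."""
    x, y = point
    x0, y0, x1, y1 = rect
    inside_rect = x0 <= x <= x1 and y0 <= y <= y1
    inside_inner = (x0 + margin < x < x1 - margin) and (y0 + margin < y < y1 - margin)
    return inside_rect and not inside_inner


def is_decoration_only(outline: list[Point], rect: Rect, margin: float = 3.0) -> bool:
    """Assumed criterion: every outline point stays inside the border strip."""
    return all(in_border_strip(p, rect, margin) for p in outline)


rect = (0.0, 0.0, 200.0, 100.0)

# A plain frame: its corner points sit exactly on the border.
plain_frame = [(0.0, 0.0), (200.0, 0.0), (200.0, 100.0), (0.0, 100.0)]

# Top-left corner of a rounded frame with radius 20: the midpoint of the arc
# is about 5.9 points away from both edges, i.e. outside the 3-point strip.
r = 20.0
arc_mid = (r * (1 - math.cos(math.pi / 4)), r * (1 - math.sin(math.pi / 4)))
rounded_frame = [(r, 0.0), arc_mid, (0.0, r)]

print(is_decoration_only(plain_frame, rect))    # plain rectangle passes
print(is_decoration_only(rounded_frame, rect))  # rounded corner fails
```

With these sample values the rounded corner reaches into the inner area, which is why such a box would be classified as a "significant" graphic under this kind of test.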

If you have a suggestion for how to handle this, you are most welcome.

marty-sullivan commented 1 month ago

Understood.

I suppose my viewpoint would be: if the user has called .to_markdown('test.pdf', write_images=False), I think the behavior should be to dump any text that can be extracted from the element into a quote block rather than completely ignoring it. Or, an additional flag of dump_graphics_as_text=True might also be appropriate, so as to not change the current behavior.
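A rough sketch of the blockquote idea, assuming the text of a graphics area has already been extracted (pymupdf's `page.get_text()` accepts a `clip` rectangle that could serve for that). The helper `as_blockquote` is hypothetical, and `dump_graphics_as_text` is only the flag name proposed in this comment, not an existing parameter.

```python
def as_blockquote(text: str) -> str:
    """Format extracted text as a Markdown blockquote, skipping blank lines."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(f"> {line}" for line in lines if line)


# Stand-in for text pulled out of a box that was classified as a graphic.
extracted = "Phase one\n\n  discovery \n"
print(as_blockquote(extracted))
```

Keeping such text in a blockquote would mark it as recovered from a graphic while still making it available to downstream consumers.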

I do see the value of the current behavior in the other case: if the user has called .to_markdown('test.pdf', write_images=True), then trying to discern complex graphics, rendering them, and saving them as PNG does make sense as the expected behavior.

JorjMcKie commented 1 month ago

@marty-sullivan thanks for your constructive ideas! Maybe a combination of an additional parameter (dump_graphics ...) plus a somewhat improved significance detection is the way to go.

Currently looking at your rectangles with rounded corners: they cause the problem - normal rectangles wouldn't. If I find a way to compute the enclosed area, comparing it with the area of the "inner rectangle" may yield an improved significance check. The rounded rectangle's area can be approximated by regarding all its defining points as a polygon, the area of which should be computable via Gauß's shoelace algorithm. If that result is close to the area of the graphic's rectangle, then we know that the object is insignificant. I hope I am clear ...
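For reference, the shoelace formula gives the area of a simple polygon with vertices $(x_i, y_i)$ as $A = \frac{1}{2}\left|\sum_i (x_i y_{i+1} - x_{i+1} y_i)\right|$, with indices wrapping around. A minimal sketch of the proposed comparison follows; the function names and the 0.9 threshold are assumptions for illustration, not the values used in the module.

```python
Point = tuple[float, float]


def shoelace_area(points: list[Point]) -> float:
    """Area of a simple polygon given its vertices in order (shoelace formula)."""
    twice_area = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def fills_its_bbox(points: list[Point], threshold: float = 0.9) -> bool:
    """True if the polygon covers at least `threshold` of its bounding box.

    The threshold is an assumed value chosen for this example.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    bbox_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if bbox_area == 0:
        return False
    return shoelace_area(points) / bbox_area >= threshold


# A 100 x 50 box with each corner cut off by 10 points, as a crude stand-in
# for a rounded rectangle: area 4800 of a 5000 bounding box.
rounded_box = [(10, 0), (90, 0), (100, 10), (100, 40),
               (90, 50), (10, 50), (0, 40), (0, 10)]
# A triangle covers only half of its bounding box.
triangle = [(0, 0), (100, 0), (0, 50)]

print(shoelace_area(rounded_box), fills_its_bbox(rounded_box))
print(shoelace_area(triangle), fills_its_bbox(triangle))
```

A shape that nearly fills its bounding box behaves like a frame or highlight, whereas a shape covering much less of it is more likely to carry content of its own.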

JorjMcKie commented 1 month ago

Developed a new "significance check" for vector graphics that works as announced (shoelace algorithm). Graphics consisting of "text decoration" only (highlights, rectangles with rounded corners and the like) are now identified as irrelevant / insignificant. This, however, does not solve the general case, as can be seen in this example: (image)

Green borders surround vector graphics that are regarded as significant because they meet some complexity criterion. Red borders indicate unimportant / decoration-only drawings.

We can see that the algorithm has no way to detect that all graphics are on the same semantic level here and should either all be regarded as vector graphics or all be treated as text.

So, to wrap up, I think there is no way out of this situation other than introducing a new parameter that instructs the algorithm to extract all text - whether or not it is contained inside a graphics (or image!) rectangle.

JorjMcKie commented 1 month ago

Fixed with v0.0.10.