Some images are wrongly extracted

drdsgvo commented 3 days ago

Example PDF: https://github.com/QwenLM/AutoIF/blob/main/Self-play_with_Execution_Feedback_Improving_Instruction-following_Capabilities_of_Large_Language%20Models.pdf

Figure 1 in the PDF (page 1) gets extracted as 2 images. Would maybe be OK. But the left 3 icons are missing in both image-parts. Better would be to have one image and not two, but that is not the main reason for raising this issue.

Figure 2 (page 3): is extracted completely, but on the extracted image there is also the figure text (may be OK), and some of the text below the figure. This is not good.

Figure 3 (page 5): same problems as with Figure 2.

PS: Great library!

JorjMcKie commented 2 days ago

There is a built-in default size limit for objects to be considered for saving: if the width or height of an image is less than 5% of the respective page edge, the image is ignored. Use something like the code below to lower this threshold:

import pathlib

import pymupdf
import pymupdf4llm

doc = pymupdf.open("test.pdf")
md = pymupdf4llm.to_markdown(
    doc,
    write_images=True,  # save extracted images as files
    force_text=False,  # use True if text on images should be extracted too
    margins=0,  # consider the full page
    image_size_limit=0.01,  # 1% of the page edge instead of the default 5%; or 1/1000 or whatever
)
pathlib.Path(doc.name + ".md").write_bytes(md.encode())
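
To make the size rule above concrete, here is a minimal sketch of the check in plain Python. It is not the library's actual code; the function name `passes_size_limit` and its arguments are hypothetical, and only the 5% default and the "width or height" rule are taken from the description above.

```python
def passes_size_limit(img_w, img_h, page_w, page_h, image_size_limit=0.05):
    """Return True if an image would be kept under the described rule.

    Hypothetical helper: an image is ignored if its width OR its height
    is smaller than image_size_limit times the respective page edge.
    """
    wide_enough = img_w >= image_size_limit * page_w
    tall_enough = img_h >= image_size_limit * page_h
    return wide_enough and tall_enough


# A 24 x 24 point icon on a 612 x 792 point (US Letter) page:
icon_default = passes_size_limit(24, 24, 612, 792)  # default limit of 5%
icon_lowered = passes_size_limit(24, 24, 612, 792, image_size_limit=0.01)
```

With the default, the icon is below 5% of the page width (30.6 points) and is dropped; with a limit of 0.01 it is kept. This is probably what happens to the small icons on the left of Figure 1.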

I hope you are aware of the difference between true images and vector graphics? Note: Vector graphics cannot be treated as one integrated image. They consist of separate drawing commands (lines, rectangles, curves). While each of these commands can be extracted, you cannot know whether they belong together. The library does a geometrical analysis and assumes that individual drawings probably belong together if they are no more than 3 points apart. It then joins the respective rectangles into one and makes an image of that area of the page. When that area also contains text, then that is just tough luck, or good luck if that text explains Gantt charts or curves. In addition, you have the option to also extract any text present in these rectangles.
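
The joining step described above can be sketched roughly as follows. This is only an illustration in plain Python, not the library's implementation: the helper names (`is_close`, `union`, `join_rects`) are hypothetical, rectangles are plain `(x0, y0, x1, y1)` tuples, and "not further apart than 3 points" is assumed to mean that the gap is at most 3 points in both the x and the y direction.

```python
def is_close(a, b, tol=3.0):
    """True if rectangles a and b overlap or are at most tol apart
    in both x and y (assumed interpretation of the 3-point rule)."""
    return (
        a[0] - tol <= b[2] and b[0] - tol <= a[2]
        and a[1] - tol <= b[3] and b[1] - tol <= a[3]
    )


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def join_rects(rects, tol=3.0):
    """Repeatedly merge rectangles that are close until nothing changes."""
    merged = [tuple(r) for r in rects]
    changed = True
    while changed:
        changed = False
        result = []
        for r in merged:
            for i, s in enumerate(result):
                if is_close(r, s, tol):
                    result[i] = union(r, s)  # grow the existing cluster
                    changed = True
                    break
            else:
                result.append(r)  # r starts a new cluster
        merged = result
    return merged


# Two drawings 2 points apart are joined; a distant one stays separate.
clusters = join_rects([(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)])
```

Each resulting rectangle would then be rendered as one image. Any text that happens to lie inside such a joined rectangle ends up in the picture as well, which is the effect reported for Figures 2 and 3.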

The library takes the same decision if true images (like embedded PNGs) have text written upon them ... for the same reason.

Once again: The library cannot know whether text on images makes sense or not.

drdsgvo commented 1 day ago

OK, understood. This is regarding figure 1. Call it a feature and use a parameter: For figure 1, there seem to exist 3 parts. These 3 parts could be joined automatically, I assume, as their rectangles are very close to each other.

Regarding figure 2 and 3: I will elaborate on this.

JorjMcKie commented 1 day ago

Investigate yourself! On page one the situation is the following: a wild mixture of (true) images contained within and / or overlapping each other. Standard text is written over some images and also close to images. Yet other text is not standard text but part of some image. The following picture shows image rectangles wrapped in red rectangles and standard (extractable) text has a yellow background. There simply is no way to programmatically make sense out of such a mess.

drdsgvo commented 1 day ago

OK, I see. The case I found and opened seems to be very complex. I was not aware of this. I understand that this cannot be handled easily.

pymupdf / RAG

Some images are wrongly extracted #148