pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0
302 stars, 57 forks

The Markdown syntax for images is always included in the Markdown output. #67

Closed: tamdao closed this issue 2 months ago

tamdao commented 2 months ago

I use v0.0.8 and call to_markdown(doc) with the default write_images=False, but the Markdown syntax for images is always included in the Markdown output.
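One quick way to confirm a report like this is to scan the generated Markdown for image syntax. The sketch below is only an illustration: the helper name `find_image_references` and the regular expression are assumptions introduced here, not part of pymupdf4llm, and the pattern covers only the common inline form `![alt](target)`.

```python
import re

# Matches inline Markdown images such as ![](page-0-1.png) or ![alt](img.png).
# Assumption: reference-style images (![alt][id]) are not covered.
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def find_image_references(md_text):
    """Return every inline image reference found in a Markdown string."""
    return IMAGE_PATTERN.findall(md_text)


# In practice md_text would come from something like
#   md_text = pymupdf4llm.to_markdown("saint.pdf", write_images=False)
# A literal sample is used here so the sketch runs without the PDF.
sample = "Intro text\n\n![](saint.pdf-0-0.png)\n\nMore text with a [link](https://example.com)."
print(find_image_references(sample))
```

An empty list means the output contains no inline image references; ordinary links are not matched because the pattern requires the leading `!`.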

saint.pdf output.md

There is another issue with this file. When I set write_images=True, it doesn't work correctly. Even though the file doesn't contain any images, the result includes some white images.

JorjMcKie commented 2 months ago

Cannot reproduce! This is the output I am getting: see the attached image and saint.md

JorjMcKie commented 2 months ago

Your PDF contains multiple areas with a white background, written as vector graphics. Version 0.0.8 has improved logic that ignores vector graphics which consist only of background coloring. Significant vector graphics are converted to images and treated like images. To count as significant, a vector graphic must also contain stroked drawings (other than those that merely form the border of the drawing rectangle).
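The rule described above can be sketched roughly as follows. This is not the library's actual implementation, only a simplified illustration. It assumes path dictionaries shaped like those returned by PyMuPDF's `page.get_drawings()` (a `"type"` of `"f"`, `"s"` or `"fs"` and a `"rect"`), uses plain tuples for rectangles, and the helper names and tolerance are hypothetical.

```python
def is_fill_only(path):
    """True if the path only fills an area (no stroke), e.g. background coloring."""
    return path.get("type") == "f"


def is_border_stroke(path, clip, tol=1.0):
    """True if a stroked path's bounding box coincides with the enclosing rectangle.

    Simplification: any stroke spanning the whole rectangle is treated as a border.
    """
    return all(abs(a - b) <= tol for a, b in zip(path["rect"], clip))


def is_significant(paths, clip):
    """True if at least one path is stroked and is not just the rectangle's border."""
    for path in paths:
        if is_fill_only(path) or is_border_stroke(path, clip):
            continue
        return True
    return False


# Hypothetical sample data: a white background, a border, and an inner line.
clip = (0, 0, 100, 50)
background = {"type": "f", "fill": (1, 1, 1), "rect": (0, 0, 100, 50)}
border = {"type": "s", "rect": (0, 0, 100, 50)}
inner_line = {"type": "s", "rect": (10, 10, 90, 10)}

print(is_significant([background], clip))
print(is_significant([background, border], clip))
print(is_significant([background, inner_line], clip))
```

Under these assumptions, a white background alone, or a background with only a border, would be ignored, while a graphic that also contains an inner stroked line would be treated like an image.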

tamdao commented 2 months ago

@JorjMcKie Thanks for your response. I apologize for the confusion; I checked the wrong version.