pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Some images are missing with new version #152

Closed: Cozokim closed this issue 1 month ago

Cozokim commented 1 month ago

Hi, I had a functional setup with pymupdf==1.24.9 and pymupdf4llm==0.0.16.

But now, when I create a new environment from scratch with pymupdf==1.24.10 and pymupdf4llm==0.0.17 and call pymupdf4llm.to_markdown(write_images=True), some images from my PDF are not extracted.

Manually downgrading the libraries didn't fix the problem, but installing from scratch with pymupdf==1.24.9 and pymupdf4llm==0.0.16 did, so I assume it's caused by another library installed during the pip install of those versions.

I was able to fix the problem, but I just wanted to let you know :)

Cheers

JorjMcKie commented 1 month ago

There is a size limit for images to be considered for extraction. This prevents tiny specks from being written out as image files. The default is image_size_limit=0.05, which ignores images whose width (height) is less than 0.05 * page.rect.width (height). Set this to a smaller number if you want to see more images.
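
To make the threshold concrete, here is a minimal sketch of the size check as described above. It is not the library's actual code: the helper name `is_image_kept` is hypothetical, and it assumes an image is skipped when either its width or its height falls below the limit, which is one reading of the description.

```python
def is_image_kept(img_width, img_height, page_width, page_height,
                  image_size_limit=0.05):
    """Hypothetical helper mirroring the size rule described above.

    Assumption: an image is dropped if EITHER dimension is smaller than
    image_size_limit times the matching page dimension.
    """
    min_width = image_size_limit * page_width
    min_height = image_size_limit * page_height
    return img_width >= min_width and img_height >= min_height


# Illustrative page size in points (roughly A4); the image sizes are made up.
PAGE_W, PAGE_H = 595, 842

# With the default limit the thresholds are 29.75 x 42.1 points.
print(is_image_kept(20, 20, PAGE_W, PAGE_H))          # False: too small
print(is_image_kept(100, 100, PAGE_W, PAGE_H))        # True: large enough
# Lowering the limit to 0.01 (thresholds 5.95 x 8.42) keeps the small image.
print(is_image_kept(20, 20, PAGE_W, PAGE_H, 0.01))    # True
```

In practice the limit is passed straight to the call from the original report, for example `pymupdf4llm.to_markdown("input.pdf", write_images=True, image_size_limit=0.01)`, where `"input.pdf"` is a placeholder path.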