pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Exclude images based on size threshold parameter #134

Closed. kingennio closed this issue 2 months ago

kingennio commented 2 months ago

Hi there, I was having an issue with pymupdf4llm: some of the images were not being extracted from my PDF. At first I thought it was a problem with the margin settings, and I tried fiddling with those to no avail. Then I stepped into the code with the debugger and finally found the cause. The inner function "save_image" in function "to_markdown" (line 332) checks the size of each image and discards any image that is smaller than 5% of the page size. Normally this may well be the desired behavior, but in my case it is not: I was missing important information contained in those small images.

If I may, I'd like to suggest (yet another) parameter for to_markdown that allows overriding this threshold. Something like: to_markdown(......., ignore_images_smaller_than=0.05), so that the caller can pass 0.0 to retain everything.

For the time being, I've modified the code because I really do need to keep those images. Thank you for your great package!
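To make the request above concrete, here is a minimal sketch of a size-threshold check with a caller-supplied override. It is not the pymupdf4llm implementation: the names `Size`, `keep_image` and `min_fraction` are hypothetical, and it assumes that "smaller than 5% of the page size" means the image's width or height falls below that fraction of the page's width or height.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    """Width and height in points (hypothetical helper type)."""
    width: float
    height: float


def keep_image(image: Size, page: Size, min_fraction: float = 0.05) -> bool:
    """Return True if the image should be kept.

    Assumption: an image is discarded when its width OR its height is
    below `min_fraction` of the corresponding page dimension.
    Passing 0.0 (or less) disables the filter and keeps every image.
    """
    if min_fraction <= 0:
        return True
    return (
        image.width >= page.width * min_fraction
        and image.height >= page.height * min_fraction
    )


# Example: an A4 page (595 x 842 pt) with a small 20 x 20 pt icon.
a4 = Size(595, 842)
icon = Size(20, 20)
dropped_by_default = not keep_image(icon, a4)          # 20 < 595 * 0.05
kept_with_override = keep_image(icon, a4, min_fraction=0.0)
```

With the default of 0.05 the icon is filtered out, while passing 0.0 retains it, which is the behavior the suggested parameter is meant to allow.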

JorjMcKie commented 2 months ago

Fixed in version 0.0.15.