pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

minimum area for images & vector graphics #74

Closed: hewliyang closed this issue 4 months ago

hewliyang commented 4 months ago

image

For poorly formatted elements like this table, page.get_image_info() returns a large number of erroneous "images" like the ones below. Probably the right decision is to set graphics_limit and ignore the page.

ex1

ex2

While this indicates that layout detection is unlikely to work well on such a page anyway, I'm wondering whether it would be beneficial to add an extra parameter, min_area, so that images with img['height'] * img['width'] < min_area are ignored as a simple threshold, or whether there is another feasible solution to this issue.

This would also avoid outputting small logos and watermarks located in the corners of documents, which may appear on every single page.
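
To make the proposal concrete, here is a minimal sketch of the threshold. It is an illustration only, not how the library implements it: filter_small_images is a hypothetical helper, min_area is the proposed parameter (it is not assumed to exist in any released API), and the sample entries are hand-made dicts that only mimic the 'width' and 'height' keys of page.get_image_info() entries.

```python
def filter_small_images(image_infos, min_area=0):
    """Keep only entries whose pixel area (width * height) is at least min_area.

    Hypothetical helper: `image_infos` is any list of dicts carrying
    'width' and 'height' keys, as the entries of page.get_image_info() do.
    """
    return [
        img for img in image_infos
        if img["width"] * img["height"] >= min_area
    ]


# Hand-made sample data (an assumption for illustration, not real output).
sample = [
    {"number": 0, "width": 4, "height": 3},      # tiny table fragment
    {"number": 1, "width": 800, "height": 600},  # real figure
    {"number": 2, "width": 32, "height": 32},    # small corner logo
]

kept = filter_small_images(sample, min_area=2000)
print([img["number"] for img in kept])  # prints [1]
```

With a real document, the same helper could be applied to the list returned by page.get_image_info(). As far as I understand, 'width' and 'height' there are the image's own pixel dimensions, so a threshold on the displayed size would need to use the 'bbox' entry instead.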

JorjMcKie commented 4 months ago

Thanks for your post. Valid point! We probably need a check for generally insignificant images: small areas, unicolor etc.

JorjMcKie commented 4 months ago

Fixed with v0.0.10.