minimum area for images & vector graphics

hewliyang commented 4 months ago

For poorly formatted elements like this table, page.get_image_info() returns a large number of erroneous "images", as shown below. Probably the right decision is to set graphics_limit and ignore the page.

[Image: ex1]

[Image: ex2]

While this is an indication that layout detection is unlikely to work well anyway, I'm wondering whether it would be beneficial to add an extra parameter, min_area, so that images with img['height'] * img['width'] < min_area are ignored as a simple threshold, or whether there is another feasible solution to this issue.

This would also avoid outputting small logo watermarks in the corners of documents, which may appear on every single page.
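To make the proposal concrete, here is a minimal sketch of the threshold I have in mind. The helper name filter_small_images and the sample values are made up for illustration, and it runs on plain dicts shaped like the entries from page.get_image_info(), so it is not meant as the library's actual implementation:

```python
def filter_small_images(image_infos, min_area=0):
    """Keep only entries whose width * height is at least min_area.

    image_infos: list of dicts with "width" and "height" keys, e.g. the
    entries returned by page.get_image_info() (assumed shape).
    min_area: hypothetical threshold; smaller images are dropped.
    """
    return [
        img for img in image_infos
        if img["width"] * img["height"] >= min_area
    ]


# Stand-in data instead of a real PDF page (values are invented).
sample = [
    {"number": 0, "width": 4, "height": 3},      # tiny table fragment
    {"number": 1, "width": 640, "height": 480},  # real picture
    {"number": 2, "width": 32, "height": 32},    # small corner logo
]

kept = filter_small_images(sample, min_area=2000)
print([img["number"] for img in kept])
```

With a real document this would be called as filter_small_images(page.get_image_info(), min_area=...). As far as I understand, width and height there are the image's own pixel dimensions, so a threshold on the on-page size would probably need to use the bbox entry instead.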

JorjMcKie commented 4 months ago

Thanks for your post. Valid point! We probably need a check for generally insignificant images: small areas, unicolor etc.

JorjMcKie commented 4 months ago

Fixed with v0.0.10.

pymupdf / RAG

minimum area for images & vector graphics #74