For poorly formatted elements like this table, page.get_image_info() returns a large number of erroneous "images" like the ones below. The right decision is probably to set graphics_limit and ignore the page.
While this indicates that layout detection is unlikely to work well on such a page anyway, I'm wondering whether it would be beneficial to add an extra parameter, min_area, so that images with img['height'] * img['width'] < min_area are ignored as a simple threshold, or whether there is another feasible solution to this issue.
This would also avoid outputting small logo watermarks in the corners of documents, which may appear on every single page.
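
To make the proposal concrete, here is a minimal sketch of the threshold applied as a post-filter on the list of dicts that page.get_image_info() returns. The helper name filter_small_images and the sample values are hypothetical and only for illustration; min_area is the proposed parameter, not an existing option.

```python
from typing import Any, Dict, List


def filter_small_images(
    image_infos: List[Dict[str, Any]], min_area: int = 0
) -> List[Dict[str, Any]]:
    """Keep only entries whose width * height is at least min_area.

    Hypothetical helper: `image_infos` is expected to look like the output
    of page.get_image_info(), i.e. dicts with "width" and "height" keys.
    """
    return [
        img for img in image_infos
        if img["width"] * img["height"] >= min_area
    ]


# Hand-written stand-ins for get_image_info() entries (made-up values).
sample_infos = [
    {"number": 1, "width": 4, "height": 3},      # tiny fragment, area 12
    {"number": 2, "width": 64, "height": 64},    # small logo, area 4096
    {"number": 3, "width": 800, "height": 600},  # real figure, area 480000
]

kept = filter_small_images(sample_infos, min_area=5000)
print([img["number"] for img in kept])  # [3]
```

On a real page this would be called as filter_small_images(page.get_image_info(), min_area=...). As far as I understand, width and height there are the pixel dimensions of the stored image, so a threshold on the displayed size would probably need to use the bbox entry instead.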