pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Detect dotted gridlines for tables #3540

Closed JorjMcKie closed 3 months ago

JorjMcKie commented 4 months ago

Addresses #3539

We previously did not detect dotted lines used as table gridlines.

If one of width / height is less than or equal to edge_min_length and also less than the other dimension, the rectangle is treated as a vertical / horizontal line, respectively. Previously, we incorrectly used the dimension-specific snap values for this check.

We also no longer ignore rectangles when both width and height are smaller than edge_min_length. Instead, we leave them to the snapping and joining algorithms that run later.
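The rule described above can be sketched roughly as follows. This is only an illustration of the comparison as stated in this description, not the code in the PR: the function name `classify_thin_rect` and the value 3.0 used for `edge_min_length` are assumptions made for the example.

```python
def classify_thin_rect(width, height, edge_min_length=3.0):
    """Classify a rectangle as a line, following the rule stated above.

    Returns "vertical", "horizontal", or None (not treated as a line here).
    The name and the 3.0 value are illustrative assumptions.
    """
    # Narrow and taller than wide: treat as a vertical line.
    if width <= edge_min_length and width < height:
        return "vertical"
    # Flat and wider than tall: treat as a horizontal line.
    if height <= edge_min_length and height < width:
        return "horizontal"
    return None
```

Note that a small dash such as 1 x 2 points, where both dimensions are below `edge_min_length`, is still classified here rather than dropped, so later snapping and joining steps get a chance to merge such pieces into one gridline.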

JorjMcKie commented 3 months ago

The reason is that the PDF has only a handful of pages that contain images at all (0-based): 11, 149, 154, 158, 162, 164. Everything else is vector graphics. They cannot be extracted. All you can do is find suitable rectangles wrapping them, then make a pixmap of each of those rectangles and save it as an image file. We have detection algorithms that are able to cluster atomic graphical commands into larger drawings - see Page method cluster_drawings. The method returns a list of rectangles, each containing one such cluster. Use this list as described above. There is also a little utility doing this here.
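A minimal sketch of that approach, assuming a reasonably recent PyMuPDF where `Page.cluster_drawings()` is available. The function names, the output file naming, and the `dpi` value are assumptions for illustration; this is not the utility mentioned above.

```python
from pathlib import Path


def save_drawing_clusters(page, out_dir, dpi=150):
    """Render each drawing cluster on `page` to its own PNG file.

    `page` is expected to be a PyMuPDF Page. Returns the target paths.
    File naming and dpi are illustrative choices.
    """
    out_dir = Path(out_dir)
    written = []
    for i, bbox in enumerate(page.cluster_drawings()):
        # Rasterize only the area inside the cluster rectangle.
        pix = page.get_pixmap(clip=bbox, dpi=dpi)
        target = out_dir / f"page{page.number}-drawing{i}.png"
        pix.save(str(target))
        written.append(target)
    return written


def export_pdf_drawings(pdf_path, out_dir, dpi=150):
    """Open a PDF and export the drawing clusters of every page."""
    import pymupdf  # PyMuPDF; older releases are imported as `fitz`

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    doc = pymupdf.open(pdf_path)
    written = []
    for page in doc:
        written.extend(save_drawing_clusters(page, out_dir, dpi=dpi))
    doc.close()
    return written
```

Calling `export_pdf_drawings("input.pdf", "drawings")` would then write one PNG per detected cluster; both arguments are placeholders.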

As for header / footer exclusion: This is your job - PDF knows nothing about what these things are. You could do the obvious thing and ignore every bbox returned by page.cluster_drawings() that has y0 <= 72 or y1 >= page.rect.height - 72. Modify 72 as appropriate.
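That filter could look something like the sketch below. The helper name `drop_header_footer` and its `margin` parameter are made up for this example; it only relies on each bbox being indexable as (x0, y0, x1, y1), which holds for PyMuPDF Rect objects and for plain tuples.

```python
def drop_header_footer(bboxes, page_height, margin=72):
    """Keep only bboxes that stay clear of the top and bottom margins.

    A bbox is dropped if y0 <= margin or y1 >= page_height - margin,
    mirroring the condition suggested above. `margin` is in points.
    """
    return [
        b for b in bboxes
        if not (b[1] <= margin or b[3] >= page_height - margin)
    ]
```

With a PyMuPDF page this would be called as `drop_header_footer(page.cluster_drawings(), page.rect.height)`, adjusting `margin` to the layout of the document at hand.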