Detect dotted gridlines for tables

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

GNU Affero General Public License v3.0


The reason is that only a handful of pages in the PDF contain images at all (0-based page numbers): 11, 149, 154, 158, 162, 164. Everything else is vector graphics, which cannot be extracted as images. All you can do is find suitable rectangles wrapping them, make a pixmap of each rectangle, and save it as an image file. We have detection algorithms that identify and cluster atomic graphical commands into larger drawings - see the Page method cluster_drawings. The method returns a list of rectangles, one per cluster. Use this list as described above. There is also a little utility doing this here.
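A minimal sketch of that approach, assuming a PyMuPDF version recent enough to provide Page.cluster_drawings. The names save_drawing_clusters and export_document, the file-name pattern and the 150 dpi default are illustrative choices, not part of the library:

```python
from pathlib import Path


def save_drawing_clusters(page, out_dir, dpi=150):
    """Render each clustered vector drawing on `page` to its own PNG file.

    `page` is expected to be a PyMuPDF Page; returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # cluster_drawings() groups nearby vector drawing commands and returns
    # one bounding rectangle per group.
    for i, bbox in enumerate(page.cluster_drawings()):
        # Vector graphics cannot be extracted as images, so rasterize
        # only the area inside the cluster's rectangle.
        pix = page.get_pixmap(clip=bbox, dpi=dpi)
        target = out_dir / f"page{page.number}-drawing{i}.png"
        pix.save(str(target))
        written.append(target)
    return written


def export_document(pdf_path, out_dir):
    """Hypothetical driver: run the helper over every page of a PDF."""
    import pymupdf  # older releases use `import fitz` instead

    doc = pymupdf.open(pdf_path)
    files = []
    for page in doc:
        files.extend(save_drawing_clusters(page, out_dir))
    doc.close()
    return files
```

Calling export_document("input.pdf", "drawings") would write one PNG per detected cluster; the tolerance parameters of cluster_drawings may need tuning if clusters come out too small or too large.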

As for header / footer exclusion: this is your job - PDF knows nothing about what these things are. You could do the obvious thing and ignore every bbox returned by page.cluster_drawings() that has y0 <= 72 or y1 >= page.rect.height - 72 (72 points is one inch). Modify 72 as appropriate.
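The margin test can be sketched as a small filter. The function name drop_header_footer and its margin parameter are hypothetical; it only assumes that each bbox unpacks into four coordinates, which PyMuPDF Rect objects and plain tuples both do:

```python
def drop_header_footer(bboxes, page_height, margin=72):
    """Keep only rectangles lying strictly between the top and bottom margins."""
    kept = []
    for bbox in bboxes:
        x0, y0, x1, y1 = bbox  # works for a Rect or a plain 4-tuple
        # Discard anything that starts in the top band or ends in the bottom band.
        if y0 <= margin or y1 >= page_height - margin:
            continue
        kept.append(bbox)
    return kept
```

With a real page this would be called as drop_header_footer(page.cluster_drawings(), page.rect.height), passing a different margin if the headers and footers are taller or shorter than one inch.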

Detect dotted gridlines for tables #3540