Bug of the extract_table() function

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

6.31k stars, 647 forks

Hi @Jack251970, and thanks for your interest in this library. Unfortunately, PDFs that may look the same don't always have the same internal structure. Using pdfplumber's visual debugging features to examine your example, it appears there are some extra, non-visible rectangles in the second table that interfere with its parsing: https://notebooksharing.space/view/933e7258736d316ee2cf2829dce2dbdc4b7e0e3fa8137fbfaa45011b3e707036

Because PDFs can encode visible and non-visible lines/rectangles in so many different ways, I don't see this as a bug but rather just a complexity of working with PDFs.

To address your specific situation, you can try removing those lines via page.filter(...). See here for an example of how to do that: https://github.com/jsvine/pdfplumber/issues/311
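As a rough illustration of the page.filter(...) approach, here is a minimal sketch. It assumes the stray rectangles can be recognized because they are neither stroked nor filled, and that pdfplumber exposes this as "stroke" and "fill" entries on each rect object. That criterion, and the helper names is_visible and extract_tables_without_hidden_rects, are hypothetical. The right test for a given PDF should come from inspecting its objects with the visual debugging tools.

```python
def is_visible(obj):
    """Predicate for page.filter(): return False only for rects we assume are invisible.

    Assumption: an invisible rectangle is one that is neither stroked nor filled.
    Objects that lack those keys are kept, so nothing is dropped by accident.
    """
    if obj.get("object_type") != "rect":
        return True  # keep chars, lines, curves, images, etc.
    stroke = obj.get("stroke", True)
    fill = obj.get("fill", True)
    return bool(stroke or fill)


def extract_tables_without_hidden_rects(pdf_path, page_number=0):
    """Extract tables from one page after filtering out the assumed-invisible rects."""
    import pdfplumber  # imported here so the predicate above can be tested on its own

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        filtered = page.filter(is_visible)  # returns a new, filtered page
        return filtered.extract_tables()
```

If the extra rectangles turn out to be stroked or filled after all (for example, painted in the background color), the predicate would need a different test, such as one based on color or size.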

Bug of the extract_table() function #795
