jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Bug of the extract_table() function #795

Closed by Jack251970 1 year ago

Jack251970 commented 1 year ago

Describe the bug

If a table contains multiple rows of content, there is a probability that extract_tables() will return incorrect results, such as extra rows or columns.

PDF file

This is the PDF without the bug: without_bugs.pdf

This is the PDF with the bug: with_bugs.pdf

The format of the tables in these two pdf files is almost identical, but the tables parsed are very different.

Call extract_tables() on both files and you can see the difference.
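A minimal sketch of that comparison, assuming both attachments have been saved locally and that the tables of interest are on the first page. The helper names `table_shapes` and `compare_files` are made up for illustration and are not part of pdfplumber; only `pdfplumber.open`, `pdf.pages`, and `page.extract_tables()` are library calls.

```python
def table_shapes(tables):
    """Return (row_count, max_column_count) for each extracted table."""
    return [(len(t), max((len(row) for row in t), default=0)) for t in tables]


def compare_files(path_a, path_b, page_number=0):
    """Extract tables from the same page of two PDFs and return their shapes."""
    import pdfplumber  # third-party; imported here so table_shapes works without it

    shapes = []
    for path in (path_a, path_b):
        with pdfplumber.open(path) as pdf:
            page = pdf.pages[page_number]  # assumption: tables are on this page
            shapes.append(table_shapes(page.extract_tables()))
    return shapes
```

Calling `compare_files("without_bugs.pdf", "with_bugs.pdf")` should make extra rows or columns show up as differing shapes between the two files.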

Screenshots

Here is my PDF: SharedScreenshot1

Here is part of the parsed tables, shown in table form: pdfplumber's Bug

Environment

jsvine commented 1 year ago

Hi @Jack251970, and thanks for your interest in this library. Unfortunately, PDFs that may look the same don't always have the same internal structure. Using pdfplumber's visual debugging features to examine your example, it appears there are some extra, non-visible rectangles in the second table that interfere with its parsing: https://notebooksharing.space/view/933e7258736d316ee2cf2829dce2dbdc4b7e0e3fa8137fbfaa45011b3e707036

Screen Shot

Because PDFs can encode visible and nonvisible lines/rectangles in so many different ways, I don't see this as a bug but rather just a complexity of working with PDFs.
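For anyone who wants to reproduce that kind of inspection locally, here is a rough sketch. It saves a table-finder debug image and groups the page's rectangles by line width and fill color, which can help unusual groups stand out. The helper names `summarize_rects` and `debug_page`, the page index, and the output path are assumptions for illustration. Which properties actually distinguish the non-visible rectangles in this particular file is not stated above, so the grouping keys may need to change.

```python
from collections import Counter


def summarize_rects(rects):
    """Count rect objects by (linewidth, fill color) to spot unusual groups."""
    counts = Counter()
    for r in rects:
        # Colors may be lists or tuples, so repr() is used to get a hashable key.
        key = (r.get("linewidth"), repr(r.get("non_stroking_color")))
        counts[key] += 1
    return counts


def debug_page(path, page_number=0, out_path="debug.png"):
    """Save a table-finder debug image and return a summary of the page's rects."""
    import pdfplumber  # third-party; imported here so summarize_rects works without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]  # assumption: the table is on this page
        im = page.to_image(resolution=150)
        im.debug_tablefinder()  # overlay what the table finder detected
        im.save(out_path)
        return summarize_rects(page.rects)
```

Running `debug_page` on both attachments and comparing the two summaries is one way to see which rectangles exist only in the problematic file.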

To address your specific situation, you can try removing those rectangles via page.filter(...). For an example of how to do that, see: https://github.com/jsvine/pdfplumber/issues/311
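A minimal sketch of what such a filter could look like. The rule inside `keep_object` (a rect with zero line width and no fill color) is only a guess at what "non-visible" means here and is not taken from the linked issue; inspect `page.rects` for the affected file and adjust it. `keep_object` and `extract_tables_filtered` are hypothetical names, while `page.filter(...)` and `extract_tables()` are the library calls mentioned in this thread.

```python
def keep_object(obj):
    """Return False for rects assumed to be non-visible, True for everything else."""
    if obj.get("object_type") != "rect":
        return True  # chars, lines, curves, etc. are left untouched
    # Hypothetical criterion: no stroke width and no fill color.
    is_hidden = obj.get("linewidth") == 0 and obj.get("non_stroking_color") is None
    return not is_hidden


def extract_tables_filtered(path, page_number=0):
    """Extract tables from a page after dropping the rects rejected by keep_object."""
    import pdfplumber  # third-party; imported here so keep_object works without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]  # assumption: the table is on this page
        filtered = page.filter(keep_object)  # returns a filtered copy of the page
        return filtered.extract_tables()
```

Keeping the predicate as a separate plain function makes it easy to check against a few sample object dictionaries before running it on a whole document.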