This is an example document with redacted information. Real documents have a similar structure, but with more pages, sometimes with multiple and different tables.
Expected behavior
There should be two tables:
One with columns: ODSOTNI UČITELJ/ICA, URA, RAZRED, UČILNICA, NADOMEŠČA, PREDMET, OPOMBA
Another with columns: RAZRED, URA, UČITELJ/ICA, PREDMETA, UČILNICA, OPOMBA
The text between the two tables (MENJAVA UR) should not appear in either table.
Actual behavior
Both tables are extracted as a single table. The result contains many extra, misaligned columns filled with None.
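As a stopgap, the merged result can be split after extraction. The sketch below is only an illustration under assumptions: the rows in `merged` are made-up sample data (not output from the attached PDF), and the helper names `split_merged_table` and `drop_empty_columns` are hypothetical, not part of pdfplumber. It cuts the row list at the row containing the MENJAVA UR text and then drops columns that are empty in each half.

```python
from typing import List, Optional

Row = List[Optional[str]]


def split_merged_table(rows: List[Row], marker: str = "MENJAVA UR") -> List[List[Row]]:
    """Split rows at the first row that contains the marker text; the marker row is dropped."""
    for i, row in enumerate(rows):
        if any(cell and marker in cell for cell in row):
            return [rows[:i], rows[i + 1:]]
    return [rows]  # marker not found: leave the table as it is


def drop_empty_columns(rows: List[Row]) -> List[Row]:
    """Remove columns in which every cell is None or an empty string."""
    if not rows:
        return []
    width = max(len(r) for r in rows)
    padded = [list(r) + [None] * (width - len(r)) for r in rows]
    keep = [c for c in range(width) if any(r[c] not in (None, "") for r in padded)]
    return [[r[c] for c in keep] for r in padded]


# Made-up rows shaped like a merged extraction (assumption, for illustration only).
merged = [
    ["ODSOTNI UČITELJ/ICA", None, "URA", "RAZRED", None],
    ["A", None, "1", "2.B", None],
    ["MENJAVA UR", None, None, None, None],
    ["RAZRED", "URA", None, None, "OPOMBA"],
    ["3.A", "2", None, None, "x"],
]

first, second = [drop_empty_columns(part) for part in split_merged_table(merged)]
```

One caveat: a column that is legitimately empty in one table (for example an unused OPOMBA column) would also be dropped, so this is not a substitute for correct table detection.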
Screenshots
Environment
pdfplumber version: 0.9.0
Python version: 3.10
OS: Windows
Additional context
The PDFs I need to parse all share a similar layout, with around 5 different table formats. However, not every file uses all of those formats, and the tables vary in height, so I can't simply crop the page to a fixed region before extracting tables.
Maybe this can be solved by adding an option to only extract tables where there are visible lines around cells?
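A possible workaround in the meantime is to compute the crop regions per page from the position of the text that separates the tables, rather than using fixed coordinates. The following is a minimal sketch, not a tested fix for the attached PDF. It assumes the separator can be found as a single word ("MENJAVA") via `page.extract_words()`, and that there is only one such separator on the page. The function names `regions_around_heading` and `extract_tables_split_at` and the `pad` parameter are my own, not part of pdfplumber.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom), the order pdfplumber uses


def regions_around_heading(page_bbox: BBox, heading_top: float,
                           heading_bottom: float, pad: float = 1.0) -> List[BBox]:
    """Return the page regions above and below a heading, skipping empty ones."""
    x0, top, x1, bottom = page_bbox
    regions = []
    if heading_top - pad > top:
        regions.append((x0, top, x1, heading_top - pad))
    if heading_bottom + pad < bottom:
        regions.append((x0, heading_bottom + pad, x1, bottom))
    return regions


def extract_tables_split_at(pdf_path: str, heading_word: str = "MENJAVA",
                            page_number: int = 0) -> list:
    """Extract tables separately from the regions above and below the heading word."""
    import pdfplumber  # imported here so the helper above has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        hits = [w for w in page.extract_words() if w["text"] == heading_word]
        if not hits:
            # No separator found on this page: fall back to normal extraction.
            return page.extract_tables()
        heading = hits[0]
        tables = []
        for bbox in regions_around_heading(page.bbox, heading["top"], heading["bottom"]):
            tables.extend(page.crop(bbox).extract_tables())
        return tables
```

Calling `extract_tables_split_at("example document.pdf")` would then return the tables found above and below the heading as separate entries. Pages with several separators or other table formats would need the same idea applied to each heading in turn.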
Describe the bug
When a PDF page contains multiple tables, they are extracted as a single table. This results in misaligned and empty columns.
Code to reproduce the problem
PDF file
example document.pdf