pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
4.69k stars 461 forks

table extraction not working properly - when there is a change in contrast between Title and rows #3668

Closed sreeram1658 closed 1 week ago

sreeram1658 commented 2 weeks ago

Description of the bug

I am trying to extract a table from my PDF document using fitz (PyMuPDF):

    import fitz  # PyMuPDF

    doc = fitz.open("sample_table.pdf")
    page = doc[4]  # the table is on the fifth page (0-based index 4)
    tabs = page.find_tables(horizontal_strategy="lines", vertical_strategy="lines")
    tab = tabs[0]  # first table detected on the page
    df = tab.to_pandas()
    df

My document - image

Output comes something like this - image

How to reproduce the bug

Already explained above

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.9

JorjMcKie commented 2 weeks ago

This post cannot be accepted as an issue yet because a reproducing file has not been supplied.

JorjMcKie commented 1 week ago

Closed because of an extended period of time without a response from the user.