pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

pymupdf find tables too slow #2885

Closed hieudx149 closed 10 months ago

hieudx149 commented 10 months ago

While conducting experiments to extract tables from PDF files, I observed that the table-finding function (page.find_tables()) in PyMuPDF is relatively slow compared to PDFPlumber. Is this expected behavior?

JorjMcKie commented 10 months ago

This is a known issue, unfortunately introduced in the current version. Will be fixed in the next version.

hieudx149 commented 10 months ago

@JorjMcKie Thank you, looking forward to the next version.

zhaop-l commented 10 months ago

I would like to know approximately when the next version will be released. The 'find_tables' function is simply too slow.

julian-smith-artifex-com commented 10 months ago

Fixed in PyMuPDF-1.23.8.

kamata commented 9 months ago

1.23.15 is too late

JorjMcKie commented 9 months ago

> 1.23.15 is too late

We improved the speed in version 1.23.8 already.

What do you mean? Still too slow in 1.23.8?

kamata commented 9 months ago

Versions 1.23.8 through 1.23.12 are not slow, but 1.23.13 and later are slow.

JorjMcKie commented 9 months ago

> Versions 1.23.8 through 1.23.12 are not slow, but 1.23.13 and later are slow.

Unfortunately, your 1-line comments cannot lead to actionable consequences. Please provide one example file against which we can run comparisons.

kamata commented 9 months ago

Here is a sample PDF: slow.pdf

JorjMcKie commented 9 months ago

Thanks for letting us have an example. The problem here is that there are 6914 vector graphics of which 6898 are white and borderless.

These graphics can be ignored by using the detection strategy "lines_strict" instead of the default "lines". The default "lines" takes the border of every rectangle into account when detecting tables, even for rectangles that have no border.

This is the result when using "lines_strict":

>>> import fitz
>>> fitz.__version__
'1.23.16'
>>> import time
>>> mt = time.perf_counter
>>> doc = fitz.open("slow.pdf")
>>> page = doc[0]
>>> t0 = mt(); tabs = page.find_tables(strategy="lines_strict"); t1 = mt()
>>> t1 - t0
0.32812639998155646
>>> for tab in tabs:
...     print("-" * 20)
...     for e in tab.extract():
...         print(e)
...     print()
...
--------------------
['', None, None]
['', '', '']
['', '', '']

--------------------
['']
['']

--------------------
['', '']
['', '']
['', '']
['', '']
['', '']
['', '']

>>> page.rotation
0
>>> graphics = page.get_drawings()
>>> len(graphics)
6914
>>> whitefills = [p for p in graphics if p["type"] == "f"]  # fill-only paths
>>> len(whitefills)
6898

Three tables are detected. The other 2 drawings have no internal structure and are thus not regarded as tables.
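To check whether a page is dominated by borderless fills before choosing a strategy, the drawings can be grouped by their "type" value. The following is a minimal sketch, not part of PyMuPDF: the helper names summarize_drawings and is_borderless_white are made up for illustration, the sample list is hand-made data shaped like page.get_drawings() entries, and it assumes that "fill" holds an RGB tuple of floats.

```python
from collections import Counter


def summarize_drawings(drawings):
    """Count paths by "type": "f" = fill only, "s" = stroke only, "fs" = both."""
    return Counter(d.get("type") for d in drawings)


def is_borderless_white(d):
    """True for a fill-only path whose fill color is pure white (assumed RGB)."""
    return d.get("type") == "f" and tuple(d.get("fill") or ()) == (1.0, 1.0, 1.0)


# Hand-made stand-ins for page.get_drawings() entries (only the keys used here).
sample = [
    {"type": "f", "fill": (1.0, 1.0, 1.0), "color": None},
    {"type": "f", "fill": (1.0, 1.0, 1.0), "color": None},
    {"type": "s", "fill": None, "color": (0.0, 0.0, 0.0)},
    {"type": "fs", "fill": (0.9, 0.9, 0.9), "color": (0.0, 0.0, 0.0)},
]

counts = summarize_drawings(sample)
white = [d for d in sample if is_borderless_white(d)]
print(counts["f"], len(white))
```

With a real page, summarize_drawings(page.get_drawings()) would give the same kind of breakdown. A large share of "f" entries suggests that "lines_strict" is worth trying.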

kamata commented 9 months ago

Thank you very much. I have confirmed the improvement in processing speed. Please add a description of the 'lines_strict' parameter to the manual.

kamata commented 9 months ago

Another problem occurs when I set the clip parameter.

With 1.23.12:

page.find_tables(clip=(1129.0, 627.0, 1287.0, 829.0))

This finds a table.

With 1.23.16:

page.find_tables(clip=(1129.0, 627.0, 1287.0, 829.0), strategy="lines_strict")

This raises an error:

edge["orientation"] = "h" if (line["top"] == line["bottom"]) else "v"
KeyError: 'top'

The line dicts end as follows (note the empty dict at the end):

{'x0': 170.080078125, 'y0': 1027.5579833984375, 'x1': 170.080078125, 'y1': 162.991943359375, 'width': 0.0, 'height': 864.5660400390625, 'pts': [(170.080078125, 162.9920654296875), (170.080078125, 1027.55810546875)], 'linewidth': 1, 'stroke': True, 'fill': False, 'evenodd': False, 'stroking_color': (0, 0, 0), 'non_stroking_color': None, 'object_type': 'line', 'page_number': 1, 'stroking_pattern': None, 'non_stroking_pattern': None, 'top': 162.9920654296875, 'bottom': 1027.55810546875, 'doctop': 162.9920654296875} {}

I changed make_line in table.py as follows:

def make_line(p, p1, p2, clip):
    """Given 2 points, make a line dictionary for table detection."""
    if not is_parallel(p1, p2):  # only accepting axis-parallel lines
        return {}
↓
def make_line(p, p1, p2, clip):
    """Given 2 points, make a line dictionary for table detection."""
    if not is_parallel(p1, p2):  # only accepting axis-parallel lines
        pass

With this change, the error no longer occurs.
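The traceback and the trailing {} in the line list suggest the mechanism: make_line returns an empty dict for a line that is not axis-parallel, and the later lookup of line["top"] fails on that empty dict. The sketch below reproduces this with plain dicts. It is an illustration only, not PyMuPDF's code and not the fix the maintainers adopted; edge_orientation and the sample values are hypothetical.

```python
def edge_orientation(line):
    """Classify an axis-parallel line dict as horizontal ("h") or vertical ("v")."""
    return "h" if line["top"] == line["bottom"] else "v"


# Hypothetical line list: one vertical line plus the empty dict that
# make_line returns for a segment that is not axis-parallel.
lines = [
    {"x0": 170.08, "x1": 170.08, "top": 162.99, "bottom": 1027.56},
    {},
]

# Looking up "top" on the empty dict raises KeyError, as in the report.
try:
    [edge_orientation(ln) for ln in lines]
    failed = False
except KeyError:
    failed = True

# Skipping empty dicts avoids the lookup on a missing key.
orientations = [edge_orientation(ln) for ln in lines if ln]
print(failed, orientations)
```

Unlike replacing return {} with pass, which lets non-axis-parallel lines continue through make_line, a guard of this kind drops them, which matches the intent stated in the original comment ("only accepting axis-parallel lines").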

JorjMcKie commented 9 months ago

@kamata - for the documentation issue: thank you, I am aware of this gap and already working on it.

Also thank you for reporting the error. This will be fixed in the next version.

Just a question: does this happen with "slow.pdf"? I cannot reproduce your error with it. Otherwise, please let me have the PDF for verification purposes.

jamie-lemon commented 9 months ago

Please note: the documentation for "lines_strict" has now been updated here: https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables

kamata commented 8 months ago

@JorjMcKie Here is the sample PDF: clip_test.pdf

@jamie-lemon Thank you for the documentation. I have read your article in Japanese.