Closed hieudx149 closed 10 months ago
This is a known issue, unfortunately introduced in the current version. Will be fixed in the next version.
@JorjMcKie Thank you, looking forward to the next version.
I would like to know approximately when the next version will be released. The 'find_tables' function is simply too slow.
Fixed in PyMuPDF-1.23.8.
1.23.15 is too late
We improved the speed in version 1.23.8 already.
What do you mean? Still too slow in 1.23.8?
Versions 1.23.8 through 1.23.12 are not slow, but 1.23.13 and later are slow.
Unfortunately, your 1-line comments cannot lead to actionable consequences. Please provide one example file against which we can run comparisons.
Thanks for letting us have an example. The problem here is that there are 6914 vector graphics of which 6898 are white and borderless.
These graphics can be ignored by using the detection strategy "lines_strict" instead of the default "lines". The default "lines" takes the border of every rectangle into account when detecting tables, even for rectangles that have no border.
This is the result when using "lines_strict":
import fitz
fitz.__version__
'1.23.16'
import time
mt=time.perf_counter
doc=fitz.open("slow.pdf")
page=doc[0]
t0=mt();tabs=page.find_tables(strategy="lines_strict");t1=mt()
t1-t0
0.32812639998155646
for tab in tabs:
    print("-"*20)
    for e in tab.extract():
        print(e)
    print()
--------------------
['', None, None]
['', '', '']
['', '', '']
--------------------
['']
['']
--------------------
['', '']
['', '']
['', '']
['', '']
['', '']
['', '']
page.rotation
0
graphics = page.get_drawings()
len(graphics)
6914
whitefills = [p for p in graphics if p["type"]=="f"]
len(whitefills)
6898
Three tables are detected; the other 2 drawings have no internal structure and are thus not regarded as tables.
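Note that the filter in the session above selects every fill-only path (type "f"), whatever its color. A minimal sketch of a stricter check that also tests for a white fill is below. It assumes each entry is a dictionary as returned by page.get_drawings(), with a "type" key and a "fill" key holding an RGB tuple or None. The helper names are hypothetical and not part of PyMuPDF.

```python
def is_white_borderless(path, tol=1e-6):
    """Return True for a fill-only path whose fill color is (near) white.

    Assumption: `path` is one dictionary from page.get_drawings(), where
    "type" is "f" (fill only), "s" (stroke only) or "fs" (both), and
    "fill" is an RGB tuple of floats in [0, 1] or None.
    """
    if path.get("type") != "f":  # anything stroked has a visible border
        return False
    fill = path.get("fill")
    if not fill:
        return False
    return all(abs(component - 1.0) <= tol for component in fill)


def count_white_borderless(drawings):
    """Count the paths that are filled white and have no stroked border."""
    return sum(1 for path in drawings if is_white_borderless(path))
```

Calling count_white_borderless(page.get_drawings()) on a slow page gives a quick hint whether "lines_strict" is likely to help.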
Thank you very much. I have confirmed the improvement in processing speed. Please add a description of the 'lines_strict' parameter to the manual.
Another problem occurs when I set the clip parameter.
With 1.23.12, page.find_tables(clip=(1129.0, 627.0, 1287.0, 829.0)) finds a table.
With 1.23.16, page.find_tables(clip=(1129.0, 627.0, 1287.0, 829.0), strategy="lines_strict") raises an error:
edge["orientation"] = "h" if (line["top"] == line["bottom"]) else "v"
KeyError: 'top'
The line dictionaries end as follows (the last one is empty):
{'x0': 170.080078125, 'y0': 1027.5579833984375, 'x1': 170.080078125, 'y1': 162.991943359375, 'width': 0.0, 'height': 864.5660400390625, 'pts': [(170.080078125, 162.9920654296875), (170.080078125, 1027.55810546875)], 'linewidth': 1, 'stroke': True, 'fill': False, 'evenodd': False, 'stroking_color': (0, 0, 0), 'non_stroking_color': None, 'object_type': 'line', 'page_number': 1, 'stroking_pattern': None, 'non_stroking_pattern': None, 'top': 162.9920654296875, 'bottom': 1027.55810546875, 'doctop': 162.9920654296875}
{}
I changed table.py as follows:
def make_line(p, p1, p2, clip):
    """Given 2 points, make a line dictionary for table detection."""
    if not is_parallel(p1, p2):  # only accepting axis-parallel lines
        return {}
↓
def make_line(p, p1, p2, clip):
    """Given 2 points, make a line dictionary for table detection."""
    if not is_parallel(p1, p2):  # only accepting axis-parallel lines
        pass
With this change, the error no longer occurs.
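To illustrate the failure mode: the KeyError appears when an empty dictionary returned for a non-axis-parallel line is later indexed with "top". A minimal sketch of one alternative to the workaround above is to skip empty results before assigning the orientation. Everything here is a simplified stand-in written for illustration (make_line_sketch, lines_to_edges, and an exact-equality is_parallel). It is not the code in table.py and not necessarily how the library resolves the problem.

```python
def is_parallel(p1, p2):
    """Simplified stand-in: True if the segment p1-p2 is horizontal or vertical."""
    return p1[0] == p2[0] or p1[1] == p2[1]


def make_line_sketch(p1, p2):
    """Hypothetical, reduced version of a line factory.

    Returns {} for lines that are not axis-parallel, mirroring the
    behavior quoted in the thread.
    """
    if not is_parallel(p1, p2):
        return {}
    x0, x1 = sorted((p1[0], p2[0]))
    top, bottom = sorted((p1[1], p2[1]))
    return {"x0": x0, "x1": x1, "top": top, "bottom": bottom}


def lines_to_edges(point_pairs):
    """Build edge dictionaries, skipping lines that produced an empty dict."""
    edges = []
    for p1, p2 in point_pairs:
        line = make_line_sketch(p1, p2)
        if not line:  # guard: avoids KeyError: 'top' on an empty dictionary
            continue
        edge = dict(line)
        edge["orientation"] = "h" if line["top"] == line["bottom"] else "v"
        edges.append(edge)
    return edges
```

Compared with replacing `return {}` by `pass`, this keeps diagonal lines out of table detection instead of letting them through.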
@kamata - for the documentation issue: thank you, I am aware of this gap and already working on it.
Also thank you for reporting the error. This will be fixed in the next version.
Just a question: does this happen with "slow.pdf"? I cannot reproduce your error with it. Otherwise, please let me have the PDF for verification purposes.
Please note: the documentation for "lines_strict" has now been updated here: https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables
@JorjMcKie This is a sample PDF: clip_test.pdf
@jamie-lemon Thank you for the documentation. I have read your article in Japanese.
While conducting experiments to extract tables from PDF files, I observed that the table-finding function (page.find_tables()) in PyMuPDF is relatively slow compared to PDFPlumber. Is this expected behavior?
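For anyone who wants to measure this on their own files, here is a minimal timing sketch in the spirit of the session earlier in this thread. It assumes `page` is a PyMuPDF page whose find_tables() accepts a `strategy` keyword and returns an object with a `tables` list. The helper name time_find_tables is hypothetical.

```python
import time


def time_find_tables(page, strategies=("lines", "lines_strict"), **kwargs):
    """Time page.find_tables() once per strategy.

    Returns a dict mapping strategy -> (seconds, number_of_tables).
    Extra keyword arguments (for example clip=...) are passed through.
    """
    results = {}
    for strategy in strategies:
        t0 = time.perf_counter()
        tabs = page.find_tables(strategy=strategy, **kwargs)
        elapsed = time.perf_counter() - t0
        results[strategy] = (elapsed, len(tabs.tables))
    return results
```

Typical use would be `time_find_tables(fitz.open("slow.pdf")[0])`. A single run per strategy is only a rough indication, so repeat it a few times before drawing conclusions.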