Closed: IronK77 closed this issue 3 months ago.
Could you provide an example of such a PDF?
This package is based on PyMuPDF as you know.
Large numbers of vector graphics on a page can cause slowdowns, because the to_markdown()
method inspects these objects when identifying text columns and tables.
Otherwise, time consumption is the same as in PyMuPDF itself. If you experience excessive runtimes despite omitting pages with many vector graphics and using table_strategy="lines_strict", then PyMuPDF itself will have the same problem with that document.
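One way to act on this advice when driving the extraction programmatically is to pre-filter pages by their vector-graphics count before handing the document to to_markdown(). A minimal sketch: the helper name and the threshold of 500 are arbitrary choices of mine, and the per-page counts would be gathered beforehand, e.g. with PyMuPDF's page.get_drawings():

```python
def pages_to_extract(drawing_counts, max_drawings=500):
    """Return 0-based page numbers whose vector-drawing count is acceptable.

    drawing_counts holds one integer per page, e.g. collected with PyMuPDF
    as [len(page.get_drawings()) for page in doc]. The threshold of 500 is
    an arbitrary example value, not a recommendation from the package.
    """
    return [i for i, n in enumerate(drawing_counts) if n <= max_drawings]
```

The resulting list could then be passed via the pages= argument of pymupdf4llm.to_markdown(), so graphics-heavy pages are simply left out of the conversion.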
There do indeed exist PDFs with an unfortunate internal structure that cause any extractor to run for a long time.
Closing for lack of a response over an extended period of time.
So it is me again.
I have found a small issue: for some PDFs the processing time can be too long (>30 min) and the job still does not finish. I will see if I can find some examples of this issue. I have tried limiting image processing, but it had no effect. Since it is totally fine to skip one or two PDFs in a 2000-PDF queue, I wonder if there could be a breaker setting, or whether I can write a breaker using other packages.
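Such a breaker can be built outside the package: run the extraction in a child process and kill it when it exceeds a time budget. A child process is required because work stuck inside C code (as in PyMuPDF) cannot be interrupted by Python-level signals or threads. A minimal sketch, where the helper name, the default timeout, and the Unix-only "fork" start method are my own assumptions:

```python
import multiprocessing as mp
import queue

def _worker(func, path, result_q):
    # Runs in the child process; ship the result back through the queue.
    result_q.put(func(path))

def extract_with_timeout(func, path, timeout_s=60.0):
    """Run func(path) in a child process; return None if it exceeds timeout_s.

    The 'fork' start method (Unix-only) lets func be any local callable
    without needing to be importable from a module.
    """
    ctx = mp.get_context("fork")
    result_q = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, path, result_q))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()   # kill the stuck extraction outright
        proc.join()
        return None        # caller logs and skips this PDF
    try:
        return result_q.get(timeout=1.0)
    except queue.Empty:
        return None
```

With pymupdf4llm this could be called as extract_with_timeout(pymupdf4llm.to_markdown, path, timeout_s=...), treating a None result as "skip this PDF and move on with the queue".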