The package sometimes stuck for too long

IronK77 commented 3 months ago

So it is me again.

I have found a small issue: for some PDFs the processing time can be very long (>30 min), and the conversion still does not finish. I will see if I can find some examples of this. I have tried the limit on image processing, but it has no effect. Since it is totally fine to skip one or two PDFs in a 2000-PDF queue, I wonder whether there could be a breaker setting, or whether I can write a breaker myself using other packages.
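For illustration, a breaker of this kind could be sketched with the standard library alone by running each conversion in a separate Python process and giving up after a time limit. This is only a sketch under assumptions: `run_with_timeout`, `convert_queue`, `CHILD_CODE` and the 300-second default are hypothetical names and values, not part of the package, and the child process assumes `pymupdf4llm` is installed in the same interpreter.

```python
import pathlib
import subprocess
import sys

# Code executed in the child process: convert one PDF and write the Markdown.
# Assumption: pymupdf4llm is importable in the same interpreter.
CHILD_CODE = """
import pathlib, sys
import pymupdf4llm
src, dst = sys.argv[1], sys.argv[2]
pathlib.Path(dst).write_text(pymupdf4llm.to_markdown(src), encoding="utf-8")
"""


def run_with_timeout(code, args, timeout_s):
    """Run `code` in a fresh Python process.

    Returns True only if the process exits with status 0 within `timeout_s`.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code, *args],
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing keeps running.
        return False
    return result.returncode == 0


def convert_queue(pdf_paths, out_dir, timeout_s=300):
    """Convert each PDF in turn; return the list of files that were skipped."""
    skipped = []
    for pdf in pdf_paths:
        dst = pathlib.Path(out_dir) / (pathlib.Path(pdf).stem + ".md")
        if not run_with_timeout(CHILD_CODE, [str(pdf), str(dst)], timeout_s):
            skipped.append(pdf)  # timed out or failed: move on to the next file
    return skipped
```

A separate process is used because a conversion that is stuck inside native code cannot be reliably interrupted from a thread in the same process, whereas a child process can simply be killed.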

dantetemplar commented 3 months ago

Could you provide an example of such a PDF?

JorjMcKie commented 3 months ago

This package is based on PyMuPDF, as you know. Very large numbers of vector graphics on a page can cause slowdowns, because the to_markdown() method inspects these objects when identifying text columns and tables.

Otherwise, time consumption is the same as in PyMuPDF itself. If you still see excessive runtimes despite omitting pages with many vector graphics and using table_strategy="lines_strict", then PyMuPDF itself will have the same problem with that document. There are indeed PDFs with an unfortunate internal structure that cause any extractor to run for a long time.
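As a rough sketch of the two mitigations mentioned above, one could count the vector graphics per page, leave out the pages above some threshold, and pass table_strategy="lines_strict". The helper names and the threshold of 5000 are hypothetical and chosen only for illustration; the sketch also assumes a recent PyMuPDF where `import pymupdf` works (older versions use `import fitz`). Note that counting the drawings takes time on heavy pages too.

```python
def pages_below_threshold(drawing_counts, max_drawings):
    """Return the 0-based page numbers whose drawing count is within the limit."""
    return [i for i, n in enumerate(drawing_counts) if n <= max_drawings]


def convert_skipping_heavy_pages(path, max_drawings=5000):
    """Convert a PDF to Markdown, omitting pages with many vector graphics.

    `max_drawings` is an illustrative value, not a recommended setting.
    """
    # Imported here so the helper above can be used without PyMuPDF installed.
    import pymupdf
    import pymupdf4llm

    doc = pymupdf.open(path)
    # get_cdrawings() returns the vector graphics found on a page.
    counts = [len(page.get_cdrawings()) for page in doc]
    pages = pages_below_threshold(counts, max_drawings)
    if not pages:
        return ""  # every page exceeded the limit, nothing to convert
    return pymupdf4llm.to_markdown(
        doc,
        pages=pages,
        table_strategy="lines_strict",
    )
```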

JorjMcKie commented 3 months ago

Closed for lack of reaction over an extended period of time.

pymupdf / RAG

The package sometimes stuck for too long #99