ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: Scan time increases quadratically with page count #1378

Closed. aliemjay closed this issue 2 weeks ago

aliemjay commented 3 months ago

When using OCRmyPDF with the --redo-ocr option, I noticed that the "Scanning contents" phase can become excessively long for large input files.

Surprisingly, the total processing time seems to scale non-linearly with the page count, making the estimated time-to-complete feature useless.

Here are the progress-bar timings when scanning a 1000-page file on a Core i5-12400 under v16.4.2:

$ ocrmypdf --redo-ocr in.pdf out.pdf

progress -> time -> diff
 20% -> 20.8s -> 20.8s
 40% -> 60.8s -> 40.0s
 60% -> 121s  -> 60.5s
 80% -> 200s  -> 78.7s
100% -> 300s  -> 100s

Upon investigation, I traced the issue back to https://github.com/ocrmypdf/OCRmyPDF/blob/5e1e2497abd92ac18caf4e6f7ec439dbe59aaf2a/src/ocrmypdf/pdfinfo/layout.py#L309

The inefficient implementation of PDFPage.get_pages appears to iterate over every preceding page before reaching pageno, so scanning all N pages one by one costs O(N^2) page visits in total.
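To make the cost concrete, here is a minimal sketch of that per-page access pattern. This is not OCRmyPDF code; it only assumes pdfminer.six's public PDFPage.get_pages, and 'in.pdf' is a placeholder for the test file.

from pathlib import Path

from pdfminer.pdfpage import PDFPage

def get_page(infile, pageno):
    with Path(infile).open('rb') as f:
        # pagenos=[pageno] only filters which pages are yielded; pdfminer
        # still iterates over pages 0..pageno internally before yielding.
        return next(PDFPage.get_pages(f, pagenos=[pageno], maxpages=0), None)

for pageno in range(1000):
    # Lookup N walks roughly N pages, so the loop does O(N^2) work overall.
    page = get_page('in.pdf', pageno)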

The following quick-and-dirty patch reduced the scan time from 300s to 10s for the same file:

# TODO: add the patch
jbarlow83 commented 3 months ago

Note that get_page_analysis runs in a worker thread/process context, so a fix that passes all tests may not be entirely straightforward.

aliemjay commented 3 months ago

Here is my patch; it passes all tests locally.

diff --git a/src/ocrmypdf/pdfinfo/layout.py b/src/ocrmypdf/pdfinfo/layout.py
index 83d5d32b..fbef6a0d 100644
--- a/src/ocrmypdf/pdfinfo/layout.py
+++ b/src/ocrmypdf/pdfinfo/layout.py
@@ -288,6 +288,23 @@ def patch_pdfminer(pscript5_mode: bool):
     else:
         yield

+class PageCache:
+    cache = {}
+
+    @classmethod
+    def get_cached(cls, infile: PathLike, pageno: int):
+        if infile in cls.cache:
+            for cache_pageno, cache_page in cls.cache[infile]:
+                if cache_pageno == pageno:
+                    return cache_page
+                elif cache_pageno > pageno:
+                    break
+            else:
+                return None
+
+        f = Path(infile).open('rb')
+        cls.cache[infile] = enumerate(PDFPage.get_pages(f))
+        return cls.get_cached(infile, pageno)

 def get_page_analysis(
     infile: PathLike, pageno: int, pscript5_mode: bool
@@ -306,8 +323,7 @@ def get_page_analysis(
     with patch_pdfminer(pscript5_mode):
         try:
             with Path(infile).open('rb') as f:
-                page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
-                page = next(page_iter, None)
+                page = PageCache.get_cached(infile, pageno)
                 if page is None:
                     raise InputFileError(
                         f"pdfminer could not process page {pageno} (counting from 0)."
aliemjay commented 3 months ago

For testing, this is a file with 10k blank pages: blank-10k.pdf

Real-world files need far fewer pages for this bug to become noticeable.

The test was done under v16.4.2 because v16.4.3 introduced an unrelated scan-time regression; I'll file a separate issue soon.

$ ocrmypdf blank-10k.pdf out.pdf --redo-ocr
jbarlow83 commented 2 months ago

I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed, which could be problematic for PyPy.

aliemjay commented 2 months ago

> I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed, which could be problematic for PyPy.

Can we add a new method, PageCache.clear, which closes all open files and is called somewhere in the pipeline once the scan completes (a sketch follows at the end of this comment)? I think this is the only way to fix it without disturbing the public API.

> this could be problematic for PyPy.

I think you mean PyPi, the repository, since the issue is cross-platform compatibility with Windows?
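For illustration, a minimal sketch of that clear idea, layered on the patch above. open_files is a hypothetical addition, since the patch as posted does not keep its file handles, and the out-of-order reopening logic is omitted for brevity.

from os import PathLike
from pathlib import Path

from pdfminer.pdfpage import PDFPage

class PageCache:
    cache = {}
    open_files = {}  # hypothetical addition: infile -> open file handle

    @classmethod
    def get_cached(cls, infile: PathLike, pageno: int):
        # Sequential access assumed for brevity; the full patch above also
        # handles out-of-order requests by reopening the file.
        if infile not in cls.cache:
            f = Path(infile).open('rb')
            cls.open_files[infile] = f
            cls.cache[infile] = enumerate(PDFPage.get_pages(f))
        for cache_pageno, cache_page in cls.cache[infile]:
            if cache_pageno == pageno:
                return cache_page
        return None

    @classmethod
    def clear(cls):
        # Called once when the scan phase completes: close every handle and
        # drop the iterators, leaving the public API untouched.
        for f in cls.open_files.values():
            f.close()
        cls.open_files.clear()
        cls.cache.clear()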

jbarlow83 commented 2 months ago

I think that PageCache needs to be attached to PdfInfo and created as a distinct object rather than relying on global state. PdfInfo should create a PageCache and pass it on to each PageInfo; after page information is gathered, the PageCache can be cleared (see the sketch at the end of this comment).

> PyPi?

No, PyPy, the alternative Python implementation, which ocrmypdf supports.
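A minimal sketch of that non-global shape. The PdfInfo/PageInfo wiring in the trailing comments is hypothetical; only the ownership and cleanup pattern is meant.

from pathlib import Path

from pdfminer.pdfpage import PDFPage

class PageCache:
    def __init__(self, infile):
        self._file = Path(infile).open('rb')
        self._pages = enumerate(PDFPage.get_pages(self._file))

    def get(self, pageno: int):
        # Sequential access consumes a single shared iterator exactly once.
        for n, page in self._pages:
            if n == pageno:
                return page
        return None

    def close(self):
        # Deterministic cleanup, so nothing depends on the garbage
        # collector (the PyPy concern raised above).
        self._file.close()

# Hypothetical ownership pattern inside PdfInfo:
# cache = PageCache(infile)
# try:
#     page_infos = [build_page_info(cache, n) for n in range(npages)]
# finally:
#     cache.close()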

dhdaines commented 2 months ago

> Upon investigation, I traced the issue back to
> https://github.com/ocrmypdf/OCRmyPDF/blob/5e1e2497abd92ac18caf4e6f7ec439dbe59aaf2a/src/ocrmypdf/pdfinfo/layout.py#L309
>
> The inefficient implementation of PDFPage.get_pages appears to iterate over every preceding page before reaching pageno, so scanning all N pages one by one costs O(N^2) page visits in total.

Hmm, this is a bug in pdfminer.six. As far as I know, one does have to traverse the page tree in order to get to a page with a particular index (I'll have to go check the spec), but the problem is that get_pages is carelessly implemented: it parses every page even when only a subset is requested for extraction.
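For reference, a from-memory paraphrase of the behavior being described, not the exact pdfminer.six source: the pagenos filter is applied only after each page object has already been constructed.

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage

def get_pages_paraphrase(fp, pagenos=None):
    doc = PDFDocument(PDFParser(fp))
    # create_pages() walks the page tree in document order.
    for pageno, page in enumerate(PDFPage.create_pages(doc)):
        if pagenos is not None and pageno not in pagenos:
            continue  # the page object was built, then discarded
        yield page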

dhdaines commented 2 months ago

Oh, well, it turns out that parsing the pages after traversing the page tree isn't any more expensive than just traversing the tree itself, as initializing a PDFPage object just reads the attributes from the page tree and constructs an object. So, your fix is the right one and this doesn't actually need a fix to pdfminer.six.

jbarlow83 commented 2 weeks ago

Fixed in f77f701a