Closed: aliemjay closed this issue 2 weeks ago.
Note that get_page_analysis runs in a worker thread/process context, so a fix that passes all tests may not be super straightforward.
Here is my patch; it passed all tests locally.
diff --git a/src/ocrmypdf/pdfinfo/layout.py b/src/ocrmypdf/pdfinfo/layout.py
index 83d5d32b..fbef6a0d 100644
--- a/src/ocrmypdf/pdfinfo/layout.py
+++ b/src/ocrmypdf/pdfinfo/layout.py
@@ -288,6 +288,23 @@ def patch_pdfminer(pscript5_mode: bool):
     else:
         yield
 
+class PageCache:
+    cache = {}
+
+    @classmethod
+    def get_cached(cls, infile: PathLike, pageno: int):
+        if infile in cls.cache:
+            for cache_pageno, cache_page in cls.cache[infile]:
+                if cache_pageno == pageno:
+                    return cache_page
+                elif cache_pageno > pageno:
+                    break
+            else:
+                return None
+
+        f = Path(infile).open('rb')
+        cls.cache[infile] = enumerate(PDFPage.get_pages(f))
+        return cls.get_cached(infile, pageno)
 
 def get_page_analysis(
     infile: PathLike, pageno: int, pscript5_mode: bool
@@ -306,8 +323,7 @@ def get_page_analysis(
     with patch_pdfminer(pscript5_mode):
         try:
             with Path(infile).open('rb') as f:
-                page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
-                page = next(page_iter, None)
+                page = PageCache.get_cached(infile, pageno)
                 if page is None:
                     raise InputFileError(
                         f"pdfminer could not process page {pageno} (counting from 0)."
For testing, here is a file with 10k blank pages: blank-10k.pdf
Real-world files need far fewer pages for this bug to be noticeable.
The test was done under v16.4.2, because v16.4.3 introduced an unrelated regression in scan times; I'll file a separate issue soon.
$ ocrmypdf blank-10k.pdf out.pdf --redo-ocr
I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed, which could be problematic for PyPy.
Can we add a new method, PageCache.clear, which closes all open files and is called somewhere in the pipeline upon completion of the scan? I think this is the only way to fix it without disturbing the public API.
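A minimal sketch of what that clear method could look like, assuming the cache is reworked to keep the open file handle alongside the page iterator (that tuple layout is my assumption, not part of the patch above):

class PageCache:
    # infile -> (open file handle, enumerate(PDFPage.get_pages(handle)))
    cache = {}

    @classmethod
    def clear(cls):
        # Close every handle opened by get_cached(), then drop all entries,
        # rather than relying on the garbage collector to release descriptors
        # (finalizers are not prompt on PyPy).
        for handle, _pages in cls.cache.values():
            handle.close()
        cls.cache.clear()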
> this could be problematic for PyPy.
I think you mean PyPI, the package repository, as the issue is cross-platform compatibility with Windows?
I think that PageCache needs to be attached to PdfInfo and created as a distinct object rather than using global state. PdfInfo should create a PageCache and pass it on to each PageInfo; after page information is gathered, PageCache can be cleared.
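As a sketch of that shape (the wiring is illustrative; the actual PdfInfo and PageInfo constructor signatures may differ):

from pathlib import Path
from pdfminer.pdfpage import PDFPage

class PageCache:
    """Per-document page cache owned by a PdfInfo instance; no global state."""

    def __init__(self, infile):
        self.infile = infile
        self._file = None
        self._pages = None  # lazily created enumerate(PDFPage.get_pages(...))

    def get_cached(self, pageno):
        # Pages are scanned in ascending order, so one forward pass over a
        # single shared iterator serves every lookup.
        if self._pages is None:
            self._file = Path(self.infile).open('rb')
            self._pages = enumerate(PDFPage.get_pages(self._file))
        for n, page in self._pages:
            if n >= pageno:
                return page if n == pageno else None
        return None

    def clear(self):
        # Called once all page information has been gathered.
        if self._file is not None:
            self._file.close()
        self._file = self._pages = None

PdfInfo would then create one PageCache per input file, hand it to each PageInfo it builds, and call clear() after the last page is processed.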
PyPI?
No, PyPy, the alternative Python implementation, which ocrmypdf supports.
> Upon investigation, I traced the issue back to https://github.com/ocrmypdf/OCRmyPDF/blob/5e1e2497abd92ac18caf4e6f7ec439dbe59aaf2a/src/ocrmypdf/pdfinfo/layout.py#L309
> The inefficient implementation of PDFPage.get_pages seems to iterate over all previous pages before getting to pageno. This theoretically means quadratic time complexity.
Hmm, this is a bug in pdfminer.six. As far as I know, one does have to traverse the page tree in order to get to a page with a particular index (I will have to go check the spec), but the problem is that the implementation of get_pages is quite lazy and actually parses all of the pages even when a subset of pages is specified for extraction.
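For reference, the relevant loop is shaped roughly like this (a paraphrase of pdfminer.six, not the verbatim source):

@classmethod
def get_pages(cls, fp, pagenos=None, maxpages=0, password='', caching=True):
    # Paraphrase: every page object is enumerated, and the pagenos filter is
    # applied only after the page has already been constructed.
    parser = PDFParser(fp)
    doc = PDFDocument(parser, password=password, caching=caching)
    for pageno, page in enumerate(cls.create_pages(doc)):
        if pagenos and pageno not in pagenos:
            continue
        yield page
        if maxpages and maxpages <= pageno + 1:
            break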
Oh, well, it turns out that parsing the pages after traversing the page tree isn't any more expensive than just traversing the tree itself, since initializing a PDFPage object just reads the attributes from the page tree and constructs an object. So your fix is the right one, and this doesn't actually need a fix to pdfminer.six.
Fixed in f77f701a
When using OCRmyPDF with the --redo-ocr option, I noticed that the "Scanning contents" phase can become excessively long for large input files. Surprisingly, the total processing time seems to scale non-linearly with the page count, making the estimated time-to-complete feature useless.
Here are the progress bar timings when scanning a 1000-page file on a Core i5-12400 under v16.4.2:
Upon investigation, I traced the issue back to https://github.com/ocrmypdf/OCRmyPDF/blob/5e1e2497abd92ac18caf4e6f7ec439dbe59aaf2a/src/ocrmypdf/pdfinfo/layout.py#L309
The inefficient implementation of PDFPage.get_pages seems to iterate over all previous pages before getting to pageno. This theoretically means quadratic time complexity. The quick and dirty patch shown earlier in this thread reduced the scan time from 300s to 10s for the same file.
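To make the non-linearity visible directly, here is a small timing sketch (assumes pdfminer.six is installed and a large test file such as the blank-10k.pdf above; numbers will vary by machine). The time for a single-page request grows roughly linearly with the page index, so scanning every page this way is quadratic overall:

from time import perf_counter
from pdfminer.pdfpage import PDFPage

# Each request walks all preceding pages, so per-request cost grows with the
# target index, and a whole-document scan done this way is O(n^2).
with open('blank-10k.pdf', 'rb') as f:
    for target in (100, 1000, 5000, 9000):
        start = perf_counter()
        page = next(PDFPage.get_pages(f, pagenos=[target]), None)
        print(f'page {target}: {perf_counter() - start:.2f}s')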