ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0
14.12k stars 1.02k forks source link

[Bug]: Scan time regression in 16.4.3 with `--redo-ocr` #1380

Closed aliemjay closed 2 weeks ago

aliemjay commented 2 months ago

While playing with #1378, I noticed that the initial scanning time with --redo-ocr regressed in v16.4.3 in this commit. The newest version takes ~15x longer to scan the same file even when #1378 is fixed locally.

I can't upload my local file, but the regression is noticeable with any file with size > 100MB and > 1000 pages. For reference sample-10k.pdf showed 10x regression. Starting from sample-100.pdf, it can be prepared with the following commands:

pdfunite $(for i in {1..10}; do echo sample-100.pdf; done) sample-1k.pdf
pdfunite $(for i in {1..10}; do echo sample-1k.pdf; done) sample-10k.pdf
aliemjay commented 2 months ago

To provide some context, this issue, along with #1378, resulted in a poor initial user experience for me. When working with a 1200-page file, the scan phase alone displayed an estimated time of 35 minutes, but it ended up taking over an hour to complete, when in reality it should have taken only a few seconds.

The issue is that pdfminer discards its buffer and then refills it every time seek is called (source) even if the the seek target is within the previous buffer. This implies a big performance hit when the file size and BUFSIZ are large enough.

@jbarlow83 how do you think this issue should be approached? One option is to wait for pdfminer/pdfminer.six#1030 to land upstream and then revert https://github.com/ocrmypdf/OCRmyPDF/commit/d35d00880687177c3d8de6371738e2d50bddda47. Another one is to lower BUFSIZ to a manageable value, like ~1MB, which I think is more feasible as mid term fix

jbarlow83 commented 2 months ago

The typical input PDF to ocrmypdf has a lot of images, but not many parseable content streams, which is the only time BUFSIZ comes into play anyway, so this isn't a typical use case.

My intention is to wait for the pdfminer.six patch to be released in pdfminer.six.

I advise installing ocrmypdf in a virtual environment and using pip to install a patched pdfminer.six and/or ocrmypdf in the meantime.

I'm not inclined to lower BUFSIZ, because it is the maximum value to read, and the token splitting issue is much more painful (wrong behavior and in bizarre, complex ways) than poor performance. I respect pdfminer.six but in the long term I will probably replace that library with something more robust.

dhdaines commented 2 months ago

I'm not inclined to lower BUFSIZ, because it is the maximum value to read, and the token splitting issue is much more painful (wrong behavior and in bizarre, complex ways) than poor performance. I respect pdfminer.six but in the long term I will probably replace that library with something more robust.

I've been curious about alternatives as well... I gather that pikepdf is not able to replace pdfminer.six since you would have done that already? :)

dhdaines commented 2 months ago

I really like the pure-Python-ness of pdfminer.six but it has a lot of quirks, some of which have to be worked around in less-than-robust ways. For its main use case most of its API is superfluous and could be redone in a much more intuitive/Pythonic style...

dhdaines commented 2 months ago

The issue is that pdfminer discards its buffer and then refills it every time seek is called (source) even if the the seek target is within the previous buffer. This implies a big performance hit when the file size and BUFSIZ are large enough.

Hmm. This is interesting - would my proposal to discard the lossy buffering entirely and simply read the entire stream into memory help or hurt in this case? If so I may just rewrite https://github.com/pdfminer/pdfminer.six/pull/1030 to do that...

aliemjay commented 2 months ago

Hmm. This is interesting - would my proposal to discard the lossy buffering entirely and simply read the entire stream into memory help or hurt in this case?

Hmm, that's an interesting thought. However, I don't support loading the entire stream into memory for two main reasons:

dhdaines commented 2 months ago
  • I believe the changes in your PR are suffiecient as it is. Increasing BUFSIZ is a temporary workaround that should be removed once your PR is merged.

Ah, I understand. Yes, basically the code in pdfminer.six is just very ill-suited to a very large BUFSIZ.

My other, less modest, proposal, is to rewrite the ~parser~ lexer to access the input stream directly using peek for lookahead, so that the buffering gets done by someone else (presumably io.BufferedReader). This is probably the ideal solution.

I searched for an existing lexer library for Python that will do this properly but can't seem to find one... ply definitely assumes all of the data has been read into memory.

jbarlow83 commented 2 months ago

@dhdaines In terms of pikepdf and pdfminer.six -- lately I'm experimenting with rust PDF libraries and in particular pdfium-render seems far beyond what anything in Python is capable in terms of text rendering and layout capabilities. I've had a goal of reducing the number of external binaries ocrmypdf depends on, since that makes installation difficult for less technical users, and generally speaking it looks like the rust ecosystem now has most of what I need. I could pretty much eliminate all mandatory dependencies (tesseract, ghostscript) and some of the optional ones (pngquant) while still allowing optional access to those. So time permitting that is probably the future.

qpdf doesn't have code to do what pdfminer.six does, so if I were to add it to pikepdf I'd mostly be copying from pdfminer.six.

I'm grateful for pdfminer.six but it is "mature" and dated in many ways.

I agree with the buffered reader approach -- separating buffering concerns from lexing and parsing is probably the right appraoch. pdfminer.six's lexer/parser is complex. After examining it a while, I ended up deciding to make BUFSIZ large rather than actually take on the burden of trying to fix it, so I appreciate your effort in improving it.

dhdaines commented 2 months ago

@dhdaines In terms of pikepdf and pdfminer.six -- lately I'm experimenting with rust PDF libraries and in particular pdfium-render seems far beyond what anything in Python is capable in terms of text rendering and layout capabilities.

Interesting - from what I can see it's exactly equivalent to pypdfium2, i.e. a runtime binding to pdfium. This is certainly good for rendering and also adding objects and annotations (I'm using it for that...). When I checked (in a discussion regaring pdfplumber's use of pdfminer.six) it didn't seem to give low-level access to object and layout information, though. This is what I mean by the "actual use case for pdfminer.six" that could be refactored out.

I think the rendering quality is somewhat better with poppler + cairo, though I think there aren't good Python bindings for them and the API is much more limited.

I agree with the buffered reader approach -- separating buffering concerns from lexing and parsing is probably the right appraoch. pdfminer.six's lexer/parser is complex. After examining it a while, I ended up deciding to make BUFSIZ large rather than actually take on the burden of trying to fix it, so I appreciate your effort in improving it.

I would say needlessly complex, even... the PostScript subset used in content streams is maybe the only simple and elegant part of the PDF specification :)

dhdaines commented 2 months ago

low-level access to object and layout information, though. This is what I mean by the "actual use case for pdfminer.six" that could be refactored out.

Sorry for spamming this issue with tangential discussions (I'm never sure where to take them though...) but one more question here - OCRmyPDF and pdfplumber seem to be the main users of pdfminer.six at this point, and it appears that they are both using essentially the same part of it, i.e. PDFLayoutAnalyzer and associated code, as well as access to the low-level structures (document catalog, page tree, etc.). Do you use anything else from it? Do you know, offhand, of other major users of pdfminer.six that use other stuff?

I think it's useful to have this functionality available in pure Python where it isn't hidden behind extra dependencies and abstractions. On the PDF generation side there is https://github.com/CourtBouillon/pydyf which fills a similar role. So (another question) would a library that does just those parts of pdfminer.six, with various mothballs removed, be useful to you? I'll discuss this with the pdfplumber and pdfminer.six maintainers...

jbarlow83 commented 1 month ago

pdfminer.six has a lot of users: https://github.com/pdfminer/pdfminer.six/network/dependents

I think it would be doable and an improvement to refactor pdfminer.six to have a backend, and then make that backend either pikepdf or pypdf, so that it retains the higher level features. I think that would be useful work that would remov duplication from pdfminer.six and likely eliminate PDF access bugs like the sort we're seeing.

pypdf is under active maintenance again and would be the way to go for a pure Python solution. It's come a long way recently, but I'm still skeptical it's as robust as qpdf (and thus pikepdf). qpdf does so much to quietly fix all kinds of malformed PDFs from obscure PDF generators, or just to support more obscure PDF structures.

Mainly this backend separation would amount to creating a class that reimplements pdftypes.py in pdfminer.six to obtain and convert objects differently.

Might also be worth checking if borb (AGPL-3, pure Python) has any interesting ideas in this area, but I think they focus a lot more on PDF writing.

dhdaines commented 1 month ago

Back to the original issue here. I did go and reimplement the "parser" (lexer) without buffering, and what I found was:

  1. This is about 15% faster on actual files since cPython's BufferedReader implementation is faster than the explicit buffering code (obviously this will vary quite a bit depending on the file, OS, filesystem, etc).
  2. It is also about 25% slower than the current parser on BytesIO objects, simply because it ends up calling read a lot more often, and there is method call overhead involved.

But, more interestingly, it turns out that the vast majority of "parsing" (lexing) in pdfminer.six is actually happening on BytesIO anyway, because most recent PDF files are storing nearly everything in object streams, xref streams, etc, which are usually compressed, and even if they aren't compressed, pdfminer.six will read them into memory entirely.

Short of using mmap to access the entire file, which has some portability issues, or decompressing streams on the fly (probably difficult to implement), the optimal solution is to remove the buffering, and have an alternate lexer for in-memory data that just uses a big regex.

dhdaines commented 1 month ago

Interested to see how the rewritten parser performs: https://github.com/pdfminer/pdfminer.six/pull/1041

fenugrec commented 1 month ago

After 6h of processing on one file and noticing what seemed like an exponential decrease in speed near the 700-page mark (in a 1300-page doc, 70 MB), I found this ticket. Until a 'proper fix', would it be a viable workaround to split the PDF (e.g. with pdftk), OCR every segment, then concatenate ? Or is it better to downgrade to a previous release

jbarlow83 commented 2 weeks ago

Fixed in f77f701a5