Closed aliemjay closed 2 weeks ago
To provide some context, this issue, along with #1378, resulted in a poor initial user experience for me. When working with a 1200-page file, the scan phase alone displayed an estimated time of 35 minutes, but it ended up taking over an hour to complete, when in reality it should have taken only a few seconds.
The issue is that pdfminer discards its buffer and then refills it every time `seek` is called (source), even if the seek target is within the previous buffer. This implies a big performance hit when the file size and `BUFSIZ` are large enough.
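To illustrate the fix being described, here is a minimal sketch (not pdfminer.six's actual code) of a reader whose `seek` reuses the existing buffer when the target already lies inside it, instead of discarding and refilling the buffer on every call. The class name and structure are illustrative assumptions:

```python
import io

BUFSIZ = 4096

class BufferReusingReader:
    """Illustrative only: seek() moves the cursor within the current
    buffer when possible, and only refills on a real miss."""

    def __init__(self, fp):
        self.fp = fp        # underlying binary file object
        self.bufpos = 0     # file offset where self.buf begins
        self.buf = b""      # current buffer contents
        self.charpos = 0    # cursor within self.buf

    def seek(self, pos):
        if self.bufpos <= pos < self.bufpos + len(self.buf):
            # Target is inside the current buffer: just move the cursor.
            self.charpos = pos - self.bufpos
        else:
            # Otherwise do a real seek and refill the buffer.
            self.fp.seek(pos)
            self.bufpos = pos
            self.buf = self.fp.read(BUFSIZ)
            self.charpos = 0

    def read(self, n):
        if self.charpos >= len(self.buf):
            # Buffer exhausted: refill from the current file position.
            self.bufpos += len(self.buf)
            self.buf = self.fp.read(BUFSIZ)
            self.charpos = 0
        data = self.buf[self.charpos:self.charpos + n]
        self.charpos += len(data)
        return data
```

With this shape, a backward `seek` into already-buffered data costs nothing, which is exactly the case (re-reading nearby tokens) that triggers the slowdown described above.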
@jbarlow83 how do you think this issue should be approached? One option is to wait for pdfminer/pdfminer.six#1030 to land upstream and then revert https://github.com/ocrmypdf/OCRmyPDF/commit/d35d00880687177c3d8de6371738e2d50bddda47. Another is to lower `BUFSIZ` to a manageable value, like ~1 MB, which I think is more feasible as a mid-term fix.
The typical input PDF to ocrmypdf has a lot of images, but not many parseable content streams, which is the only time `BUFSIZ` comes into play anyway, so this isn't a typical use case.
My intention is to wait for the patch to be released in pdfminer.six.
I advise installing ocrmypdf in a virtual environment and using pip to install a patched pdfminer.six and/or ocrmypdf in the meantime.
I'm not inclined to lower `BUFSIZ`, because it is the maximum value to read, and the token splitting issue is much more painful (wrong behavior and in bizarre, complex ways) than poor performance. I respect pdfminer.six but in the long term I will probably replace that library with something more robust.
> I'm not inclined to lower `BUFSIZ`, because it is the maximum value to read, and the token splitting issue is much more painful (wrong behavior and in bizarre, complex ways) than poor performance. I respect pdfminer.six but in the long term I will probably replace that library with something more robust.
I've been curious about alternatives as well... I gather that pikepdf is not able to replace pdfminer.six since you would have done that already? :)
I really like the pure-Python-ness of pdfminer.six but it has a lot of quirks, some of which have to be worked around in less-than-robust ways. For its main use case most of its API is superfluous and could be redone in a much more intuitive/Pythonic style...
> The issue is that pdfminer discards its buffer and then refills it every time `seek` is called (source), even if the seek target is within the previous buffer. This implies a big performance hit when the file size and `BUFSIZ` are large enough.
Hmm. This is interesting - would my proposal to discard the lossy buffering entirely and simply read the entire stream into memory help or hurt in this case? If so I may just rewrite https://github.com/pdfminer/pdfminer.six/pull/1030 to do that...
> Hmm. This is interesting - would my proposal to discard the lossy buffering entirely and simply read the entire stream into memory help or hurt in this case?
Hmm, that's an interesting thought. However, I don't support loading the entire stream into memory for two main reasons:
- It limits the usability for large files, which is particularly problematic for a common library (as opposed to a specific application).
- I believe the changes in your PR are sufficient as it is. Increasing `BUFSIZ` is a temporary workaround that should be removed once your PR is merged.
Ah, I understand. Yes, basically the code in pdfminer.six is just very ill-suited to a very large `BUFSIZ`.
My other, less modest, proposal is to rewrite the ~~parser~~ lexer to access the input stream directly using `peek()` for lookahead, so that the buffering gets done by someone else (presumably `io.BufferedReader`). This is probably the ideal solution.
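As a rough sketch of what that could look like, the lexer below uses `io.BufferedReader.peek()` for one-byte lookahead and leaves all buffering to the standard library. This is an assumption about the shape of the proposal, not code from the PR, and it treats delimiters as single-character tokens for brevity (real PDF lexing handles `<<`, strings, etc.):

```python
import io

DELIMITERS = b"()<>[]{}/%"
WHITESPACE = b"\x00\t\n\f\r "

def next_token(fp):
    """Return the next whitespace/delimiter-separated token, or b'' at EOF.

    Relies on fp being an io.BufferedReader, so peek() can look ahead
    without consuming bytes and without the lexer doing its own buffering.
    """
    # Skip leading whitespace.
    while True:
        c = fp.peek(1)[:1]
        if c and c in WHITESPACE:
            fp.read(1)
        else:
            break
    if not c:
        return b""            # EOF
    if c in DELIMITERS:
        return fp.read(1)     # simplification: one-byte delimiter tokens
    token = bytearray()
    while True:
        c = fp.peek(1)[:1]
        if not c or c in WHITESPACE or c in DELIMITERS:
            return bytes(token)
        token += fp.read(1)
```

Note that `peek()` may return more bytes than requested (it exposes the buffer), which is why the result is sliced to a single byte.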
I searched for an existing lexer library for Python that will do this properly but can't seem to find one... `ply` definitely assumes all of the data has been read into memory.
@dhdaines In terms of pikepdf and pdfminer.six -- lately I'm experimenting with rust PDF libraries and in particular pdfium-render seems far beyond what anything in Python is capable of in terms of text rendering and layout capabilities. I've had a goal of reducing the number of external binaries ocrmypdf depends on, since that makes installation difficult for less technical users, and generally speaking it looks like the rust ecosystem now has most of what I need. I could pretty much eliminate all mandatory dependencies (tesseract, ghostscript) and some of the optional ones (pngquant) while still allowing optional access to those. So time permitting that is probably the future.
qpdf doesn't have code to do what pdfminer.six does, so if I were to add it to pikepdf I'd mostly be copying from pdfminer.six.
I'm grateful for pdfminer.six but it is "mature" and dated in many ways.
I agree with the buffered reader approach -- separating buffering concerns from lexing and parsing is probably the right approach. pdfminer.six's lexer/parser is complex. After examining it a while, I ended up deciding to make `BUFSIZ` large rather than actually take on the burden of trying to fix it, so I appreciate your effort in improving it.
> @dhdaines In terms of pikepdf and pdfminer.six -- lately I'm experimenting with rust PDF libraries and in particular pdfium-render seems far beyond what anything in Python is capable of in terms of text rendering and layout capabilities.
Interesting - from what I can see it's exactly equivalent to pypdfium2, i.e. a runtime binding to pdfium. This is certainly good for rendering and also adding objects and annotations (I'm using it for that...). When I checked (in a discussion regarding pdfplumber's use of pdfminer.six) it didn't seem to give low-level access to object and layout information, though. This is what I mean by the "actual use case for pdfminer.six" that could be refactored out.
I think the rendering quality is somewhat better with poppler + cairo, though I think there aren't good Python bindings for them and the API is much more limited.
> I agree with the buffered reader approach -- separating buffering concerns from lexing and parsing is probably the right approach. pdfminer.six's lexer/parser is complex. After examining it a while, I ended up deciding to make `BUFSIZ` large rather than actually take on the burden of trying to fix it, so I appreciate your effort in improving it.
I would say needlessly complex, even... the PostScript subset used in content streams is maybe the only simple and elegant part of the PDF specification :)
> low-level access to object and layout information, though. This is what I mean by the "actual use case for pdfminer.six" that could be refactored out.
Sorry for spamming this issue with tangential discussions (I'm never sure where to take them though...) but one more question here - OCRmyPDF and pdfplumber seem to be the main users of pdfminer.six at this point, and it appears that they are both using essentially the same part of it, i.e. `PDFLayoutAnalyzer` and associated code, as well as access to the low-level structures (document catalog, page tree, etc.). Do you use anything else from it? Do you know, offhand, of other major users of pdfminer.six that use other stuff?
I think it's useful to have this functionality available in pure Python where it isn't hidden behind extra dependencies and abstractions. On the PDF generation side there is https://github.com/CourtBouillon/pydyf which fills a similar role. So (another question) would a library that does just those parts of pdfminer.six, with various mothballs removed, be useful to you? I'll discuss this with the pdfplumber and pdfminer.six maintainers...
pdfminer.six has a lot of users: https://github.com/pdfminer/pdfminer.six/network/dependents
I think it would be doable and an improvement to refactor pdfminer.six to have a backend, and then make that backend either pikepdf or pypdf, so that it retains the higher level features. I think that would be useful work that would remove duplication from pdfminer.six and likely eliminate PDF access bugs like the sort we're seeing.
pypdf is under active maintenance again and would be the way to go for a pure Python solution. It's come a long way recently, but I'm still skeptical it's as robust as qpdf (and thus pikepdf). qpdf does so much to quietly fix all kinds of malformed PDFs from obscure PDF generators, or just to support more obscure PDF structures.
Mainly this backend separation would amount to creating a class that reimplements pdftypes.py in pdfminer.six to obtain and convert objects differently.
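A hypothetical sketch of that backend separation: an abstract resolver interface covering roughly the role pdftypes.py plays today, with concrete implementations delegating to different object providers. All names here are invented for illustration; nothing like this exists in pdfminer.six yet:

```python
from abc import ABC, abstractmethod

class ObjectResolver(ABC):
    """Hypothetical interface for obtaining and converting low-level
    PDF objects -- roughly what pdftypes.py does in pdfminer.six."""

    @abstractmethod
    def resolve(self, ref):
        """Follow an indirect reference and return the underlying object."""

    @abstractmethod
    def get_stream_data(self, obj):
        """Return a stream's decoded (decompressed) data."""

class DictBackedResolver(ObjectResolver):
    """Trivial in-memory backend, handy for tests; a real backend would
    delegate to pikepdf or pypdf instead."""

    def __init__(self, objects):
        self.objects = objects

    def resolve(self, ref):
        # Plain (non-reference) objects resolve to themselves.
        return self.objects.get(ref, ref)

    def get_stream_data(self, obj):
        return bytes(obj)
```

The layout-analysis code would then be written against `ObjectResolver` only, so swapping the object layer never touches the lexer or the analyzer.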
Might also be worth checking if borb (AGPL-3, pure Python) has any interesting ideas in this area, but I think they focus a lot more on PDF writing.
Back to the original issue here. I did go and reimplement the "parser" (lexer) without buffering, and what I found was:

- The `BufferedReader` implementation is faster than the explicit buffering code (obviously this will vary quite a bit depending on the file, OS, filesystem, etc.).
- It is slower on `BytesIO` objects, simply because it ends up calling `read` a lot more often, and there is method call overhead involved.

But, more interestingly, it turns out that the vast majority of "parsing" (lexing) in pdfminer.six is actually happening on `BytesIO` anyway, because most recent PDF files are storing nearly everything in object streams, xref streams, etc., which are usually compressed, and even if they aren't compressed, pdfminer.six will read them into memory entirely.
Short of using mmap to access the entire file, which has some portability issues, or decompressing streams on the fly (probably difficult to implement), the optimal solution is to remove the buffering, and have an alternate lexer for in-memory data that just uses a big regex.
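The "big regex" idea for in-memory data can be sketched like this: one compiled pattern with named alternatives matches any token, so the lexer scans a `bytes` object with no per-byte I/O. The token classes are illustrative (hex strings, literal strings, and multi-byte edge cases are not handled), not pdfminer.six's actual grammar:

```python
import re

# One pattern; ambiguity (e.g. "12" vs "obj") is resolved by the
# ordering of the alternatives.
TOKEN = re.compile(rb"""
    (?P<ws>      [\x00\t\n\f\r ]+)                    # whitespace (skipped)
  | (?P<comment> %[^\r\n]*)                           # comment to end of line
  | (?P<number>  [-+]?\d*\.?\d+)                      # integer or real
  | (?P<name>    /[^\x00\t\n\f\r ()<>\[\]{}/%]*)      # name object
  | (?P<delim>   <<|>>|[()\[\]{}])                    # delimiters
  | (?P<keyword> [^\x00\t\n\f\r ()<>\[\]{}/%]+)       # obj, endobj, stream, ...
""", re.VERBOSE)

def tokenize(data):
    """Yield (kind, lexeme) pairs from an in-memory bytes object."""
    pos = 0
    while pos < len(data):
        m = TOKEN.match(data, pos)
        if m is None:
            pos += 1          # unrecognized byte: skip it
            continue
        pos = m.end()
        if m.lastgroup not in ("ws", "comment"):
            yield (m.lastgroup, m.group())
```

Because `re` does the scanning in C, this kind of lexer avoids the per-call `read` overhead that hurts the streaming version on `BytesIO`.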
Interested to see how the rewritten parser performs: https://github.com/pdfminer/pdfminer.six/pull/1041
After 6h of processing on one file and noticing what seemed like an exponential decrease in speed near the 700-page mark (in a 1300-page doc, 70 MB), I found this ticket. Until a proper fix, would it be a viable workaround to split the PDF (e.g. with pdftk), OCR every segment, then concatenate? Or is it better to downgrade to a previous release?
Fixed in f77f701a5
While playing with #1378, I noticed that the initial scanning time with `--redo-ocr` regressed in v16.4.3 in this commit. The newest version takes ~15x longer to scan the same file even when #1378 is fixed locally.

I can't upload my local file, but the regression is noticeable with any file with size > 100 MB and > 1000 pages. For reference, sample-10k.pdf showed a 10x regression. Starting from sample-100.pdf, it can be prepared with the following commands: