Combined PDFs revisited

RNCTX commented 6 years ago

I see this issue was mentioned here once a while back, and it seems that the user in question had a commercial project in mind which was not feasible.

But a merged text/image PDF is now built-in functionality in the 4.x versions of Tesseract, and much work has gone into improving that sort of output over there. There is a Python library called OCRmyPDF which integrates the same tool chain you guys are using (unpaper, Ghostscript, etc.).

This is all somewhat of an intersection of projects that solve different problems but not all problems.

I found your project after abandoning Nextcloud/owncloud because their PDF plugins rely on elasticsearch as a text backend store. That’s great except for the fact that elasticsearch’s input plugin for parsing files was designed for log files, not PDF files, and has a low file size limit (around 30mb iirc).
While elasticsearch may be big on (speed) performance it’s not a relational DB so it’s small in other ways. It comes with the usual NoSQL gotchas in terms of locking states with multiple users. While paperless is SQLite out of the box it could just as easily be a real DB, which is a huge plus.
If/when a new UI happens outside of the scope of the Django admin panel, it makes sense to integrate PDF.js as a viewer, which means selectable text is an obvious feature people will want. Since tesseract can output both by simply varying CLI options (merged and plain text), there’s little reason not to do so. Tesseract has done most of the hard work in all of this stuff, it’s just a matter of using functionality that is already there in a different Python wrapper that already exists.
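To make the "varying CLI options" point concrete, here is a minimal sketch of asking Tesseract for both outputs in one pass. It assumes a `tesseract` 4.x binary on the PATH with the stock `pdf` and `txt` config files installed; the helper names `tesseract_command` and `run_tesseract` are made up for illustration and are not part of paperless or pyocr.

```python
import subprocess


def tesseract_command(image, output_base, searchable_pdf=True, plain_text=True, lang="eng"):
    """Build a tesseract invocation that can emit a merged PDF and/or plain text.

    Hypothetical helper for illustration. The trailing 'pdf' and 'txt' words are
    Tesseract config file names, which must come after the options.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if searchable_pdf:
        cmd.append("pdf")  # writes <output_base>.pdf (image plus hidden text layer)
    if plain_text:
        cmd.append("txt")  # writes <output_base>.txt
    return cmd


def run_tesseract(image, output_base, **kwargs):
    """Run the command; raises CalledProcessError if tesseract exits non-zero."""
    subprocess.run(tesseract_command(image, output_base, **kwargs), check=True)


if __name__ == "__main__":
    # Only prints the command, so this runs even without tesseract installed.
    print(" ".join(tesseract_command("page.png", "page")))
```

The plain text could still go into the database as it does today, while the PDF is what a viewer with selectable text would serve.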
OCRmyPDF uses MuPDF for the initial render, which I have found to produce superior results in terms of generating OCR-ready files. MuPDF and its Python wrapper are AGPL-licensed (not sure if you care about that?).

Big glaring ‘gotcha’:

OCRmyPDF says it is Python3 only. Thoughts on that?

ddddavidmartin commented 6 years ago

I very much like the idea of having a generated text layer in scanned PDFs!

RNCTX commented 6 years ago

I'll post some background reading on why I'm going to work on replacing the existing library with OCRmyPDF on my installation regardless. It's good info for anyone who is working on handling PDF files with automated open-source tools...

The catch with PDF text layers is that they have heuristic spacing. This has always been part of the PDF spec and there is no way to simply 'turn it off', so the quality of PDF text layer rendering is not consistent.

For the end user: while paperless might save OCR'd plain text to the database perfectly, once a user say... opens a PDF with a text layer in their browser and searches through an individual document in their local PDF renderer with ctrl-F, that's where things get more complex.

At the moment, in the desktop GUI world there are basically four competing renderers which will give varying results. PDF.js (Mozilla's, included in Firefox), PDFium (Google's, included in Chrome), Adobe's (included in Acrobat), and Apple's (included in OSX Preview, Safari, and iOS).

This issue from Tesseract a while back illustrates the problems that arise when a PDF with a text layer goes into/through/out-of varying Unix/Linux tools that modify it. There are some samples in that thread which show how wildly the quality of the text layer output varies.

This issue is coming up on the PDF.js radar as well, since unfortunately theirs (and by extension a lot of others, since PDF.js is the easiest to implement into third-party projects) is the worst in terms of rendering spacing and line breaks properly.

The current status and short answer is:

Using the right toolchain it is possible to get good results in OSX/iOS despite Apple's PDF text renderer being not-very-good. Adobe's and Google's renderers, on the other hand, are very comparable and by far the best of the bunch. PDF.js is the worst of the four and there is no way to get consistently good results with it, but the issue is on their radar (linked above) and hopefully will get some work in the near future.

With all of this in mind, the toolchain used in OCRmyPDF produces the best results that I have found for OSX/Safari/iOS, and by extension the best results period. Good performance from Adobe's and Google's renderers is a given, and bad performance from PDF.js is a given, so Apple's is the only one that can feasibly be improved as of right this minute.

RNCTX commented 6 years ago

Here are some samples of the current performance going in on a bad source document:

centuryiib-origscan.pdf

As it says this is a scan of an aircraft's autopilot maintenance manual from 1974, originally printed on unknown equipment (probably a typewriter), and scanned rather badly with a lot of skew and either a dirty scanner or (more likely) dirty pages.

The default settings in paperless for pyocr fail to deskew it (much?) with "out of deviation range - NO ROTATING" and a sample of the resulting text is below (errors in bold)...

12~10-74 Rev. . 3-01-76 Rev. SECTION TV. 3-22-76 Rev. 5-18-76 Rev. GROUND CHECKS AND FLIGHT ADJUSTMENT PROCEDURES CENTURY IIB AUTQPILOTS Drawing No. 694911-1 The Century IIB Autopilot is an "Open Loop’ system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin.

GROUND CHECKS: lL. 10. 68593 Remove console face plate by removing the roll knob and the two face plate mounting screws that are exposed. After removing the face plate, reinstall the roll knob. Start aircraft engine to obtain gyro stability. Adjust vacuum regulator to obtain 4.5 to 5.0" vacuum. Center roll knob. Rotate aircraft control wheel to level flight (neutral) position. Push A/P ON/OFF switch ON. Move control wheel right and left to check servo engagement and that servo can be overridden.

The same file via OCRmyPDF deskews properly, here is a copy/paste of unprocessed text from the same document OCR'd by OCRmyPDF with unpaper cleaning the input via OCRmyPDF's defaults (errors in bold)...

12-10-74 Rev. 3-01-76 Rev. SECTION IVY. 3-22-76 Rev. 5-18-76 Rev. GROUND CHECKS AND FLIGHT ADJUSTMENT PROCEDURES CENTURY ITB AUTOPILOTS Drawing No. 69AG11-1 The Century IIB Autopilot is an "Open Loop’ system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin. GROUND CHECKS: lL. Remove console face plate by removing the roll knob and the two face plate mounting screws that are exposed. After removing the face plate, reinstall the roll knob. Start aircraft erdgine to obtain gyro stability. Adjust vacuum regulator to obtain 4.5 to 5.0" vacuum. Center rell knob. Rotate aircraft control wheel to level flight (neutral) position. Push A/P ON/OFF switch ON. Move control wheel right and left to check servo engagement and that servo can be overridden.

Disregarding the spacing, since it hasn't been through your re.sub space-stripping routine, quality is a push. In terms of search functionality, the more aggressive de-skew cost two misspelled words, but it produced a fully de-skewed end result rather than a partially de-skewed one. It also gave better detection of punctuation 'at the edges' (the dates in the document header).
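For anyone who wants to repeat this comparison on their own scans, here is a rough sketch of how the two outputs could be lined up. It collapses whitespace with `re.sub` first (an assumption about what a space-stripping routine does, not a copy of the one in paperless) and then uses the standard library's `difflib` to list the words where the two OCR results disagree. Both function names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def differing_words(a, b):
    """Return (words_from_a, words_from_b) pairs wherever the two texts disagree."""
    a_words = normalize_whitespace(a).split(" ")
    b_words = normalize_whitespace(b).split(" ")
    matcher = SequenceMatcher(None, a_words, b_words, autojunk=False)
    return [
        (a_words[i1:i2], b_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Short fragments taken from the two samples above, with extra spacing added.
    first = "Center roll knob. Start aircraft engine"
    second = "Center  rell knob.\nStart aircraft erdgine"
    for left, right in differing_words(first, second):
        print(left, "->", right)
```

This only shows where two outputs differ, not which one is right, so it still needs a human looking at the original page.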

RNCTX commented 6 years ago

And perhaps most importantly, OCRmyPDF has three different rendering options, four different PDF output options (regular or all flavors of PDF/A), both pre- and post-'clean' options, pre-upscaling, and output compression for PDF/A. It also handles split/join itself, so it wants PDF input and output.
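As a rough idea of what driving those options from Python might look like, here is a sketch that only assembles an `ocrmypdf` command line. The flags used (`--deskew`, `--clean`, `--clean-final`, `--oversample`, `--output-type`, `--pdf-renderer`) come from OCRmyPDF's CLI, but the accepted values differ between versions, so check `ocrmypdf --help` on your install. The function name and its defaults are hypothetical choices for illustration.

```python
def ocrmypdf_command(
    input_pdf,
    output_pdf,
    deskew=True,
    clean=True,
    clean_final=False,
    oversample_dpi=None,
    output_type="pdfa",
    pdf_renderer=None,
):
    """Assemble an ocrmypdf invocation as an argument list (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if deskew:
        cmd.append("--deskew")  # straighten pages before OCR
    if clean:
        cmd.append("--clean")  # run unpaper on the image that is fed to OCR
    if clean_final:
        cmd.append("--clean-final")  # also keep the cleaned image in the output
    if oversample_dpi is not None:
        cmd += ["--oversample", str(oversample_dpi)]  # upscale low-DPI input first
    if output_type is not None:
        cmd += ["--output-type", output_type]
    if pdf_renderer is not None:
        cmd += ["--pdf-renderer", pdf_renderer]
    cmd += [str(input_pdf), str(output_pdf)]
    return cmd


if __name__ == "__main__":
    # Prints the command only; pass the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(ocrmypdf_command("scan.pdf", "scan-ocr.pdf", oversample_dpi=300)))
```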

More options + simplified interface ;).

danielquinn commented 6 years ago

First off, apologies for the late reply. Ironically, I've been using Paperless a lot lately, while I apply for my visa to continue living here. Now that that's out of the way though, I hope to have more weekends for this project.

OCRmyPDF sounds pretty promising, and it's being actively developed, which is nice. Paperless only supports Python 3, so that's not a problem, and the project is GPL, not AGPL, but I believe the two are compatible anyway.

Currently, we're using pyOCR, which is just a simple wrapper around Tesseract, but I'm not married to that library if we can get more with a different one.

@RNCTX, if the idea of adapting Paperless to use OCRmyPDF, and by extension allowing us to have a formatted text layer in the scanned documents, is the sort of thing that turns your crank, I am happy to field that pull request! I just want to be clear that these changes would work on a Linux system too, correct? This isn't a Mac-only thing, right?

Feel free to open a pull request prefixed with WIP: and ask questions if you have any. I'm happy to help with guidance for implementation, but I don't have the time to write all of that code myself at the moment.

This does sound promising though, and it sounds like you've got a lot more background on this subject than me anyway :-)

RNCTX commented 6 years ago

Hey Dan, thanks for the reply. Yes, I've been tinkering with PDF processing for a while, but I need something that works at a larger scale than single documents. Yours is one of the open-source framework alternatives that I was looking at for getting text not just into files but into a database that's searchable at a greater-than-file level. There is a community that seems to stem from Tesseract into various other tools for scripting/batch processing with Tesseract more efficiently. OCRmyPDF is of that community.

No worries on the lateness, I have a mountain of failed prospects for what I'm looking to do sitting in a folder from this past month.

As pointed out by a few others before me who are doing this very thing (for instance the Ambar project, their blog is a good source of 'gotchas' with file processing), I'm not sure after looking under the hood at a lot of this stuff that I could implement it within the standards of another project. I've got so many outlier examples of files that break things when you try to process them that I would be terrified to send a merge to someone else and have their entire community complaining a week later about how their document server crashes all the time (only to find out that an oddball file that appears benign in any desktop OS renderer is to blame).

As an example, I found a PDF made with QuarkXPress in the late 1990s that is 24 MB for an 836-page document, and when each page is extracted with any open-source tool currently available, each page is also 24 MB. What a mess.

The best utility I have found for handling all of these possible cases is Apache Tika, so I've already gotten to work on a framework that employs Solr as the data store.

Sorry for costing you reading time! I'm already going a different way.

danielquinn commented 6 years ago

> I've got so many outlier examples of files that break things when you try to process them that I would be terrified to send a merge to someone else and have their entire community complaining a week later about how their document server crashes all the time (only to find out that an oddball file that appears benign in any desktop OS renderer is to blame).

This is the story of my life since I started this beast in 2015 :-) No worries. I'll close this for now, but if you find a means of doing some semblance of this in a uniform fashion, I'd love to see it. If not as a pull request, at least as a code snippet that I might try to adapt to this project.

the-paperless-project / paperless

Combined PDFs revisited #365