update Tesseract - Githubissues

GerHobbelt commented 4 years ago

https://github.com/tesseract-ocr/tesseract

GerHobbelt commented 4 years ago

Also consider offloading this to an external app entirely (as I have used different OCR applications in the past to cope with PDFs which the then-Tesseract/Qiqqa versions couldn't OCR properly).

See https://github.com/jbarlow83/OCRmyPDF for one example of this (which I encountered by way of https://tex.stackexchange.com/questions/11307/is-it-possible-to-produce-a-pdf-with-un-copyable-text while browsing around (La)TeX matters on a lazy afternoon).

IOW: see if we can get away with an entirely external OCR process which can deliver OCR'd/textualized PDF files for Qiqqa to process, so that Qiqqa can still make mark & copy available as before (every word is indexed in Lucene with its box coordinates, i.e. position info, to help users find where in the PDF the sought phrase is located).
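A rough sketch of what such an external step could look like, assuming OCRmyPDF is installed and on the PATH. The helper names (`build_ocrmypdf_command`, `ocr_externally`) are made up for illustration, and Python is used only for brevity; Qiqqa itself would do the equivalent from C#.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_ocrmypdf_command(src: Path, dst: Path, skip_text: bool = True) -> List[str]:
    """Build the argument list for an external OCRmyPDF run (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if skip_text:
        # --skip-text leaves pages that already carry a text layer untouched.
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def ocr_externally(src: Path, dst: Path) -> bool:
    """Run OCRmyPDF if it is available; return True when it exits cleanly."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed: the caller has to fall back to something else
    result = subprocess.run(build_ocrmypdf_command(src, dst), check=False)
    return result.returncode == 0
```

The resulting PDF would then still have to go through the usual text+coordinates extraction so the words land in the Lucene index with their positions.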

GerHobbelt commented 4 years ago

I'm learning something every day...

QiqqaOCR (at the time of this writing) already does something similar: Qiqqa attempts to use pdfdraw.exe -tt first to dump the text+coordinates per word from a given PDF, a.k.a. QiqqaOCR 'GROUP' mode.

When that doesn't fly, it uses Sorax PDF render library + custom region detection logic (#135; b0rk b0rk b0rk) + Tesseract v2 to perform an OCR action which also delivers words+coordinates for the given page, a.k.a. QiqqaOCR 'SINGLE' mode.
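The two-stage behaviour described above (try plain text extraction first, fall back to OCR when that yields nothing) boils down to a small fallback chain. The sketch below is not QiqqaOCR's actual code: the `Word` tuple, the `extract_words` function and the extractor callables are assumptions standing in for the real pdfdraw and Tesseract steps.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# (text, left, top, width, height): one word plus its bounding box on the page
Word = Tuple[str, float, float, float, float]
Extractor = Callable[[str], List[Word]]


def extract_words(
    pdf_path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], List[Word]]:
    """Try each extractor in order; return the name and output of the first that finds words."""
    for name, extractor in extractors:
        try:
            words = extractor(pdf_path)
        except Exception:
            # A crashing extractor counts as "didn't fly"; move on to the next one.
            continue
        if words:
            return name, words
    return None, []
```

Called with something like `[("GROUP", text_layer_extractor), ("SINGLE", ocr_extractor)]`, this keeps the cheap path first and only pays for OCR when the PDF has no usable text layer.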

There's a NuGet package for Tesseract and C#, which would be a migration/upgrade path for the current antiquated Tesseract v2, but that website states it's for Tesseract v3 only (though there's apparently a 4.0 beta too: https://github.com/charlesw/tesseract/issues/428) and I'd rather ride the bleeding edge with Tesseract 5, so it's gonna be command-line work instead, I guess.
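For the command-line route, one way to get words plus box coordinates is Tesseract's TSV output (the `tsv` config, available in recent Tesseract versions). A minimal sketch, assuming a `tesseract` binary on the PATH and a page already rendered to an image; `WordBox`, `parse_tesseract_tsv` and `ocr_page_image` are illustrative names, not anything Qiqqa ships.

```python
import csv
import io
import subprocess
from typing import List, NamedTuple


class WordBox(NamedTuple):
    text: str
    left: int
    top: int
    width: int
    height: int
    conf: float


def parse_tesseract_tsv(tsv_text: str) -> List[WordBox]:
    """Keep only the word-level rows (level 5) of Tesseract's TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue  # page/block/paragraph/line rows carry no word text
        words.append(
            WordBox(
                text,
                int(row["left"]),
                int(row["top"]),
                int(row["width"]),
                int(row["height"]),
                float(row["conf"]),
            )
        )
    return words


def ocr_page_image(image_path: str, lang: str = "eng") -> List[WordBox]:
    """Run the tesseract CLI on one page image and return its word boxes."""
    completed = subprocess.run(
        # "stdout" as the output base makes tesseract write the TSV to stdout.
        ["tesseract", image_path, "stdout", "-l", lang, "tsv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_tesseract_tsv(completed.stdout)
```

The coordinates are in image pixels, so they would still need scaling back to PDF page units before being stored next to the words in the index.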

And then, totally off topic of course, there is my intent to run PDFs through other OCR engines (as an alternative to Tesseract) such as ABBYY FineReader and ReadIris, as those are the ones I use on a more regular basis.

References / stuff I looked at while looking at Tesseract migration

https://github.com/charlesw/tesseract/#user-content-tesseract-language-data
https://github.com/tesseract-ocr/tesseract#user-content-brief-history
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://github.com/UB-Mannheim/tesseract/wiki
https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ
https://github.com/charlesw/tesseract/issues/428
http://www.mythoughtspot.com/2015/01/06/pdf-to-tiff-to-txt-bash-script-automation/ (TIFF can be multipage, hence a single run is all it needs to produce an A/PDF using Tesseract)
https://github.com/LeoFCardoso/pdf2pdfocr
https://github.com/tesseract-ocr/tesseract/wiki/User-Projects-%E2%80%93-3rdParty
https://github.com/itext/itext7-dotnet (via https://www.codingame.com/playgrounds/10058/scanned-pdf-to-ocr-textsearchable-pdf-using-c )
https://www.codeproject.com/Articles/1303061/Convert-all-files-to-searchable-PDFs (nice! Script also converts Office docs to PDF)
http://guides.library.illinois.edu/c.php?g=347520&p=4121426
https://dantonnoriega.github.io/ultinomics.org/post/2016-03-29-pdf-text-convert-ocr-tesseract.html + https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode (these describe the woes around searchable text which contains ligatures: since Tesseract recognizes ligatures, this info is handy to have for preprocessing both the OCR text and the user searches hitting our Lucene index!)
http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
https://csharp.hotexamples.com/examples/Tesseract/TesseractEngine/-/php-tesseractengine-class-examples.html
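
On the ligature point in the list above: one possible preprocessing step is Unicode compatibility normalization (NFKC), applied identically to the OCR text and to the search query before either touches the index. This is a sketch of the idea in Python, not what Qiqqa currently does; `normalize_for_index` is a made-up name.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Expand compatibility ligatures (e.g. U+FB01 'fi') and fold case."""
    # NFKC replaces presentation-form ligatures such as U+FB01 and U+FB03 with
    # their plain letter sequences; casefold() then makes matching case-insensitive.
    return unicodedata.normalize("NFKC", text).casefold()


if __name__ == "__main__":
    # "\ufb01nal o\ufb03ce" is "final office" written with the fi and ffi ligatures.
    print(normalize_for_index("\ufb01nal O\ufb03ce"))
```

Note that NFKC leaves letters like æ and œ alone, since Unicode treats them as letters in their own right rather than compatibility ligatures; those would need an explicit mapping table if searches for "ae"/"oe" should match them.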

GerHobbelt commented 4 years ago

As written in #135: upgrading to latest Tesseract implies:

Such a migration would of course impact the installer: maybe we should add code there to download the Tesseract installer and install it alongside Qiqqa — at least that would be the least size-increasing approach for the installer.

jimmejardine / qiqqa-open-source

update Tesseract #35
