jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

update Tesseract #35

Open GerHobbelt opened 4 years ago

GerHobbelt commented 4 years ago

https://github.com/tesseract-ocr/tesseract

GerHobbelt commented 4 years ago

Also consider offloading this to an external app entirely (as I have used different OCR applications in the past to cope with PDFs which the then-Tesseract/Qiqqa versions couldn't OCR properly).

See https://github.com/jbarlow83/OCRmyPDF for one example of this (which I encountered by way of https://tex.stackexchange.com/questions/11307/is-it-possible-to-produce-a-pdf-with-un-copyable-text while browsing around (La)TeX matters on a lazy afternoon).

IOW: see if we can get away with an entirely external OCR process which delivers OCR'ed/textualized PDF files for Qiqqa to process, so that Qiqqa can still make mark & copy available as before. (Every word is indexed in Lucene with its box coordinates, i.e. position info, to help users find where in the PDF the sought phrase is located.)
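A rough sketch of what handing a PDF to such an external OCR process could look like, using the OCRmyPDF command line as the example tool. The function names are made up for illustration, and this assumes `ocrmypdf` is installed and on the PATH; nothing like this exists in Qiqqa today.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_ocrmypdf_command(src: Path, dst: Path, language: str = "eng") -> List[str]:
    """Build the command line for one external OCRmyPDF run (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--skip-text",  # leave pages that already carry a text layer untouched
        "-l", language,  # OCR language passed through to Tesseract
        str(src),
        str(dst),
    ]


def ocr_pdf_externally(src: Path, dst: Path, language: str = "eng") -> bool:
    """Run OCRmyPDF when available; return True if an output PDF was produced."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not installed: the caller has to fall back to something else
    result = subprocess.run(
        build_ocrmypdf_command(src, dst, language),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and dst.exists()
```

The resulting PDF would then go through the regular text extraction path, which is where the per-word box coordinates would have to come from.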

GerHobbelt commented 4 years ago

I'm learning something every day...

QiqqaOCR (at the time of this writing) already does something similar: Qiqqa attempts to use pdfdraw.exe -tt first to dump the text+coordinates per word from a given PDF, a.k.a. QiqqaOCR 'GROUP' mode.

When that doesn't fly, it uses Sorax PDF render library + custom region detection logic (#135; b0rk b0rk b0rk) + Tesseract v2 to perform an OCR action which also delivers words+coordinates for the given page, a.k.a. QiqqaOCR 'SINGLE' mode.
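The two-stage approach described above (try the embedded text first, OCR only when that fails) boils down to a simple fallback. Below is a minimal sketch of that idea, not the actual QiqqaOCR code: `WordBox`, `extract_page_words` and the two callables are hypothetical stand-ins for the pdfdraw-based extraction and the Tesseract-based OCR.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class WordBox(NamedTuple):
    """One word plus its bounding box on the page (hypothetical structure)."""
    text: str
    left: float
    top: float
    width: float
    height: float


# An extractor takes (pdf_path, page_number) and returns word boxes,
# or None / an empty list when it could not get any text for that page.
Extractor = Callable[[str, int], Optional[List[WordBox]]]


def extract_page_words(
    pdf_path: str,
    page: int,
    extract_text_layer: Extractor,
    run_ocr: Extractor,
) -> Tuple[str, List[WordBox]]:
    """Try the cheap text-layer extraction first, fall back to OCR otherwise."""
    try:
        words = extract_text_layer(pdf_path, page)
    except Exception:
        words = None  # treat a crashing extractor the same as "no text found"
    if words:
        return "text-layer", words
    return "ocr", run_ocr(pdf_path, page) or []
```

Keeping both stages behind the same signature means a different OCR engine could be swapped in for the second stage without touching the caller.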

There's a NuGet package for Tesseract and C#, which would be a migration/upgrade path away from the current antiquated Tesseract v2, but that website states it's for Tesseract v3 only (though there's apparently a 4.0 beta too: https://github.com/charlesw/tesseract/issues/428). I'd rather ride the bleeding edge with Tesseract 5, so it's gonna be command-line work instead, I guess.
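To give an idea of what that command-line work might involve: recent Tesseract versions can emit a tab-separated table with one row per detected element, including the word box and confidence, via the `tsv` output config. The sketch below parses that table into words+coordinates. `OcrWord`, `parse_tesseract_tsv` and `ocr_image_to_words` are names invented for this example, and it assumes a `tesseract` executable on the PATH plus a page already rendered to an image.

```python
import csv
import io
import subprocess
from typing import List, NamedTuple


class OcrWord(NamedTuple):
    """A recognized word with its pixel bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int
    confidence: float


def parse_tesseract_tsv(tsv_text: str) -> List[OcrWord]:
    """Keep only word-level rows (level 5) that carry non-empty text."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row.get("level") != "5" or not text:
            continue  # page/block/paragraph/line rows, or empty words
        words.append(
            OcrWord(
                text,
                int(row["left"]),
                int(row["top"]),
                int(row["width"]),
                int(row["height"]),
                float(row["conf"]),
            )
        )
    return words


def ocr_image_to_words(image_path: str, language: str = "eng") -> List[OcrWord]:
    """Run the tesseract CLI on one page image and parse its TSV output."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", language, "tsv"],
        capture_output=True,
        text=True,
        check=True,  # raise when tesseract exits with an error
    )
    return parse_tesseract_tsv(result.stdout)
```

The boxes come back in image pixels, so they would still need to be scaled to whatever page coordinate system Qiqqa stores in its index.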

And then, totally off topic of course, there's my intent to run PDFs through other OCR engines as an alternative to Tesseract, such as ABBYY FineReader and ReadIris, as those are the ones I use on a more regular basis.

References / stuff I looked at while looking at Tesseract migration

GerHobbelt commented 4 years ago

As written in #135: upgrading to latest Tesseract implies:

Such a migration would of course impact the installer: maybe we should add code there to download the Tesseract installer and install it alongside Qiqqa. At least that would be the approach that increases the installer size the least.