jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

Feature request: Let Qiqqa export (save) the OCRed pdf file #159

Open raindropsfromsky opened 4 years ago

raindropsfromsky commented 4 years ago

A lot of PDF files consist of scanned page images, which makes their text unsearchable and makes research difficult. Qiqqa stands out by having built-in OCR, but that feature is not fully exploited yet:

Let Qiqqa export (save) the OCRed pdf file, so that the user can replace the image-based pdf file with the searchable pdf file.

This would let the user manipulate the text in the file with other tools outside Qiqqa.

GerHobbelt commented 4 years ago

Strongly related to #35.

Though I intend to end up with a Qiqqa which turns every PDF into a text-searchable PDF/A, we are still some way from that. This is essentially the same as "upgrading Tesseract" (#35): I want that upgrade to produce a background process flexible enough to deal with the various issues in the Qiqqa OCR process that are bothering me, e.g. https://github.com/jimmejardine/qiqqa-open-source/issues/135#issuecomment-569827317-permalink (even though #135 is marked as WontFix, it is a problem in the current Qiqqa OCR process that should be fixed, or at least fixable, in the new Qiqqa OCR process).
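As a rough illustration of what a newer Tesseract offers here, independent of Qiqqa's internals: the Tesseract 4+ command-line tool can write a PDF with an invisible text layer when given the stock `pdf` config name. The sketch below only builds the command line. The function names, file names, and defaults are hypothetical, the input is assumed to be an already rasterized page image (Tesseract does not read PDFs itself), and the result is a searchable PDF, not necessarily PDF/A.

```python
from pathlib import Path
from typing import List, Sequence


def tesseract_command(page_image: str, output_base: str,
                      lang: str = "eng",
                      outputs: Sequence[str] = ("pdf", "txt")) -> List[str]:
    """Build (but do not run) a Tesseract CLI call for one rasterized page.

    `outputs` are names of Tesseract's stock config files: "pdf" asks for
    the page image plus an invisible text layer, "txt" for plain text.
    """
    if not outputs:
        raise ValueError("at least one output format is required")
    return ["tesseract", page_image, output_base, "-l", lang, *outputs]


def expected_files(output_base: str,
                   outputs: Sequence[str] = ("pdf", "txt")) -> List[Path]:
    """File names Tesseract derives from the output base, one per format."""
    return [Path(f"{output_base}.{ext}") for ext in outputs]


if __name__ == "__main__":
    # Hypothetical page image; pass the list to subprocess.run(..., check=True)
    # on a machine where Tesseract is installed to actually run the OCR.
    print(" ".join(tesseract_command("page-001.png", "page-001")))
```

Keeping the command construction separate from its execution makes it easy to inspect or log what a background OCR process would run before committing to it.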

raindropsfromsky commented 4 years ago

I may have failed to convey my point: I did not even know that Qiqqa sometimes fails to detect the regions properly (defective layout analysis).

My point was that once the pdf is OCRed, it is stored (hidden) somewhere inside Qiqqa. Instead, it should be accessible to the user.

In fact, I now realize that it is best NOT to make it available as a pdf: if the user gets it as doc or docx, they can correct the file manually. (A pdf, on the other hand, is not readily editable, so the benefit of OCR is lost.)

If the user wants the final file as a readable pdf, converting the docx to pdf is a trivial task (e.g. use LibreOffice Writer to export as pdf, or use a virtual printer to save as pdf.)

I am not familiar with the structure of a pdf file, but I read in some Qiqqa help text that it stores the OCRed text as a layer in the original file (some sort of overlay). Thus it is able to correlate the original with the text that is extracted from it.

If my understanding is correct, it should not be a big effort to just export the extracted text (without the original layer). In that case, could it be brought forward to an earlier release?

GerHobbelt commented 4 years ago

Alas, things are a bit hairier inside. See the doc about the Qiqqa OCR process ATM.


BTW, AFAICT Qiqqa already exports the OCR-ed text to file via this click path:

[View PDF Document (by double-clicking it in the list)] > Miscellaneous PDF Goodies (= the grey wheel in the toolbar) > (dropdown menu) > Convert Your PDF To Text

which shows the text (plus any images Qiqqa is able to extract from the PDF) in a new panel, where you can choose to either Print it or export it To Word.

(Off Topic: 🤔 might be nice to add a 'To HTML' button there, but that's low prio for me right now.)

[Screenshot: 2020-03-23_21-07-29]

GerHobbelt commented 4 years ago

> My point was that once the pdf is OCRed, it is stored (hidden) somewhere inside Qiqqa. Instead, it should be accessible to the user.
>
> In fact, I now realize that it is best NOT to make it available as a pdf: if the user gets it as doc or docx, they can correct the file manually. (A pdf, on the other hand, is not readily editable, so the benefit of OCR is lost.)
>
> If the user wants the final file as a readable pdf, converting the docx to pdf is a trivial task (e.g. use LibreOffice Writer to export as pdf, or use a virtual printer to save as pdf.)
>
> I am not familiar with the structure of a pdf file, but I read in some Qiqqa help text that it stores the OCRed text as a layer in the original file (some sort of overlay). Thus it is able to correlate the original with the text that is extracted from it.

The idea that's been running around in my head for a while, given #35 and all this, is to migrate the Qiqqa internal OCR cache to some form of hOCR format (as I want the 'text layer embedded' PDF copies for myself, next to the originals 😉 ).

It just so happens that Tesseract 4 and 5 (beta) can output hOCR format directly. If I understand correctly (I haven't yet looked into this more deeply than a bit of Google and Wikipedia), hOCR can be viewed in a modern web browser, as it is enriched HTML of sorts.

And that means we MAY be able to point users to other applications when they wish to manually correct the text layer/output.
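To make the "enriched HTML" point concrete, here is a minimal sketch of reading words and their bounding boxes out of hOCR with nothing but Python's standard HTML parser. It assumes the usual hOCR conventions (word elements carry the class `ocrx_word` and a `bbox x0 y0 x1 y1` property in their `title` attribute). The sample markup is hand-written for illustration and all names are hypothetical. This is not Qiqqa's cache format, and real Tesseract output carries more elements and properties.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, ...]          # (x0, y0, x1, y1) in image pixels
Word = Tuple[str, BBox]         # (text, bounding box)


def parse_bbox(title: str) -> Optional[BBox]:
    """Pull 'bbox x0 y0 x1 y1' out of an hOCR title attribute, if present."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect the text and bounding box of every `ocrx_word` element."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Word] = []
        self._depth = 0                       # > 0 while inside a word element
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1                  # nested markup such as <strong>
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attr.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:                  # the word element just closed
            text = "".join(self._text).strip()
            if text and self._bbox is not None:
                self.words.append((text, self._bbox))


def hocr_words(hocr: str) -> List[Word]:
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample in the hOCR style, not actual Tesseract output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 36 92 210 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'><strong>world</strong></span>
 </span>
</div>
"""

if __name__ == "__main__":
    for text, bbox in hocr_words(SAMPLE):
        print(text, bbox)
```

Because the text stays tied to its page coordinates, an external editor could in principle correct a word without losing its position on the page.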

raindropsfromsky commented 4 years ago

Sounds exciting! 👍