jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

The help should explain what is "textification" and how it affects the workflow #169

Open raindropsfromsky opened 4 years ago

raindropsfromsky commented 4 years ago

The status line often says "x pages to textify and y pages to OCR". But this peculiar word "textify" is not explained anywhere! It is not a standard term in any industry.

Therefore, the Qiqqa manual and the help file on the website must explain this word and how it affects the performance of Qiqqa (search results, and also the "save pdf as text" function).

GerHobbelt commented 4 years ago

Related to #165 and the discussion there.

raindropsfromsky commented 4 years ago

So, from the explanation you gave in the other issue, here's what I infer:

"textify" = Text extraction. This process is done on a file that already has searchable text. Since the text is already there, Qiqqa only assigns coordinates to each word.

"OCR" = Tesseract-based OCR. This process is done if the page is a scanned image with no machine-searchable text. After that, Qiqqa assigns coordinates to each word.

Please confirm?

GerHobbelt commented 4 years ago

Correct.

Nitpick: "textify" = extracting both the words and the coordinates.

If no textify is done, there's nothing: just a file (which happens to be a PDF) and an (empty) metadata record in the library database.
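
To make the distinction concrete, here is a minimal Python sketch of the two paths as described in this thread. It is an illustration only, not Qiqqa's actual code: `Word`, `Page`, `extract_words`, and `pending_status` are hypothetical names, the OCR step is a caller-supplied stand-in rather than a real Tesseract call, and the way pages are counted for the status line is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    """One extracted word plus its position on the page."""
    text: str
    box: Tuple[float, float, float, float]  # left, top, width, height


@dataclass
class Page:
    """Hypothetical page model: text_layer is None for a scanned image."""
    text_layer: Optional[List[Word]]


def extract_words(page: Page, ocr: Callable[[Page], List[Word]]) -> Tuple[str, List[Word]]:
    """Return which path was taken and the resulting words with coordinates."""
    if page.text_layer is not None:
        # "textify": the text is already in the PDF, so both the words
        # and their coordinates come straight from the text layer.
        return "textify", list(page.text_layer)
    # "OCR": no machine-readable text, so the words and coordinates
    # must be recognised from the page image (stand-in callable here).
    return "ocr", ocr(page)


def pending_status(pages: List[Page]) -> str:
    """Build a status line in the format quoted in this issue (counting rule assumed)."""
    to_textify = sum(1 for p in pages if p.text_layer is not None)
    to_ocr = len(pages) - to_textify
    return f"{to_textify} pages to textify and {to_ocr} pages to OCR"
```

Either path ends with the same kind of result, a list of words with coordinates, which is what search and "save pdf as text" would then rely on.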

HTH

raindropsfromsky commented 4 years ago

Qiqqa design is definitely inspired by a ransom note, which is composed from words cut out from newspapers and magazines. :D