holtwick / pdfify

Issue tracking for PDFify. To upvote features give a 👍
https://pdfify.app/future?ref=github&kw=start

Improve the information to help decide which OCR option to use #38

Closed acrostich closed 3 years ago

acrostich commented 3 years ago

One can choose Tesseract or Apple Vision. Tesseract is marked as recommended, but no rationale is given either in the app or on https://pdfify.app/en/help#ocr

Also, there's a note at the bottom of the text recognition engine preference screen suggesting that when using Tesseract one should choose only one language. Again, no rationale is given.

From a quick test on Mojave with an English document and a French one, Tesseract seems to work whether multiple languages are selected (I had selected English, French and Korean) or only the appropriate one. So what difference does it make, if any? Also, I didn't try it, but can it do multi-language OCR on a multi-language document?

I tried Apple Vision once on an English document (also on Mojave) and it seemed to generate exactly the same OCR text as Tesseract, so either there's a bug or I made a mistake. I haven't had time to explore this further.

Due to another issue, already logged, it is not obvious when the OCR operation finishes, so I couldn't get a good feel for whether the choice of engine, or selecting one versus several languages, affects OCR speed.

Ideally, the preference screen where one selects the engine should give some indication that helps one make the best choice for their situation, for instance whether one option is more accurate, faster, uses less memory, etc. Benchmarks and/or further information could also be provided on the help page.

Related to the need for more information about the OCR engine: when trying to OCR some 'large documents', the message 'Please note that processing large documents will be slow or can even fail, because all operations are happening in memory. In case of problems or questions please contact support.' is displayed. How large is a large document? How slow is slow? How will I know if it has failed? If it has failed, is it a complete failure or not (it seems to OCR some pages and then silently fail)? If not, can I restart OCR so that it processes only the missing pages?

Basically all the choices made by the user should be informed choices, so the information to make these choices must be available.

holtwick commented 3 years ago

Thanks, it all makes sense. I'll just strip it down to bullet points:

acrostich commented 3 years ago

It's not so much 'clarify language selection' as 'explain the consequences of the choice'. It is already clear that one can select multiple languages, but it is suggested that one shouldn't. There may be a good reason for that; if so, what is it? For instance, are there circumstances where it is better to select only one language? The simplest choice, if using Tesseract, would be to select all the languages one is likely to need to OCR and never touch this setting again. Why select just one language instead, as suggested?

When you write above 'yes you can choose multiple ones, it should then optimize results also for mixed contents', that suggests that multiple selection is more useful and has no downside. When you write in the preference dialog 'you should preferably go with one only', that gives the opposite message. I suspect both might be true in different circumstances, but I have nothing on which to base a decision about which of your statements applies to my present circumstances.

Also, either that bullet point should cover clarifying the choice of OCR engine too, or you need another bullet point.

holtwick commented 3 years ago

I see. The issue is that multiple languages slow down the OCR process.
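
To get a feel for how large the slowdown is on a given document, one could time the same OCR call with different language selections. The following is a rough Python sketch only: `time_language_sets` and the `run_ocr` callable are hypothetical names, and how `run_ocr` actually performs OCR (for example by calling the Tesseract command-line tool) is left to the caller.

```python
import time
from typing import Callable, Dict, Sequence, Tuple


def time_language_sets(
    run_ocr: Callable[[Tuple[str, ...]], None],
    language_sets: Sequence[Sequence[str]],
) -> Dict[Tuple[str, ...], float]:
    """Return the wall-clock seconds run_ocr takes for each language selection.

    run_ocr is a caller-supplied (hypothetical) function that performs OCR on
    a fixed document using the given tuple of language codes.
    """
    timings: Dict[Tuple[str, ...], float] = {}
    for langs in language_sets:
        key = tuple(langs)
        start = time.perf_counter()
        run_ocr(key)
        timings[key] = time.perf_counter() - start
    return timings
```

Comparing, say, `("eng",)` against `("eng", "fra", "kor")` on the same document would show the cost of the extra languages on that machine; the numbers will vary with document and hardware.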

acrostich commented 3 years ago

That's exactly the type of information that would be useful. I now know that I can decide whether to select several languages or not depending on whether I prefer convenience or speed. I can now make an informed decision.

holtwick commented 3 years ago

https://github.com/tesseract-ocr/tessdoc/blob/9938c2bcc2ce3fe6056ee97df636af7b9fb58ac6/Command-Line-Usage.md#using-multiple-languages
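
For reference, the Tesseract command-line tool takes several languages joined with `+` in its `-l` option. A minimal Python sketch of turning an ordered selection into such a command follows. The function names and file names are hypothetical, and this is not necessarily how PDFify invokes Tesseract.

```python
from typing import List, Sequence


def language_argument(selected: Sequence[str]) -> str:
    """Join language codes in selection order, e.g. ["eng", "fra"] -> "eng+fra"."""
    if not selected:
        raise ValueError("at least one language must be selected")
    return "+".join(selected)


def tesseract_command(image: str, output_base: str, selected: Sequence[str]) -> List[str]:
    """Build an argument list for the tesseract CLI (nothing is executed here)."""
    return ["tesseract", image, output_base, "-l", language_argument(selected)]


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run to actually run OCR.
    print(tesseract_command("page.png", "page", ["eng", "fra", "kor"]))
```

The order of the codes in the joined string follows the order of the selection list, which is why the selection order matters.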

holtwick commented 3 years ago

[image]

de (English translation): Multiple languages can be selected. This improves the result for texts with mixed languages, but slows down processing. The languages are processed in the order in which they were selected.

acrostich commented 3 years ago

That's great, except for the last bit. If the languages are processed in the order in which they are selected, then it would be most helpful to have an indication of that order. A good solution would be to move the selected languages to the top of the list, in the order of selection.
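
The suggested ordering could be sketched roughly as follows, in Python with hypothetical names. It is an illustration of the idea only, not PDFify's actual code.

```python
from typing import List, Sequence


def order_for_display(all_languages: Sequence[str], selected: Sequence[str]) -> List[str]:
    """Selected languages first, in selection order; the rest keep their original order."""
    chosen = set(selected)
    rest = [lang for lang in all_languages if lang not in chosen]
    return list(selected) + rest
```

With this, selecting Korean and then English from an alphabetical list would show Korean first, English second, and the unselected languages below them in their usual order.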

holtwick commented 3 years ago

Q: Check if Apple Vision OCR is really applied.
A: It is.

holtwick commented 3 years ago

I agree, order should become clearer. Though it is available already:

[image: Holtwick-PDFify-2020-pSRmGajN@2x]

acrostich commented 3 years ago

Brilliant. I suggest you make this order by activity the default. I hadn't noticed it was an option and it is the most useful view.

holtwick commented 3 years ago

With 3.3.2 the default ordering is active