Decoding is slow when multiple languages are used

DorisGM commented 5 years ago

Summary: Decoding is slow when multiple languages are used. Can I switch languages dynamically when decoding images? I want to support multiple languages, but each image contains only one language: sometimes eng, sometimes ara. No single sentence mixes languages.

Steps to reproduce the issue:

1. I want to support multiple languages, but each image contains only one language: sometimes eng, sometimes ara. No single sentence mixes languages.
2. I initialized TessBaseAPI with eng + ara + msa to decode several images, each of which may be English or Arabic.
3. When I initialize with only English, the image is decoded quickly. But when I initialize TessBaseAPI with eng + ara + msa, decoding the same English sentence is very slow.

Expected result: Decoding with TessBaseAPI initialized for eng + ara + msa should be as fast as decoding with eng alone. Alternatively, maybe I need to switch languages myself when decoding images in different languages. If I re-initialize with a different language each time, will that affect decoding performance, and should I call TessBaseAPI.clear() before switching?

Actual result: Decoding is slow when multiple languages are used

Tess-two version: 9.0.0

Android version: 7.0.0

Phone/device model: Android TV Amlogic 905X

Phone/device architecture (armeabi, armeabi-v7a, x86, mips, arm64-v8a, x86_64, mips64): arm64-v8a

Link to training data used: https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Link to image used as input:

ott_subtitle jpg

rmtheis commented 5 years ago

I don't have a good way to do it. As an interesting test, you could try running Firebase's language detection on the output of the English OCR and then run Arabic OCR if it isn't identified as English.
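
A minimal sketch of that fallback idea, in plain Java so it can be tried without a device. The two OCR functions and the `looksEnglish` check are hypothetical stand-ins: in an app they would wrap one English-only and one Arabic-only OCR instance plus whatever language detector is used. None of these names come from tess-two or Firebase.

```java
import java.util.function.Function;
import java.util.function.Predicate;

/**
 * Runs the English OCR pass first and falls back to Arabic only when the
 * English output is not recognized as English. All three collaborators
 * are hypothetical stand-ins supplied by the caller.
 */
class FallbackOcr {
    private final Function<byte[], String> englishOcr;
    private final Function<byte[], String> arabicOcr;
    private final Predicate<String> looksEnglish;

    FallbackOcr(Function<byte[], String> englishOcr,
                Function<byte[], String> arabicOcr,
                Predicate<String> looksEnglish) {
        this.englishOcr = englishOcr;
        this.arabicOcr = arabicOcr;
        this.looksEnglish = looksEnglish;
    }

    String recognize(byte[] image) {
        // First pass: single-language English OCR.
        String english = englishOcr.apply(image);
        if (looksEnglish.test(english)) {
            return english;
        }
        // Second pass, only for images whose English output was rejected.
        return arabicOcr.apply(image);
    }
}
```

With this shape, English images cost one single-language pass and Arabic images cost two. Whether that beats a combined eng + ara pass would need to be measured on the target device.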

Note that msa is Malay and not Modern Standard Arabic.

Anyway, the slowness is a normal side effect and not really a bug in this project.

DorisGM commented 5 years ago

Thanks for your reply. I now initialize with a different language depending on the language of the image being OCRed. It works well.
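
For anyone trying the same thing, here is a rough sketch of one way to organize per-language switching: keep one single-language engine per language code and create it lazily, so the initialization cost is paid once per language instead of on every switch. `SingleLanguageEngine` and `EnginePool` are hypothetical names for illustration, not tess-two types. In an app the factory would presumably create and initialize a separate OCR instance for the given language. The sketch assumes memory allows keeping one instance per language loaded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical single-language OCR engine; not a tess-two type. */
interface SingleLanguageEngine {
    String recognize(byte[] image);
    void close();
}

/** Lazily creates one engine per language code and reuses it afterwards. */
class EnginePool {
    private final Function<String, SingleLanguageEngine> factory;
    private final Map<String, SingleLanguageEngine> engines = new HashMap<>();

    EnginePool(Function<String, SingleLanguageEngine> factory) {
        this.factory = factory;
    }

    String recognize(String language, byte[] image) {
        // The factory runs only the first time a language is requested.
        SingleLanguageEngine engine = engines.computeIfAbsent(language, factory);
        return engine.recognize(image);
    }

    void closeAll() {
        // Release every cached engine, then forget them.
        for (SingleLanguageEngine engine : engines.values()) {
            engine.close();
        }
        engines.clear();
    }
}
```

The pool is not thread-safe as written. If images are decoded from several threads, access would need to be synchronized or confined to one worker thread.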

rmtheis / tess-two

Decoding is slow when multiple languages are used #261