Legacy ara language not working with recent versions of tesseract

naourass commented 2 years ago

Environment

Tesseract Version: 5.x, 4.1.x, 4.0.x
Platform: Linux DESKTOP-**** 5.10.102.1-microsoft-standard-WSL2 x86_64 GNU/Linux (Ubuntu 20.04)

Current Behavior:

While other legacy languages work fine with recent versions of tesseract, the legacy ara 4.00 model fails with this error in all versions listed above (with --oem 0):

read_params_file: Can't open txt
mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file src/classify/adaptmatch.cpp

I'm using tessdata 4.00 because the Arabic legacy model seems to have been removed from the newer versions.

Suggested Fix:

Update ara traineddata file with legacy support for tesseract 5.x, or add documentation for tesseract 4.00 installation.

amitdo commented 2 years ago

The legacy Arabic model used the 'Cube' OCR engine. The code for that engine was removed in version 4.0 and will not be restored. Other languages used another engine, which we now call 'the legacy engine'. The legacy engine is still supported in version 5.2.0.

The only thing we can do with this issue is to improve the documentation.

amitdo commented 2 years ago

We can also do something like this in the code:

// Sketch only: assumes lang is a std::string and oem is an int.
if (lang == "ara" && oem == 0) {
  fprintf(stderr, "Error: OEM 0 is not supported for Arabic\n");
  return EXIT_FAILURE;
}

There's probably a better way to handle this issue, but the suggested one above will solve it and is 'good enough'.
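As a slightly more complete version of the check above, here is a minimal, self-contained sketch. The names `IsCubeOnlyLanguage` and `CheckEngineMode` are hypothetical and are not part of the Tesseract code base. The set of affected languages contains only `ara`, as an assumption for illustration, and would need to be filled in by someone who knows which models relied on Cube.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <unordered_set>

// Hypothetical helper, not part of the Tesseract API.
// Returns true when the language's old model relied on the removed Cube
// engine, so no legacy (--oem 0) model is available for it.
bool IsCubeOnlyLanguage(const std::string &lang) {
  // Assumption: only "ara" is listed here; extend the set as needed.
  static const std::unordered_set<std::string> kCubeOnly = {"ara"};
  return kCubeOnly.count(lang) > 0;
}

// Hypothetical check: reports a readable error instead of letting the
// run reach the assert in adaptmatch.cpp.
int CheckEngineMode(const std::string &lang, int oem) {
  if (oem == 0 && IsCubeOnlyLanguage(lang)) {
    std::fprintf(stderr, "Error: --oem 0 is not supported for '%s'\n",
                 lang.c_str());
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```

With this sketch, `CheckEngineMode("ara", 0)` returns `EXIT_FAILURE` and prints the message, while `CheckEngineMode("ara", 1)` and `CheckEngineMode("eng", 0)` return `EXIT_SUCCESS`. It does not handle combined language strings such as `ara+eng`; those would have to be split before the lookup.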

Shreeshrii commented 1 year ago

Cube was also used by Hindi (hin) and other Indic languages.
