tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Legacy ara language not working with recent versions of tesseract #3929

Open naourass opened 2 years ago

naourass commented 2 years ago

Environment

Current Behavior:

While other legacy languages work fine with recent versions of Tesseract, the legacy ara 4.00 model returns this error in all versions listed above (with --oem 0):

read_params_file: Can't open txt
mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file src/classify/adaptmatch.cpp

I'm using tessdata 4.00 because it seems that the Arabic legacy model has been removed from the newer versions.

Suggested Fix:

Update ara traineddata file with legacy support for tesseract 5.x, or add documentation for tesseract 4.00 installation.

amitdo commented 2 years ago

Arabic was using the 'Cube' OCR engine. The code for that engine was removed in version 4.0 and will not be restored. Other languages used another engine, which we now call 'the legacy engine'. The legacy engine is still supported in version 5.2.0.

The only thing we can do with this issue is to improve the documentation.

amitdo commented 2 years ago

We can also do something like this in the code:

// Sketch only: assumes `lang` is a std::string and `oem` is the numeric --oem value.
if (lang == "ara" && oem == 0) {
  fprintf(stderr, "Error: OEM 0 is not supported for Arabic\n");
  return EXIT_FAILURE;
}

There's probably a better way to handle this issue, but the suggested one above will solve it and is 'good enough'.

Shreeshrii commented 1 year ago

Cube was also used by Hindi (hin) and other Indic languages.
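
If the check suggested earlier in this thread were extended to Hindi as well, it might look roughly like the sketch below. This is only an illustration, not code from the Tesseract source tree: the helper names `UsedCubeEngine` and `CheckLegacySupport` are hypothetical, the language set contains only the two codes named in this thread (`ara` and `hin`), and combined language strings such as `eng+ara` are not handled.

```cpp
#include <cstdio>
#include <cstdlib>
#include <set>
#include <string>

// Hypothetical helper (not part of Tesseract): true when `lang` is one of the
// languages this thread says relied on the removed Cube engine.
bool UsedCubeEngine(const std::string &lang) {
  static const std::set<std::string> kCubeLangs = {"ara", "hin"};
  return kCubeLangs.count(lang) > 0;
}

// Hypothetical guard: rejects --oem 0 for a Cube-only language with a readable
// message instead of letting the run end in an assertion failure.
int CheckLegacySupport(const std::string &lang, int oem) {
  if (oem == 0 && UsedCubeEngine(lang)) {
    std::fprintf(stderr, "Error: --oem 0 is not supported for '%s'\n",
                 lang.c_str());
    return EXIT_FAILURE;
  }
  return EXIT_SUCCESS;
}
```

Keeping the language codes in a single set means that other Cube-based languages could be added in one place once they are confirmed.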