Open naourass opened 2 years ago
Arabic used the 'Cube' OCR engine. The code for that engine was removed in version 4.0 and will not be restored. Other languages used a different engine, which we now call 'the legacy engine'. The legacy engine is still supported in version 5.2.0.
The only thing we can do with this issue is to improve the documentation.
We can also do something like this in the code:
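if (std::string(lang) == "ara" && oem == 0) {
  // Arabic relied on Cube, so there is nothing for the legacy engine to load.
  fprintf(stderr, "Error: OEM 0 is not supported for Arabic\n");
  return EXIT_FAILURE;
}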
There's probably a better way to handle this issue, but the suggested one above will solve it and is 'good enough'.
Cube was also used by Hindi (hin) and other Indic languages.
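Since more than one language depended on Cube, the check suggested above could be generalized. The following is only a rough sketch, not code from the Tesseract source: the function names and the `kCubeOnlyLangs` list are hypothetical, the list contains just the two languages named in this thread, and it assumes the language argument may be a `+`-separated combination such as `eng+ara`.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical list: only the languages named in this thread.
// A real check would need the full set of former Cube languages.
static const std::set<std::string> kCubeOnlyLangs = {"ara", "hin"};

// Returns the first language in a '+'-separated spec (e.g. "eng+ara")
// that appears in kCubeOnlyLangs, or an empty string if none does.
std::string FirstCubeOnlyLang(const std::string &lang_spec) {
  std::istringstream parts(lang_spec);
  std::string lang;
  while (std::getline(parts, lang, '+')) {
    if (kCubeOnlyLangs.count(lang) > 0) {
      return lang;
    }
  }
  return "";
}

// True when the legacy engine (--oem 0) is requested for a language
// that has no legacy model according to the list above.
bool LegacyRequestIsUnsupported(const std::string &lang_spec, int oem) {
  return oem == 0 && !FirstCubeOnlyLang(lang_spec).empty();
}
```

A caller could print an error naming the language returned by `FirstCubeOnlyLang` and exit with `EXIT_FAILURE` when `LegacyRequestIsUnsupported` returns true, instead of running into the assertion failure.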
Environment
Current Behavior:
While other legacy languages work fine with recent versions of Tesseract, the legacy ara 4.00 model returns this error in all versions listed above (with --oem 0):
read_params_file: Can't open txt mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file src/classify/adaptmatch.cpp
I'm using tessdata 4.00 because the Arabic legacy model seems to have been removed from newer versions.
Suggested Fix:
Update the ara traineddata file with legacy support for Tesseract 5.x, or add documentation for installing Tesseract 4.00.