naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Legacy model does not work for indic and arabic scripts due to Legacy data being removed #931

Open Balearica opened 2 weeks ago

Balearica commented 2 weeks ago

Tesseract.js uses two sets of language data by default. When the oem is set to the default (LSTM only), integerized versions of tessdata_best (LSTM-only data) are used. When the oem is set to Legacy or LSTM with Legacy fallback, files from tessdata are used instead; these generally contain both the integerized tessdata_best data and the data for the Legacy model.
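The selection rule described above can be sketched roughly as follows. This is an illustration only, not the Tesseract.js source: `pickLangDataSet` and the returned labels are hypothetical names, and the numeric values are assumed to follow Tesseract's `OcrEngineMode` enum.

```javascript
// Hypothetical sketch; names are illustrative, not Tesseract.js internals.
// Numeric values mirror Tesseract's OcrEngineMode enum.
const OEM = Object.freeze({
  TESSERACT_ONLY: 0,          // Legacy model only
  LSTM_ONLY: 1,               // LSTM model only (the default described above)
  TESSERACT_LSTM_COMBINED: 2, // LSTM with Legacy fallback
});

// Returns a label for the data set that would be loaded for a given oem.
function pickLangDataSet(oem = OEM.LSTM_ONLY) {
  switch (oem) {
    case OEM.LSTM_ONLY:
      // Smaller files: integerized tessdata_best, no Legacy components.
      return 'tessdata_best (integerized, LSTM only)';
    case OEM.TESSERACT_ONLY:
    case OEM.TESSERACT_LSTM_COMBINED:
      // Larger files: expected to carry Legacy data alongside the LSTM data.
      return 'tessdata (LSTM + Legacy)';
    default:
      throw new RangeError(`Unsupported oem in this sketch: ${oem}`);
  }
}
```

The point of the sketch is that the tessdata files are only ever fetched on the Legacy paths, so a tessdata file without Legacy components serves no purpose here.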

While this generally works, it looks like the Legacy model data was removed from the files for several languages in the tessdata repo. This appears to have been motivated purely by the fact that these files are large and that the Legacy models perform poorly compared to the LSTM models.

https://github.com/tesseract-ocr/tessdata/pull/90

These justifications do not apply to our case, so the Tesseract.js data should be modified to add the Legacy data back. Based on the PR linked above, most users should not be using the Legacy model for these languages, as the LSTM model is both much smaller and performs much better. However, Tesseract.js only loads the files from tessdata if the user specifically requested the Legacy model. If the user sets the oem to Legacy, we need to load Legacy data.

Note that this issue is specific to cases where both Legacy and LSTM language data existed but the Legacy data was later removed. There are other languages where data for one model never existed in the first place, and those will remain broken.
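To make the distinction concrete, here is a minimal sketch that sorts languages into those that already work, those that could be fixed by restoring Legacy data, and those that cannot. The manifest shape, the field names (`hasLegacyNow`, `hadLegacyBefore`), and the language codes are all hypothetical placeholders, not real tessdata metadata.

```javascript
// Hypothetical sketch: classify a language by the state of its Legacy data.
function classifyLegacySupport(entry) {
  if (entry.hasLegacyNow) return 'ok';             // Legacy data still present
  if (entry.hadLegacyBefore) return 'restorable';  // removed upstream; in scope here
  return 'unsupported';                            // never existed; stays broken
}

// Placeholder manifest; the codes below do not refer to real languages.
const manifest = {
  langA: { hasLegacyNow: true, hadLegacyBefore: true },
  langB: { hasLegacyNow: false, hadLegacyBefore: true },
  langC: { hasLegacyNow: false, hadLegacyBefore: false },
};

// Collect the languages whose Legacy data would need to be added back.
const restorable = Object.keys(manifest).filter(
  (code) => classifyLegacySupport(manifest[code]) === 'restorable'
);
```

Only the `restorable` group is covered by this issue; the `unsupported` group would still fail when the oem is set to Legacy.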