Closed: Shreeshrii closed this issue 7 years ago
A Deva.traineddata could be added, trained on the training text for all of these languages taken together.
Related papers:
A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Tushar Karayil, Adnan Ul-Hasan, Thomas M. Breuel
Can we build language-independent OCR using LSTM networks? (2013) Adnan Ul-Hasan, Thomas M. Breuel
More interesting papers about LSTM for OCR: https://github.com/tmbdev/ocropy/wiki/Publications
List of Unicode Devanagari fonts that could be used for training, if not already being used:
https://github.com/tesseract-ocr/tesseract/issues/561#issuecomment-268499418
Samples of glyphs in different fonts
Similarly, it would be nice to have a generic traineddata for multiple Latin-script-based languages, as described in the paper I mentioned above.
Likewise, you could provide a generic Cyrillic traineddata.
And maybe one based on the Arabic script.
Devanagari corpus
Marathi http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar http://ltrc.iiit.ac.in/ltrc/internal/nlp/corpus/ftp/marathicorp.tgz
Hindi http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar http://ltrc.iiit.ac.in/ltrc/internal/nlp/corpus/ftp/hindicorp.tgz http://ocr.iiit.ac.in/Hindi100.html
Sanskrit https://sa.wikibooks.org/ https://sa.wikisource.org/
https://github.com/tesseract-ocr/langdata/issues/41#issuecomment-272718868 @stweil I think it's related to your message here: https://groups.google.com/forum/#!topic/tesseract-dev/8H_4K3vPRJE
> Likewise, you could provide a generic Cyrillic traineddata.
I assume the same would be needed for Greek. Or would it be better to include Greek characters in the Latin training set? Several sciences (especially Physics and Mathematics) use single Greek characters in texts which are mostly written with Latin letters.
https://github.com/tesseract-ocr/langdata/issues/59#issuecomment-290235084
@theraysmith commented 2 days ago
I've also added an experiment to throw all the Latin languages together into a single engine. (Actually a separate model for each of 36 scripts). If that works it will solve the problem of reading Citroen in German and picking up the e umlaut. The downside is that this model has almost 400 characters in it, despite carefully keeping out the long-tail graphics characters. Even if it does work, it will be slower, but possibly not much slower than running 2 languages. It will have about 56 languages in it. I have some optimism that this may work, ever since I discovered that the vie LSTM model gets the phototest.tif image 100% correct.
This request was implemented by Ray:
https://github.com/tesseract-ocr/tessdata/issues/62#issuecomment-319442674
Thanks!
With LSTM training, the dictionary dawg files have become optional. In light of this, I want to suggest an additional traineddata file for the Devanagari script, which can cater to all main languages written in it.
The reason for suggesting this: when I tested OCR on a Marathi text, many words with rakaara were not recognised correctly. The same page OCRed with Sanskrit recognised those words correctly, but got some others wrong.
So, in addition to the multiple traineddata files for the various languages written in Devanagari, a single script-level traineddata covering all of them would be useful.
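As a rough sketch, the per-language models can already be combined at run time with tesseract's `+` syntax, and a script-level model (once installed under tessdata/script/) would be selected the same way. File names here (page.png, out) are placeholders, and the `script/Devanagari` name assumes the requested script-level traineddata is present:

```shell
# Only run if tesseract is actually installed on this machine.
if command -v tesseract >/dev/null 2>&1; then
  # Combine the Hindi, Marathi and Sanskrit models for one page:
  tesseract page.png out -l hin+mar+san

  # With a single script-level Devanagari model instead:
  tesseract page.png out -l script/Devanagari

  # Show which traineddata files are available:
  tesseract --list-langs
fi
```

Running several languages with `+` is slower than a single model, which is part of the motivation for one combined Devanagari traineddata.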