tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Suggest 'deva' for Devanagari #41

Closed Shreeshrii closed 7 years ago

Shreeshrii commented 7 years ago

With LSTM training the dictionary dawg files have become optional. In light of this, I want to suggest an additional traineddata file for Devanagari script, which can cater to all main languages written in it.

The reason for suggesting this is that when I tested OCR on a Marathi text, a lot of words with rakaara were not recognised correctly. However, the same page OCRed with Sanskrit recognised those correctly, though some other words were then incorrect.

So, in addition to the multiple traineddata files for the various languages written in Devanagari, a combined Devanagari traineddata would be useful.
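As an aside, the rakaara problem described above can be measured rather than eyeballed. Here is a minimal Python sketch (the counting heuristic and sample words are my own, not from this thread) that picks out words containing a rakaara or repha conjunct, i.e. a VIRAMA (U+094D) adjacent to RA (U+0930), so OCR output from two language models can be compared on exactly those words:

```python
import re

# Rakaara appears as <consonant, virama, ra> (e.g. क्र) and the repha form
# as <ra, virama, consonant> (e.g. र्क); match either ordering.
RAKAARA = re.compile("\u094d\u0930|\u0930\u094d")

def rakaara_words(text):
    """Return the words in `text` that contain a rakaara/repha conjunct."""
    return [w for w in text.split() if RAKAARA.search(w)]

print(rakaara_words("प्रकाश धर्म पाणी क्रम"))
```

Running this over the ground truth and over each model's output would show directly how many rakaara words each traineddata gets right.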

Shreeshrii commented 7 years ago

We can add a Deva.traineddata which is trained on the training text for all these languages taken together.

amitdo commented 7 years ago

Related papers:

A Segmentation-Free Approach for Printed Devanagari Script Recognition (2015) Tushar Karayil, Adnan Ul-Hasan, Thomas M. Breuel

Can we build language-independent OCR using LSTM networks? (2013) Adnan Ul-Hasan, Thomas M. Breuel

More interesting papers about LSTM for OCR: https://github.com/tmbdev/ocropy/wiki/Publications

Shreeshrii commented 7 years ago

List of Unicode Devanagari fonts that could be used for training, if not already being used:

https://github.com/tesseract-ocr/tesseract/issues/561#issuecomment-268499418

Samples of glyphs in different fonts:

https://github.com/tesseract-ocr/tesseract/issues/654

amitdo commented 7 years ago

Similarly, it would be nice to have a generic traineddata for multiple Latin-script-based languages, as described in the paper I mentioned above.

Likewise, you could provide a generic Cyrillic traineddata.

amitdo commented 7 years ago

And maybe one based on the Arabic script.

Shreeshrii commented 7 years ago

Devanagari corpus

Marathi http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar http://ltrc.iiit.ac.in/ltrc/internal/nlp/corpus/ftp/marathicorp.tgz

Hindi http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar http://ltrc.iiit.ac.in/ltrc/internal/nlp/corpus/ftp/hindicorp.tgz http://ocr.iiit.ac.in/Hindi100.html

Sanskrit https://sa.wikibooks.org/ https://sa.wikisource.org/
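Before feeding any of these corpora into training, the text usually needs cleaning. A minimal Python sketch (the threshold and the heuristic are my own assumptions, not an official Tesseract preprocessing step) that keeps only lines which are predominantly Devanagari, dropping English boilerplate that web-scraped corpora tend to contain:

```python
def is_devanagari_line(line, threshold=0.8):
    """Heuristically keep a line if most of its non-space characters
    fall in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return False
    deva = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return deva / len(chars) >= threshold

corpus = ["हिंदी कॉर्पस की एक पंक्ति", "an English line", "मिश्रित line 42"]
training_lines = [l for l in corpus if is_devanagari_line(l)]
```

Digits, Latin URLs, and markup in the corpora would be filtered out this way; the threshold can be relaxed if mixed-script lines are wanted in the training text.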

amitdo commented 7 years ago

https://github.com/tesseract-ocr/langdata/issues/41#issuecomment-272718868 @stweil I think it's related to your message here: https://groups.google.com/forum/#!topic/tesseract-dev/8H_4K3vPRJE

stweil commented 7 years ago

Likewise, you could provide a generic Cyrillic traineddata.

I assume the same would be needed for Greek. Or would it be better to include Greek characters in the Latin training set? Several sciences (especially Physics and Mathematics) use single Greek characters in texts which are mostly written with Latin letters.
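The "mostly Latin with stray Greek symbols" case can be detected cheaply from Unicode character names. A rough Python sketch (the GREEK/LATIN name-prefix heuristic is my own approximation, not how Tesseract classifies scripts) that counts letters per script, which would show whether a page needs a Greek-aware model:

```python
import unicodedata

def script_counts(text):
    """Rough per-script letter counts based on Unicode character names."""
    counts = {"LATIN": 0, "GREEK": 0, "OTHER": 0}
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["LATIN"] += 1
        elif name.startswith("GREEK"):
            counts["GREEK"] += 1
        else:
            counts["OTHER"] += 1
    return counts

print(script_counts("the wavelength λ and frequency ν"))
```

On a physics text like the sample above, Greek letters are a small fraction of the total, which is the argument for folding them into the Latin training set rather than forcing a second model.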

Shreeshrii commented 7 years ago

https://github.com/tesseract-ocr/langdata/issues/59#issuecomment-290235084

@theraysmith commented 2 days ago

I've also added an experiment to throw all the Latin languages together into a single engine. (Actually a separate model for each of 36 scripts). If that works it will solve the problem of reading Citroen in German and picking up the e umlaut. The downside is that this model has almost 400 characters in it, despite carefully keeping out the long-tail graphics characters. Even if it does work, it will be slower, but possibly not much slower than running 2 languages. It will have about 56 languages in it. I have some optimism that this may work, ever since I discovered that the vie LSTM model gets the phototest.tif image 100% correct.
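The ~400-character figure Ray mentions follows from unioning the character inventories of all member languages. A toy Python sketch (the three inventories below are hypothetical samples, not real unicharsets, which would be derived from each language's full training text):

```python
# Hypothetical per-language character inventories (lowercase letters only).
charsets = {
    "eng": set("abcdefghijklmnopqrstuvwxyz"),
    "deu": set("abcdefghijklmnopqrstuvwxyzäöüß"),
    "fra": set("abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü"),
}

# The combined model must recognise every character any member language uses.
combined = set().union(*charsets.values())
print(len(combined), "distinct characters in the combined set")
```

Each added language contributes only its accented extras beyond the shared core, which is how 56 languages end up near 400 symbols rather than 56 × 26; it is also why "Citroën in German" stops being a special case, since ë is in the union regardless of which language the page is tagged as.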

amitdo commented 7 years ago

This request was implemented by Ray:

https://github.com/tesseract-ocr/tessdata/issues/62#issuecomment-319442674

Shreeshrii commented 7 years ago

Thanks!