tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0
114 stars 152 forks

Slight modification in Bodhi for incorporating a few unique characters in Drenjongke #54

Open bloodgroup-cplusplus opened 8 months ago

bloodgroup-cplusplus commented 8 months ago

All the training data present in bodhi and dzongkha applies to Drenjongke as well, with the two exceptions described below.

1) Since our corpus is not large, we could have typed all the data, but we opted for the OCR method instead. Testing the OCR method was beneficial because we found that the OCR-ed texts contained errors caused by the “tsha-lag” ◌༹ marker, which is used to mark the pronunciation of [bj] in Drenjongke. The use of this marker is unique to Drenjongke because Tibetan (bodhi) does not have the sound [bj].

2) For tokenization, space was set as the delimiter. Drenjongke script uses a syllable marker called “tsheg” ་ and puts a space at potential morpheme or word boundaries. This use of space in the orthography is specific to Drenjongke, as other Tibetan languages do not use spacing within a sentence.
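To make the two points above concrete, here is a minimal Python sketch of how the space-delimited tokenization and a check for the “tsha-lag” marker might look. It is only an illustration under assumptions, not part of Tesseract or of our actual pipeline: the function names are hypothetical, the marker is assumed to be encoded as U+0F39 and the tsheg as U+0F0B, and the sample string is an arbitrary sequence of Tibetan-script characters rather than a real Drenjongke word.

```python
# Assumed code points (not confirmed anywhere in this thread):
TSHEG = "\u0f0b"     # TIBETAN MARK INTERSYLLABIC TSHEG, the syllable marker
TSHA_LAG = "\u0f39"  # TIBETAN MARK TSA -PHRU, the "tsha-lag" marker


def tokenize(text):
    """Split text into word-level tokens, using whitespace as the delimiter."""
    return text.split()


def syllables(token):
    """Split one token into syllables at each tsheg, dropping empty pieces."""
    return [part for part in token.split(TSHEG) if part]


def tokens_with_tsha_lag(text):
    """Return the tokens that contain the tsha-lag marker."""
    return [token for token in tokenize(text) if TSHA_LAG in token]


if __name__ == "__main__":
    # Arbitrary character sequence for illustration only.
    sample = "\u0f56\u0f39\u0f0b\u0f40 \u0f42\u0f0b"
    for token in tokenize(sample):
        print(token, syllables(token))
    print("tokens to double-check after OCR:", tokens_with_tsha_lag(sample))
```

Running a check like `tokens_with_tsha_lag` over the ground-truth text would give a list of the words where the OCR output is most likely to be wrong, which could then be compared against the OCR-ed text by hand.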

Since these two issues are minor, we decided not to train Drenjongke from scratch and instead to start by adding the character required to solve problem 1. Also, on the previous issue, amitdo (https://github.com/amitdo) mentioned that training should be done on our side ... Since our expertise lies in language and not in programming, our understanding is that we should clone the entire tesseract-ocr repo locally, make the changes, and then train it. Or is it done some other way? ... Any help would be highly appreciated.