Shreeshrii opened 7 years ago
FF00-FFEF are in the forbidden_characters list for jpn (see langdata/jpn/forbidden_characters). This means they are not present in any of the Google-trained models. I don't remember how or who recommended that they be excluded, or why, other than that they make for awkward ambiguities.
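For reference, U+FF00-U+FFEF is Unicode's "Halfwidth and Fullwidth Forms" block, and NFKC normalization folds the half-width katakana in that range into the regular Katakana block, which is one way to see the ambiguity being described. A quick sketch in Python:

```python
import unicodedata

# U+FF00-U+FFEF is the "Halfwidth and Fullwidth Forms" block,
# the same range listed in langdata/jpn/forbidden_characters.
half_ha = "\uFF8A"      # HALFWIDTH KATAKANA LETTER HA
half_voiced = "\uFF9E"  # HALFWIDTH KATAKANA VOICED SOUND MARK

print(unicodedata.name(half_ha))  # HALFWIDTH KATAKANA LETTER HA

# NFKC folds the two half-width code points into a single
# full-width code point in the regular Katakana block.
folded = unicodedata.normalize("NFKC", half_ha + half_voiced)
print(folded, hex(ord(folded)))   # prints the full-width character, 0x30d0
```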
The LSTM-based engine doesn't care about the majority of the fields in the unicharset. There is no need to run set_unicharset_properties if you are using combine_lang_model, and you can ignore any errors it gives you about properties not being set.
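A sketch of what that combine_lang_model invocation might look like (the paths and the langdata checkout location here are my assumptions, not taken from this thread):

```shell
# Sketch only: ./langdata and ./my_output are assumed paths.
# Builds a starter traineddata from an edited unicharset.
combine_lang_model \
  --input_unicharset ./my_output/jpn.unicharset \
  --script_dir ./langdata \
  --output_dir ./my_output \
  --lang jpn
# Warnings about unset unicharset properties can be ignored for
# LSTM training, as noted above.
```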
That aside, do you think it desirable that it should be able to output half-width codes for half-width characters? Or should it output the full-width codes when it encounters text printed as half-width?
@hoangtocdo90 Please reply to Ray's questions here.
@theraysmith Please also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/z93ZTNjhZ-A/2vrbbxOMBwAJ
copied below: Hoang Vu | Aug 7 (20 hours ago)
Hi guys! I'm trying to add around 1000 characters to my Japanese traineddata. There is a feature described here: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters that lets us add a few characters =)). But I want to add around 1000 characters. What should I do: train from scratch, train just a few layers, or fine-tune for a few characters? Can someone give me some advice? And one more question: a few days ago, a commit by Ray [add best data for 4.0 version] included a traineddata for the Japanese vertical writing style. Can we use the same data for both jpn and jpn_vert training? Sorry for my bad English!! Thanks so much!!
@hoangtocdo90 Please also see comment regarding unichar_extractor in https://github.com/tesseract-ocr/tesseract/issues/1065#issuecomment-320709273
@Shreeshrii @theraysmith thank you so much!!
Or should it output the full-width codes when it encounters text printed as half-width?
That's exactly what I want. Half-width katakana and full-width katakana are highly ambiguous with each other, but both are commonly used in Japanese. In version 3.04 it always returned full-width, whether the input image was half-width or full-width; but since I upgraded to 4.0, half-width input is usually recognized wrongly. Maybe the LSTM engine has too good a sense =)) I have a trick that works very well: I add both full-width and half-width katakana to the training text and run text2image to create the box files. Then I edit the box files, applying a half-width -> full-width conversion, and merge boxes where a half-width base character plus sound mark corresponds to a single full-width character, e.g. バ -> バ.
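A minimal sketch of that box-file edit (the function name and the exact handling are my assumptions; text2image writes one `glyph left bottom right top page` line per box):

```python
import unicodedata

# Half-width voiced and semi-voiced sound marks.
HALFWIDTH_MARKS = {"\uFF9E", "\uFF9F"}

def fold_boxes(lines):
    """Fold half-width katakana box entries to full-width.

    Each input line is 'glyph left bottom right top page'. A
    half-width sound mark is merged into the preceding box, so the
    two boxes for a base letter + mark become one box labelled with
    the single full-width glyph (NFKC does the character folding).
    """
    out = []
    for line in lines:
        glyph, l, b, r, t, page = line.split()
        if glyph in HALFWIDTH_MARKS and out:
            # Merge the mark into the previous box: widen the box
            # and NFKC-fold the combined label to one glyph.
            pg, pl, pb, pr, pt, ppage = out.pop()
            glyph = unicodedata.normalize("NFKC", pg + glyph)
            l, b = min(int(pl), int(l)), min(int(pb), int(b))
            r, t = max(int(pr), int(r)), max(int(pt), int(t))
            out.append([glyph, str(l), str(b), str(r), str(t), ppage])
        else:
            out.append([unicodedata.normalize("NFKC", glyph),
                        l, b, r, t, page])
    return [" ".join(entry) for entry in out]
```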
@theraysmith
I would like to provide a supplementary explanation as a Japanese native speaker.
Half-width katakana is not used very often in official documents and commercial printing. It is mainly used as net slang on Internet forums, or where there is a limit on the display width of characters. It is also used on devices with dot-matrix liquid crystal displays that cannot display multi-byte characters.
In any case, I would like to point out that there is a big difference in their frequency of use.
In my opinion, it is best to customize according to what kind of documents you need to read.
I'm working on training Tesseract to recognize half-width katakana. @hoangtocdo90, did you work it out, and can you give some advice? Thank you.
Related: NFKC normalization https://github.com/tesseract-ocr/tesseract/issues/1852
I'm trying to add half-width katakana characters to tesseract. I'm using "Fine Tuning for ± a few characters" as a guide.
Here are the steps that I did
After this I used lstmtraining to fine-tune the tessdata_best jpn.traineddata.
However, even though I already added the characters to the unicharset, I am still encountering the errors "Can't encode transcription" and "Encoding of string failed".
Do you have any idea on how to resolve the error?
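For context, a common cause of "Can't encode transcription" is that the traineddata passed to lstmtraining via --traineddata still carries the old unicharset, so the new characters cannot be encoded; the fine-tuning wiki builds a starter traineddata from the extended unicharset first. A sketch of the invocation (all paths here are assumptions, not taken from this thread):

```shell
# Sketch only: paths are assumed. --traineddata must point at a
# starter traineddata built (e.g. with combine_lang_model) from a
# unicharset that contains the new half-width characters; otherwise
# lstmtraining cannot encode transcriptions that use them.
lstmtraining \
  --continue_from ./jpn.lstm \
  --old_traineddata ./tessdata_best/jpn.traineddata \
  --traineddata ./my_output/jpn/jpn.traineddata \
  --model_output ./my_output/jpn_halfwidth \
  --train_listfile ./my_output/jpn.training_files.txt \
  --max_iterations 3600
```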
Please see https://github.com/tesseract-ocr/tesseract/issues/1046