tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Half-width Katakana for Japanese #81

Open Shreeshrii opened 7 years ago

Shreeshrii commented 7 years ago

Please see https://github.com/tesseract-ocr/tesseract/issues/1046

I have checked langdata/Katakana.unicharset. It doesn't contain any half-width katakana symbols. Because of this I can't make a unicharset file with all the fields set to the right values, as in this example.

theraysmith commented 7 years ago

ff00-ffef are in the forbidden_characters list for jpn. See langdata/jpn/forbidden_characters. This means they are not present in any of the Google-trained models. I don't remember how/who recommended that they should be excluded, or why, other than that they make for awkward ambiguities.
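For context, U+FF00–U+FFEF is the Unicode "Halfwidth and Fullwidth Forms" block, and the half-width katakana sit at U+FF61–U+FF9F within it. A minimal sketch (Python standard library only) showing how the two forms of the same letter differ at the code-point level:

```python
import unicodedata

half = "ｱ"   # U+FF71 HALFWIDTH KATAKANA LETTER A
full = "ア"   # U+30A2 KATAKANA LETTER A

# Same letter, but the two forms live in different Unicode blocks.
print(hex(ord(half)))  # 0xff71 -- inside the forbidden FF00-FFEF range
print(hex(ord(full)))  # 0x30a2 -- the ordinary Katakana block

# The East Asian Width property distinguishes them:
# 'H' (halfwidth) vs 'W' (wide).
print(unicodedata.east_asian_width(half))  # H
print(unicodedata.east_asian_width(full))  # W
```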

The LSTM-based engine doesn't care about the majority of the fields in the unicharset. There is no need to run set_unicharset_properties if you are using combine_lang_model, and you can ignore any errors it gives you about properties not being set.

That aside, do you think it desirable that it should be able to output half-width codes for half-width characters? Or should it output the full-width codes when it encounters text printed as half-width?

Shreeshrii commented 7 years ago

@hoangtocdo90 Please reply to Ray's questions here.

Shreeshrii commented 7 years ago

@theraysmith Please also see https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/z93ZTNjhZ-A/2vrbbxOMBwAJ

copied below: Hoang Vu | Aug 7 (20 hours ago)

Hi guys! I'm trying to add around 1000 characters to my Japanese traineddata. Following https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-%C2%B1-a-few-characters we can add a few characters, but I want to add around 1000. What should I do: train from scratch, train just a few layers, or fine-tune for a few characters? Can someone give me advice? And one more question: a few days ago a commit of Ray's [adding best data for the 4.0 version] included a Japanese vertical writing style traineddata. Can we use the same data for both jpn and jpn_vert training? Sorry for my bad English! Thanks so much!!

Shreeshrii commented 7 years ago

@hoangtocdo90 Please also see comment regarding unichar_extractor in https://github.com/tesseract-ocr/tesseract/issues/1065#issuecomment-320709273

hoangtocdo90 commented 7 years ago

@Shreeshrii @theraysmith thank you so much!!

> Or should it output the full-width codes when it encounters text printed as half-width?

That's all I want. Half-width and full-width katakana have big ambiguities, but both forms are commonly used in Japanese. In version 3.04 it always returned full-width, whether the input image was half-width or full-width, but after I upgraded to version 4.0 half-width text usually comes back wrong. Maybe the LSTM engine is more sensitive. I have a trick that works very well: I add both full-width and half-width katakana to the training text and run text2image to create the box files. Then I edit the box files, applying a half-width -> full-width conversion, and merge boxes for cases like ﾊﾞ -> バ.
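The half-width → full-width mapping described in this trick is exactly what Unicode NFKC normalization performs, including recombining the separate half-width dakuten mark. A minimal sketch with the Python standard library (not the exact script used in the thread):

```python
import unicodedata

# NFKC folds half-width katakana into their full-width equivalents,
# and composes the standalone voiced sound mark: ﾊ + ﾞ -> バ.
halfwidth = "ﾊﾞﾝﾀﾞｲ"
fullwidth = unicodedata.normalize("NFKC", halfwidth)
print(fullwidth)  # バンダイ

# Text that is already full-width katakana passes through unchanged,
# so the mapping is safe to apply to mixed transcriptions.
assert unicodedata.normalize("NFKC", "バンダイ") == "バンダイ"
```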

atuyosi commented 7 years ago

@theraysmith

I would like to provide a supplementary explanation as a Japanese native speaker.

Half-width katakana is not used very often in official documents or commercial print. It is mainly used as net slang on Internet forums, or when there is a limit on the display width of characters. It is also used on devices with dot-matrix liquid crystal displays that cannot display multi-byte characters.

In any case, I would like to point out that there is a big difference in their frequency of use.

In my opinion, it is best to customize the training according to the kind of documents to be read.

xinqinglxl commented 6 years ago

I'm working on training Tesseract to recognize half-width katakana. I wonder, @hoangtocdo90, did you work it out, and can you give some advice? Thank you.

Shreeshrii commented 6 years ago

Related: NFKC normalization https://github.com/tesseract-ocr/tesseract/issues/1852

rennnenen commented 4 years ago

I'm trying to add half-width katakana characters to tesseract. I'm using "Fine Tuning for ± a few characters" as a guide.

Here are the steps I took:

  1. Removed the line for the Unicode block containing the half-width forms from "forbidden_characters".
  2. Added the unicharset generated with "unicharset_extractor" to jpn.unicharset and Katakana.unicharset in the langdata_lstm scripts.
  3. Added half-width katakana characters and words to jpn.wordlist.
  4. Added those words and characters to jpn.training_text.
  5. Generated training images using "tesstrain.sh".

After this I used lstmtraining to fine-tune the tessdata_best jpn.traineddata.

However, even though I already added the unicharset, I am still encountering the errors "Can't encode transcription" and "Encoding of string failed".

Do you have any idea on how to resolve the error?
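For reference, "Can't encode transcription" generally means the ground-truth text contains a character that the unicharset used for training cannot encode. A hedged diagnostic sketch (the helper names and the simplified parsing are assumptions, not part of the Tesseract tooling) that lists transcription characters missing from a unicharset:

```python
def load_unicharset_lines(lines):
    """First whitespace-separated token of each line (after the one-line
    size header) is the unichar itself."""
    return {line.split(" ")[0] for line in lines[1:] if line}

def missing_chars(unicharset, text):
    """Characters in the transcription that the unicharset cannot encode.
    (Simplified: real unicharsets may also contain multi-char entries.)"""
    return sorted({ch for ch in text if not ch.isspace()} - unicharset)

# Tiny made-up unicharset: header line, then one unichar per line.
demo = ["3", "ア 0 ...", "イ 0 ...", "ン 0 ..."]
charset = load_unicharset_lines(demo)

# Half-width ｱ is not in the set, so encoding this transcription fails.
print(missing_chars(charset, "ｱイン"))  # ['ｱ']
```

Running this over the real training text against the unicharset extracted from the starting traineddata (e.g. via combine_tessdata -u) should point at the offending characters.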