Best tessdata Feedback - Chinese

Shreeshrii commented 7 years ago

Ref: https://groups.google.com/d/msgid/tesseract-ocr/8cc88ed2-99c3-445e-b758-83ade0f680aa%40googlegroups.com?utm_medium=email

copied below

Good day!

Recently I was using Tesseract (4.0 alpha) to do Chinese OCR and it works really well. Now I want to pick the best model to use, but I find several versions. What is the difference between them?

1. chi_sim from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files (around 50M)
2. chi_sim from https://github.com/tesseract-ocr/tessdata/tree/master/best (around 13M)
3. chi_sim_vert from https://github.com/tesseract-ocr/tessdata/tree/master/best (around 13M)
4. HanS from https://github.com/tesseract-ocr/tessdata/tree/master/best (around 16M)

All of them work, but the results differ slightly. In my own evaluation #4 is the best, but I have no insight into why.

I would appreciate any help.

Shreeshrii commented 7 years ago

Please see https://github.com/tesseract-ocr/tessdata/commit/3a94ddd47be01fd897cbe31f05cbd2301454cf8a#commitcomment-23584234 which explains the difference between jpn and Japanese.

Similar logic will apply for Chinese.

copied below for easy ref:

'jpn' contains whatever appears on the www that is labelled as the language, trained only with fonts that can render Japanese. As with most of the other Script traineddatas, 'Japanese' contains all the languages that use that script (in this case just the one) PLUS English. The resulting model is trained with a mix of both training sets, with the expectation that some of the generalization to 4500 English training fonts will also apply to the other script that has a lot less. I haven't thoroughly tested whether this works, so I am interested to get feedback on it.

'jpn_vert' is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).

'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.
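
For anyone who wants to repeat the comparison asked about above, here is a minimal Python sketch that builds one `tesseract` command line per candidate model. The helper name `build_tesseract_command`, the file names, and the `tessdata_best` directory are hypothetical. The sketch relies only on the standard `-l` option (with `+` to combine models) and `--tessdata-dir`.

```python
from typing import List, Optional


def build_tesseract_command(image: str, out_base: str, langs: List[str],
                            tessdata_dir: Optional[str] = None) -> List[str]:
    """Build an argv list for the tesseract CLI (hypothetical helper)."""
    cmd = ["tesseract", image, out_base]
    if tessdata_dir is not None:
        # Directory that holds the *.traineddata files to use.
        cmd += ["--tessdata-dir", tessdata_dir]
    # Several models are joined with '+', e.g. chi_sim+chi_sim_vert.
    cmd += ["-l", "+".join(langs)]
    return cmd


if __name__ == "__main__":
    # Candidate models named in this thread; paths are placeholders.
    candidates = [["chi_sim"], ["chi_sim_vert"], ["HanS"],
                  ["chi_sim", "chi_sim_vert"]]
    for langs in candidates:
        out_base = "out_" + "_".join(langs)
        cmd = build_tesseract_command("page.png", out_base, langs,
                                      "tessdata_best")
        print(" ".join(cmd))
```

Each printed line can be run in a shell, or the list can be passed to `subprocess.run`, provided Tesseract is installed and the traineddata files are in that directory. Comparing the resulting text files against a hand-checked transcription is one way to judge which model suits a given set of images.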

Shreeshrii commented 7 years ago

Unrecognized characters in the chi_sim traineddata model

ref: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/vp2yug1Jjko/eIP4azwnBAAJ

partially copied below


I can recognize most of the characters, such as the Han characters and the Latin alphabet.
But I cannot make sense of some entries, such as 'Joined' and ' |Broken|0|1' at the file header, and
|"|0|2 and |m|0|2 at the end of the file.

Can you explain what these entries mean?
4059    ki
4060    |ki|0|2
4061    |ki|1|2
4062    |in|0|2
4063    |in|1|2
 and so on

Thanks a lot.

amitdo commented 7 years ago

'Joined', ' |Broken|0|1'

Those two also appear in other traineddata files.
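
Entries of the form `|text|index|total` look like character-fragment markers, i.e. piece `index` of `total` pieces of `text`. That reading is an assumption and is not confirmed in this thread. Under that assumption, a small Python sketch (the function name `parse_fragment` is hypothetical) can separate such entries from ordinary ones when inspecting a unicharset listing:

```python
import re
from typing import Optional, Tuple

# Assumed pattern: |<text>|<index>|<total>, e.g. |ki|0|2
FRAGMENT_RE = re.compile(r"\|(.+)\|(\d+)\|(\d+)")


def parse_fragment(entry: str) -> Optional[Tuple[str, int, int]]:
    """Return (text, index, total) for a fragment-style entry, else None."""
    match = FRAGMENT_RE.fullmatch(entry)
    if match is None:
        return None
    text, index, total = match.groups()
    return text, int(index), int(total)


if __name__ == "__main__":
    # Sample entries taken from the question above.
    for entry in ["ki", "|ki|0|2", "|ki|1|2", "|in|0|2", '|"|0|2']:
        print(repr(entry), "->", parse_fragment(entry))
```

Plain entries such as `ki` yield `None`, while `|ki|0|2` and `|ki|1|2` are reported as the two assumed pieces of `ki`.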
