Open Shreeshrii opened 7 years ago
Please see https://github.com/tesseract-ocr/tessdata/commit/3a94ddd47be01fd897cbe31f05cbd2301454cf8a#commitcomment-23584234 which explains the difference between jpn and Japanese.
Similar logic will apply for Chinese.
'jpn' contains whatever appears on the www that is labelled as the language, trained only with fonts that can render Japanese. As with most of the other Script traineddatas, 'Japanese' contains all the languages that use that script (in this case just the one) PLUS English. The resulting model is trained with a mix of both training sets, with the expectation that some of the generalization to 4500 English training fonts will also apply to the other script that has a lot less. I haven't thoroughly tested whether this works, so I am interested to get feedback on it.
'jpn_vert' is trained on text rendered vertically (but the image is rotated so the long edge is still horizontal).
'jpn' loads 'jpn_vert' as a secondary language so it can try it in case the text is rendered vertically. This seems to work most of the time as a reasonable solution.
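As a rough illustration of the point above, here is a minimal sketch that only builds the command line for the `tesseract` CLI with its `-l` language option. The helper name `tesseract_command` and the file names `page.png` and `out` are hypothetical; nothing is executed, and the traineddata files are assumed to be installed.

```python
def tesseract_command(image_path, output_base, lang="jpn"):
    """Build an argv list for the tesseract CLI (hypothetical helper; runs nothing)."""
    return ["tesseract", image_path, output_base, "-l", lang]


# Per the comment quoted above, 'jpn' already pulls in 'jpn_vert' as a
# secondary language, so vertical text may be handled without extra options.
default_cmd = tesseract_command("page.png", "out", lang="jpn")

# Selecting the vertical model directly, for pages known to be vertical.
vertical_cmd = tesseract_command("page.png", "out", lang="jpn_vert")

print(" ".join(default_cmd))
print(" ".join(vertical_cmd))
```

The resulting list can be passed to `subprocess.run` if you actually want to run the OCR.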
Unrecognized characters in the chi_sim traineddata model
I can recognize most of the characters, such as the Han characters and the Latin alphabet.
But some entries, such as 'Joined' and ' |Broken|0|1' in the file header, and
|"|0|2 and |m|0|2 at the end of the file, I cannot make sense of.
Can you explain what these entries mean?
4059 ki
4060 |ki|0|2
4061 |ki|1|2
4062 |in|0|2
4063 |in|1|2
and so on
Thanks a lot.
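For anyone wanting to list these entries, here is a small sketch that splits lines of the shape `|text|N|M` into their parts. The field names `text`, `index` and `total` are my own assumption based on how the quoted lines look; the thread itself does not confirm what the two numbers mean. `split_entry` is a hypothetical helper, not part of Tesseract.

```python
import re

# Assumed layout: a leading '|', some text, then two integers separated by '|'.
ENTRY = re.compile(r"^\|(?P<text>.+)\|(?P<index>\d+)\|(?P<total>\d+)$")


def split_entry(entry):
    """Return (text, index, total) for a '|text|N|M' entry, or None otherwise."""
    match = ENTRY.match(entry.strip())
    if match is None:
        return None  # an ordinary entry such as 'ki' or 'Joined'
    return match.group("text"), int(match.group("index")), int(match.group("total"))


for sample in ["ki", "|ki|0|2", "|ki|1|2", "|in|0|2", " |Broken|0|1"]:
    print(repr(sample), "->", split_entry(sample))
```

Running this over the file would at least show which entries follow the piped pattern and which are plain characters.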
'Joined', ' |Broken|0|1'
Those two also appear in other traineddata files.
Ref: https://groups.google.com/d/msgid/tesseract-ocr/8cc88ed2-99c3-445e-b758-83ade0f680aa%40googlegroups.com?utm_medium=email
copied below
Good day!
Recently I have been using tesseract (4.0 alpha) for Chinese OCR and it works really well. Now I want to pick the best model to use, but I find several versions. What is the difference between them?
All of them work, but the results are slightly different. In my own evaluation #4 is the best, but I don't have any insight into why.
I would appreciate any help.