Open sethleech opened 6 years ago
Based on comments by @theraysmith, none of the other properties are required for LSTM training.
On 24-Oct-2017 12:34 PM, "sethleech" notifications@github.com wrote:
How to Add or Edit [script].unicharset in langdata folder?
- I want to know how to get 'glyph_metrics' data from [font or several fonts].
Dear all,
I have been trying Tesseract recently, and it is a really good product. Is there a tutorial or a set of steps on how to add or edit [script].unicharset, for example han.unicharset?
I want to add missing characters, namely Unicode characters from CJK Extensions B, C, D, E and F:
- CJK Unified Ideographs Extension B: U+20000–U+2A6D6
- CJK Unified Ideographs Extension C: U+2A700–U+2B734
- CJK Unified Ideographs Extension D: U+2B740–U+2B81D
- CJK Unified Ideographs Extension E: U+2B820–U+2CEA1
- CJK Unified Ideographs Extension F: U+2CEB0–U+2EBE0
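To find out which characters in a training text actually fall in these extensions, a small check like the following may help. It is only a sketch: the ranges are copied from the list above, and the names `CJK_EXTENSIONS`, `cjk_extension` and `missing_chars` are made up for illustration, not part of Tesseract.

```python
# Ranges copied from the post above; all names here are illustrative.
CJK_EXTENSIONS = {
    "B": (0x20000, 0x2A6D6),
    "C": (0x2A700, 0x2B734),
    "D": (0x2B740, 0x2B81D),
    "E": (0x2B820, 0x2CEA1),
    "F": (0x2CEB0, 0x2EBE0),
}


def cjk_extension(ch):
    """Return 'B'..'F' if ch lies in one of the ranges above, else None."""
    cp = ord(ch)
    for name, (lo, hi) in CJK_EXTENSIONS.items():
        if lo <= cp <= hi:
            return name
    return None


def missing_chars(text, known):
    """Extension B-F characters that occur in text but not in `known`."""
    return sorted({c for c in text if cjk_extension(c) and c not in known})
```

Here `known` would be the set of characters already present in the script unicharset (the first field of each entry line), and `text` the training text.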
For reference, this is what I tried when training Tesseract.

1st try:

unicharset_extractor
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box
Output (unicharset):
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

set_unicharset_properties
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
Warning: properties incomplete for index 4 = 𥮗
Output ([lang].unicharset), not changed:
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x
2nd try:

I edited langdata/han.unicharset. On line 0 I changed the entry count from 23514 to 23515, and I added a new line at the end of the file:
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 23514 0 23514 𥮗 # 𥮗 [25b97 ]x
The values 61,64,255,255,188,200,6,11,205,224 were copied from an arbitrary other line (e.g. line 67).

unicharset_extractor
tesseract-ocr/unicharset_extractor -D [lang] [lang]/[lang].[font].exp0.box
Output (unicharset):
𥮗 1 0,255,0,255,0,0,0,0,0,0 NULL 4 0 0 # 𥮗 [25b97 ]x

set_unicharset_properties
tesseract-ocr/set_unicharset_properties -U unicharset -O [lang].unicharset --script_dir=langdata/[lang] --X langdata/[lang]/han.xheights
No warning this time.
Output ([lang].unicharset), changed:
𥮗 1 61,64,255,255,188,200,6,11,205,224 Han 4 0 4 𥮗 # 𥮗 [25b97 ]x
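The manual edit in the 2nd try (bump the count on line 0, then append an entry) could be scripted roughly as below. This is a sketch under the assumption that the file layout is exactly what the post describes: an entry count on the first line followed by one entry per line. `append_unichar` is a hypothetical helper, not a Tesseract tool, and it does not validate the entry itself.

```python
def append_unichar(unicharset_text, new_entry):
    """Return unicharset text with new_entry appended and the count bumped.

    Assumes line 0 holds the entry count, as described in the post.
    """
    lines = unicharset_text.splitlines()
    if not lines:
        raise ValueError("empty unicharset text")
    count = int(lines[0].strip())      # line 0: number of entries
    new_char = new_entry.split()[0]    # first field is the character
    for line in lines[1:]:
        if line.split()[:1] == [new_char]:
            raise ValueError("character already present: " + new_char)
    lines[0] = str(count + 1)
    lines.append(new_entry)
    return "\n".join(lines) + "\n"
```

Working on the text rather than the file keeps the sketch easy to try on a copy before touching langdata/han.unicharset.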
I found out
- [script].unicharset file is officially supported.
- entry properties : 'character' 'properties' 'glyph_metrics' 'script' 'other_case' 'direction' 'mirror' 'normed_form'
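To make the field order above concrete, here is a minimal parser sketch for one entry line. It assumes the fields appear in exactly the order listed, separated by spaces, with an optional trailing `# ...` comment; `UnicharEntry` and `parse_entry` are illustrative names. The glyph_metrics numbers are kept as a plain tuple because their individual meaning is not stated in this thread.

```python
from typing import NamedTuple, Optional, Tuple


class UnicharEntry(NamedTuple):
    character: str
    properties: str
    glyph_metrics: Tuple[int, ...]
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: Optional[str]  # absent in some lines shown in the post


def parse_entry(line):
    """Split one unicharset entry line into the fields listed above."""
    # Drop the trailing comment. A line whose character is '#' itself
    # starts with '#', not ' # ', so it is left alone by this split.
    body = line.split(" # ", 1)[0]
    fields = body.split()
    if len(fields) < 7:
        raise ValueError("too few fields: " + line)
    return UnicharEntry(
        character=fields[0],
        properties=fields[1],
        glyph_metrics=tuple(int(v) for v in fields[2].split(",")),
        script=fields[3],
        other_case=int(fields[4]),
        direction=int(fields[5]),
        mirror=int(fields[6]),
        normed_form=fields[7] if len(fields) > 7 else None,
    )
```

Comparing the parsed `script` and `glyph_metrics` of the lines from the two tries shows what set_unicharset_properties filled in from han.unicharset.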
How can I get 'glyph_metrics' data from [font or several fonts]?
Thank you in advance.
Regards,
View it on GitHub: https://github.com/tesseract-ocr/langdata/issues/99
My project runs on an Android device. For now, Tesseract 4.0 can't be used on Android because of build issues related to AVX and SSE, so I can only use Tesseract 3.05.01.
Is there any further information, please?
I have the same question.