Found AVX2
Found AVX
Found SSE
A few characters I need are missing from the Korean unicharset, so I merged Hangul.unicharset with eng.unicharset.
1. To extract the kor.lstm file, I used the command below:
!combine_tessdata -e ./tessdata_best/kor.traineddata kor.lstm
2. I generated the unicharset with the command below:
!merge_unicharsets ./langdata/Hangul.unicharset ./langdata/eng/eng.unicharset ./kor.unicharset
I then replaced the kor.unicharset in langdata/kor/ with the generated file.
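To confirm that a merged unicharset really covers the training text, a quick coverage check can help. The following is a minimal sketch, not part of the original workflow. It assumes the plain-text unicharset layout in which the first line holds the entry count, each following line starts with the unichar and a space, and the space character is stored as the literal `NULL`. The helper names `load_unichars` and `missing_chars` are hypothetical, and multi-character entries are ignored.

```python
from pathlib import Path


def load_unichars(unicharset_path):
    """Return the set of unichars listed in a Tesseract unicharset file."""
    lines = Path(unicharset_path).read_text(encoding="utf-8").splitlines()
    unichars = set()
    # Line 0 is assumed to hold the entry count; entries follow, one per line.
    for line in lines[1:]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        # Assumption: the space character is stored as the literal "NULL".
        unichars.add(" " if token == "NULL" else token)
    return unichars


def missing_chars(text, unichars):
    """Return the non-whitespace characters of `text` with no unicharset entry."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in unichars})
```

Calling `missing_chars(training_text, load_unichars("langdata/kor/kor.unicharset"))` on the contents of the training text file would list any characters that the unicharset cannot encode. An empty list means every character is covered.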
3. I created a train folder for the generated TIFF/box files and the training text file, uploaded the fonts to the fonts folder, and ran the code below:
!rm -rf train/*
! ./tesseract/src/training/tesstrain.sh --fonts_dir fonts \
--fontlist 'fontname' \
--noextract_font_properties \
--lang kor \
--linedata_only \
--langdata_dir langdata/ \
--tessdata_dir ./tesseract/tessdata/ \
--save_box_tiff \
--maxpages 20 \
--output_dir train
Training output:
Loaded file ./kor.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 144 to 144!
Num (Extended) outputs,weights in Series:
1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx512:512, 1247232
Fc144:144, 73872
Total weights = 1477936
Previous null char=143 mapped to 143
Continuing from ./kor.lstm
Loaded 848/848 pages (1-848) of document train/kor.fontname.exp0.lstmf
Loaded 848/848 pages (1-848) of document train/kor.fontname.exp0.lstmf
Iteration 0: ALIGNED TRUTH : 드라이브 모드가 변경될 때 변경 정보가 화면에 표시되지 않습니다.
Iteration 0: BEST OCR TEXT : 드라이브 모드가 변경될 때 변경 정보가 화면에 표시되지 않습니다.
File /tmp/kor-2023-02-09.o0I/kor.HyundaiSansUI_JP_KR_Latin.exp0.lstmf page 149 (Perfect):
Mean rms=0.097%, delta=0%, train=0%(0%), skip ratio=0%
Iteration 1: ALIGNED TRUTH : 도로 표지판과 신호체계는 수시로 변경될 수 있으므로 항법 장치에 의해 경로 안내를 받을 때에도 반드시 실제의 교통법규를 준수하여 운전하셔야 합니다. 모든 조작은 반드시 정
Iteration 1: BEST OCR TEXT : 도로 표지판과 신호체계는 수시로 변경될 수 있으므로 항법 장치에 의해 경로 안내를 받을 때에도 반드시 실제의 교통법규를 준수하여 운전하셔야 합니다. 모든 조작은 반드시 정
File /tmp/kor-2023-02-09.o0I/kor.HyundaiSansUI_JP_KR_Latin.exp0.lstmf page 13 (Perfect):
Training completed; the checkpoint with the lowest error was kor1.752_420.checkpoint, saved alongside kor_checkpoint. I then evaluated it with the command below:
!lstmeval --model output/kor1.752_420.checkpoint \
--traineddata ./tessdata_best/kor.traineddata \
--eval_listfile train/kor.training_files.txt
Evaluation output:
output/kor1.752_420.checkpoint is not a recognition model, trying training checkpoint...
Loaded 848/848 pages (1-848) of document train/kor.HyundaiSansUI_JP_KR_Latin.exp0.lstmf
Compute CTC targets failed!
Encoding of string failed! Failure bytes: ffffffeb ffffff81 ffffff95 ffffffeb ffffff8b ffffff88 ffffffeb ffffff8b ffffffa4 2e
Can't encode transcription: '전방 안전 기능을 끕니다.' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Truth:초 이력
OCR :주이력
Compute CTC targets failed!
Encoding of string failed! Failure bytes: ffffffec ffffff95 ffffffb1 ffffffec ffffff9d ffffff84 20 ffffffec ffffff8b ffffffa4 ffffffed ffffff96 ffffff89 ffffffed ffffff95 ffffff98 ffffffec ffffff8b ffffffad ffffffec ffffff8b ffffff9c ffffffec ffffff98 ffffffa4 2e
Can't encode transcription: '휴대폰에서 음악 앱을 실행하십시오.' in language ''
Truth:우측결과를 저장 중입니다. 잠시만 기다려 주십시오.
OCR :츠집오
Compute CTC targets failed!
Encoding of string failed! Failure bytes: ffffffeb ffffff81 ffffff95 ffffffeb ffffff8b ffffff88 ffffffeb ffffff8b ffffffa4 2e
Can't encode transcription: '전방 안전 기능을 끕니다.' in language ''
At iteration 0, stage 0, Eval Char error rate=1.8989044, Word error rate=2.4764151
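The "Failure bytes" dumps above are easier to read once they are decoded back into text. The sketch below assumes each token in the dump is a hex value and that bytes of 0x80 and above are printed sign-extended (for example `ffffffeb`), so only the low 8 bits are kept. The function name `decode_failure_bytes` is hypothetical.

```python
def decode_failure_bytes(dump):
    """Turn a 'Failure bytes:' dump from the log into the text it encodes."""
    # Each token is a hex value; bytes >= 0x80 appear sign-extended
    # (e.g. ffffffeb), so only the low 8 bits are kept.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8", errors="replace")


first = decode_failure_bytes(
    "ffffffeb ffffff81 ffffff95 ffffffeb ffffff8b ffffff88 "
    "ffffffeb ffffff8b ffffffa4 2e"
)
print(first)               # 끕니다.
print(hex(ord(first[0])))  # 0xb055
```

The first dump decodes to "끕니다." and the second to "앱을 실행하십시오.", which are the tails of the two transcriptions that could not be encoded. Each dump begins at 끕 (U+B055) or 앱 (U+C571), which would be consistent with those two characters being absent from the unicharset in the traineddata passed to lstmeval, though the log alone does not prove it.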
6. I saved the trained model with the command below:
!lstmtraining --stop_training --continue_from output/kor_checkpoint \
--traineddata ./tessdata_best/kor.traineddata --model_output TrainedModel/kor.traineddata
I wrote the Python code below to extract the text:
import pytesseract

# Point Tesseract at the directory that holds the fine-tuned kor.traineddata.
tessdata_dir_config = r'--tessdata-dir "./TrainedModel/"'
text = pytesseract.image_to_string('MicrosoftTeams-image (8).png', lang='kor', config=tessdata_dir_config)
print(text)
It returns junk characters, as shown below; the strings do not match the text in the image:
5
10
디어
힌
0
자동 줄이기
13
@ [ㆍ 오디오 음
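One thing worth ruling out with output like this is that Tesseract is loading a different model than the fine-tuned one. The sketch below checks that `kor.traineddata` exists in the given directory and passes an absolute path to `--tessdata-dir`. The helpers `traineddata_path` and `ocr_with_model` are hypothetical names introduced for illustration, and this is a diagnostic aid rather than a fix.

```python
from pathlib import Path


def traineddata_path(tessdata_dir, lang):
    """Return the path of <lang>.traineddata in tessdata_dir, or raise if absent."""
    path = Path(tessdata_dir) / f"{lang}.traineddata"
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found")
    return path


def ocr_with_model(image_path, tessdata_dir, lang="kor"):
    """Run pytesseract against the model in tessdata_dir (hypothetical helper)."""
    import pytesseract  # imported lazily so the path check works without it

    traineddata_path(tessdata_dir, lang)  # fail early if the model is missing
    # An absolute path avoids surprises from the notebook's working directory.
    config = f'--tessdata-dir "{Path(tessdata_dir).resolve()}"'
    return pytesseract.image_to_string(str(image_path), lang=lang, config=config)
```

With the layout described here, the call would be `ocr_with_model('MicrosoftTeams-image (8).png', './TrainedModel/')`.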
Could anyone please tell me what the error is and how to fix it?
Basic Information
Tesseract version : tesseract 4.0.0-beta.1 leptonica-1.75.3 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Operating System
Ubuntu 20.04 Focal
Other Operating System
No response
uname -a
No response
Compiler
No response
Virtualization / Containers
No response
CPU
No response
Current Behavior
No response
Expected Behavior
No response
Suggested Fix
No response
Other Information
No response