tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
Apache License 2.0
61.08k stars, 9.39k forks

Text detection is not working properly for the Korean language #4016

Open vamshi-1611 opened 1 year ago

vamshi-1611 commented 1 year ago

Basic Information

Tesseract version:

    tesseract 4.0.0-beta.1
     leptonica-1.75.3
      libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
     Found AVX2
     Found AVX
     Found SSE

A few new unichars are not available in the Korean language data, so I used Hangul.unicharset and combined it with eng.unicharset.

  1. To generate the kor.lstm file, I used the command below:

     !combine_tessdata -e ./tessdata_best/kor.traineddata kor.lstm

  2. Generated the unicharset with the command below, then replaced the kor.unicharset in langdata/kor/ with the generated one:

     !merge_unicharsets ./langdata/Hangul.unicharset ./langdata/eng/eng.unicharset ./kor.unicharset

  3. Created a train folder to hold the generated TIFF files with boxes and the training text file, uploaded the fonts to the fonts folder, and ran the code below:

     !rm -rf train/*
     ! ./tesseract/src/training/tesstrain.sh --fonts_dir fonts \
         --fontlist 'fontname' \
         --noextract_font_properties \
         --lang kor \
         --linedata_only \
         --langdata_dir langdata/ \
         --tessdata_dir ./tesseract/tessdata/ \
         --save_box_tiff \
         --maxpages 20 \
         --output_dir train
  4. Started fine-tuning with the command below:

     !rm -rf output/*
     !lstmtraining \
         --model_output output/kor \
         --continue_from ./kor.lstm \
         --traineddata ./tesseract/tessdata/kor.traineddata \
         --old_traineddata tessdata_best/kor.traineddata \
         --train_listfile ./train/kor.training_files.txt \
         --target_error_rate 0.1 \
         --debug_interval -1

Output:

    Loaded file ./kor.lstm, unpacking...
    Warning: LSTMTrainer deserialized an LSTMRecognizer!
    Code range changed from 144 to 144!
    Num (Extended) outputs,weights in Series:
      1,48,0,1:1, 0
    Num (Extended) outputs,weights in Series:
      C3,3:9, 0
      Ft16:16, 160
    Total weights = 160
      [C3,3Ft16]:16, 160
      Mp3,3:16, 0
      Lfys64:64, 20736
      Lfx96:96, 61824
      Lrx96:96, 74112
      Lfx512:512, 1247232
      Fc144:144, 73872
    Total weights = 1477936
    Previous null char=143 mapped to 143
    Continuing from ./kor.lstm
    Loaded 848/848 pages (1-848) of document train/kor.fontname.exp0.lstmf
    Loaded 848/848 pages (1-848) of document train/kor.fontname.exp0.lstmf
    Iteration 0: ALIGNED TRUTH : 드라이브 모드가 변경될 때 변경 정보가 화면에 표시되지 않습니다.
    Iteration 0: BEST OCR TEXT : 드라이브 모드가 변경될 때 변경 정보가 화면에 표시되지 않습니다.
    File /tmp/kor-2023-02-09.o0I/kor.HyundaiSansUI_JP_KR_Latin.exp0.lstmf page 149 (Perfect): Mean rms=0.097%, delta=0%, train=0%(0%), skip ratio=0%
    Iteration 1: ALIGNED TRUTH : 도로 표지판과 신호체계는 수시로 변경될 수 있으므로 항법 장치에 의해 경로 안내를 받을 때에도 반드시 실제의 교통법규를 준수하여 운전하셔야 합니다. 모든 조작은 반드시 정
    Iteration 1: BEST OCR TEXT : 도로 표지판과 신호체계는 수시로 변경될 수 있으므로 항법 장치에 의해 경로 안내를 받을 때에도 반드시 실제의 교통법규를 준수하여 운전하셔야 합니다. 모든 조작은 반드시 정
    File /tmp/kor-2023-02-09.o0I/kor.HyundaiSansUI_JP_KR_Latin.exp0.lstmf page 13 (Perfect):
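As a sanity check on the network summary in the training log, the per-layer weight counts can be reproduced by hand. The sketch below assumes the usual parameter counts (four gates plus a bias term for each LSTM layer, one bias per output for the convolution and fully connected layers) and a single-channel input to the 3x3 convolution; the helper names `lstm_weights` and `dense_weights` are illustrative and not part of Tesseract.

```python
def lstm_weights(n_in: int, n_out: int) -> int:
    """Weights of one LSTM layer: 4 gates, each fed by input + recurrent state + bias."""
    return 4 * (n_in + n_out + 1) * n_out


def dense_weights(n_in: int, n_out: int) -> int:
    """Weights of a fully connected (or flattened convolution) layer with bias."""
    return (n_in + 1) * n_out


# Layer sizes read from the log; input sizes are the previous layer's outputs.
layers = {
    "[C3,3Ft16]": dense_weights(3 * 3 * 1, 16),  # 3x3 window over 1 channel -> 16
    "Lfys64": lstm_weights(16, 64),
    "Lfx96": lstm_weights(64, 96),
    "Lrx96": lstm_weights(96, 96),
    "Lfx512": lstm_weights(96, 512),
    "Fc144": dense_weights(512, 144),
}

total = sum(layers.values())
for name, count in layers.items():
    print(f"{name}: {count}")
print("Total weights =", total)
```

Under these assumptions the computed counts agree with the values reported in the log, including the total of 1477936.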

  5. After training completed, the lowest-error checkpoint was kor1.752_420.checkpoint (alongside kor_checkpoint). I evaluated it with the command below:

     !lstmeval --model output/kor1.752_420.checkpoint \
         --traineddata ./tessdata_best/kor.traineddata \
         --eval_listfile train/kor.training_files.txt

Output:

    output/kor1.752_420.checkpoint is not a recognition model, trying training checkpoint...
    Loaded 848/848 pages (1-848) of document train/kor.HyundaiSansUI_JP_KR_Latin.exp0.lstmf
    Compute CTC targets failed!
    Encoding of string failed! Failure bytes: ffffffeb ffffff81 ffffff95 ffffffeb ffffff8b ffffff88 ffffffeb ffffff8b ffffffa4 2e
    Can't encode transcription: '전방 안전 기능을 끕니다.' in language ''
    Compute CTC targets failed!
    Compute CTC targets failed!
    Truth:초 이력
    OCR :주이력
    Compute CTC targets failed!
    Encoding of string failed! Failure bytes: ffffffec ffffff95 ffffffb1 ffffffec ffffff9d ffffff84 20 ffffffec ffffff8b ffffffa4 ffffffed ffffff96 ffffff89 ffffffed ffffff95 ffffff98 ffffffec ffffff8b ffffffad ffffffec ffffff8b ffffff9c ffffffec ffffff98 ffffffa4 2e
    Can't encode transcription: '휴대폰에서 음악 앱을 실행하십시오.' in language ''
    Truth:우측결과를 저장 중입니다. 잠시만 기다려 주십시오.
    OCR :츠집오
    Compute CTC targets failed!
    Encoding of string failed! Failure bytes: ffffffeb ffffff81 ffffff95 ffffffeb ffffff8b ffffff88 ffffffeb ffffff8b ffffffa4 2e
    Can't encode transcription: '전방 안전 기능을 끕니다.' in language ''
    At iteration 0, stage 0, Eval Char error rate=1.8989044, Word error rate=2.4764151
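The "Failure bytes" in the evaluation output look like UTF-8 bytes printed as sign-extended 32-bit values (hence the leading `ffffff`). A minimal sketch for turning such a dump back into text, assuming that interpretation is right, is shown here; `decode_failure_bytes` is an illustrative helper, not a Tesseract or pytesseract function.

```python
def decode_failure_bytes(dump: str) -> str:
    """Decode a space-separated hex dump of sign-extended bytes as UTF-8 text."""
    # Keep only the low 8 bits of each token, e.g. 'ffffffeb' -> 0xeb.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8", errors="replace")


# First failure dump copied from the lstmeval output.
first_dump = (
    "ffffffeb ffffff81 ffffff95 ffffffeb ffffff8b ffffff88 "
    "ffffffeb ffffff8b ffffffa4 2e"
)
print(decode_failure_bytes(first_dump))
```

For the first dump this yields `끕니다.`, which is the tail of the transcription '전방 안전 기능을 끕니다.'. That suggests (as an inference, not something the log states outright) that encoding stopped at the character 끕, so it may be worth checking whether that character is present in the unicharset of the traineddata passed to lstmeval.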

  6. Saved the trained model with the command below:

     !lstmtraining --stop_training --continue_from output/kor_checkpoint --traineddata ./tessdata_best/kor.traineddata --model_output TrainedModel/kor.traineddata

I wrote the Python code below to extract the text:

    from PIL import Image
    import re
    import pytesseract

    # Point Tesseract at the directory containing the fine-tuned kor.traineddata.
    tessdata_dir_config = r'--tessdata-dir "./TrainedModel/"'
    a = pytesseract.image_to_string('MicrosoftTeams-image (8).png', lang='kor', config=tessdata_dir_config)
    print(a)

It returns junk characters, as shown below; the strings do not match:

5 10



자동 줄이기


@ [ㆍ 오디오 음

Could anyone please tell me what the error is and how to fix it?
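One possible way to investigate the "Can't encode transcription" messages is to list which characters of a transcription are absent from a unicharset. The sketch below assumes the plain-text unicharset layout (first line is the entry count, each later line starts with the unichar followed by its properties, and the space character is stored as `NULL`). The sample data is hand-made with shortened property columns, and the helper names are illustrative. It checks single characters only, so multi-character entries are not handled.

```python
def parse_unicharset(text: str) -> set:
    """Return the set of unichars listed in a unicharset file's contents."""
    entries = set()
    for line in text.splitlines()[1:]:  # skip the leading entry count
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        entries.add(" " if token == "NULL" else token)
    return entries


def missing_chars(transcription: str, entries: set) -> list:
    """List non-space characters of the transcription not found in the set."""
    return sorted({ch for ch in transcription if not ch.isspace() and ch not in entries})


# Hand-made sample, NOT taken from a real kor.unicharset.
SAMPLE_UNICHARSET = """5
NULL 0 Common 0
전 1 Hangul 1
방 1 Hangul 2
다 1 Hangul 3
. 10 Common 4
"""

entries = parse_unicharset(SAMPLE_UNICHARSET)
print(missing_chars("전방 끕니다.", entries))
```

With real data, the text passed to `parse_unicharset` would be the unicharset unpacked from whichever traineddata file is given to lstmeval (for example with `combine_tessdata -u`), and the transcriptions would come from the training text.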

Operating System

Ubuntu 20.04 Focal

Other Operating System

No response

uname -a

No response


No response

Virtualization / Containers

No response


No response

Current Behavior

No response

Expected Behavior

No response

Suggested Fix

No response

Other Information

No response

Tax0787 commented 11 months ago