tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Trouble using lstmtraining for fine-tuning #2468

Open Bunny22222 opened 5 years ago

Bunny22222 commented 5 years ago

When I run

training/lstmtraining --debug_interval 100 \
  --traineddata /home/app/tesseract/tesstutorial/chi_simtrain/chi_sim/chi_sim.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output /home/app/tesseract/tesstutorial --learning_rate 20e-4 \
  --train_listfile /home/app/tesseract/tesstutorial/chi_simtrain/chi_sim.training_files.txt \
  --eval_listfile /home/app/tesseract/tesstutorial/chi_simeval/chi_sim.training_files.txt \
  --max_iterations 5000 &>/home/app/tesseract/tesstutorial/basetrain.log

the basetrain.log file shows this error message: Can't encode transcription: '棠会泞 诫蝣腹 伛铼虢 变绯甚 黛绑茔 凇粑嗉 洳钓廨 勃荩掰 崾丹钠 拽古仙 敬崛蒉 宠广牦 殂楦种 耱鲆憧 媛嵌陵 莴横贴' in language ' '

When I run

training/combine_tessdata -e ../tessdata/chi_sim.traineddata ../tesstutorial/chi_simeval/chi_sim.
Extracting tessdata components from ../tessdata/chi_sim.traineddata

I get this error message: tesseract::TessdataManager::TessdataTypeFromFileName(filename, &type):Error:Assert failed:in file tessdatamanager.cpp, line 298

Shreeshrii commented 5 years ago

training/combine_tessdata -e ../tessdata/chi_sim.traineddata ../tesstutorial/chi_simeval/chi_sim.

This command is incomplete: the output path ends in a dot with no component name after it.

To extract a single component, the output file name must end with that component's name (for example, .lstm). The following works:

combine_tessdata -e ../tessdata_best/chi_sim.traineddata ../tesstutorial/chi_sim.lstm

Extracting tessdata components from ../tessdata_best/chi_sim.traineddata
Wrote ../tesstutorial/chi_sim.lstm
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
17:lstm:size=12152851, offset=2158
18:lstm-punc-dawg:size=282, offset=12155009
19:lstm-word-dawg:size=590634, offset=12155291
20:lstm-number-dawg:size=82, offset=12745925
21:lstm-unicharset:size=258834, offset=12746007
22:lstm-recoder:size=72494, offset=13004841
23:version:size=84, offset=13077335
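
The naming rule above can be checked before calling combine_tessdata. The following is a minimal sketch with hypothetical helper functions that are not part of Tesseract: component_suffix prints whatever follows the last dot in the output file name, and has_known_suffix compares it against the component names that appear in the listing above. That list is taken from this output only and is probably not exhaustive.

```shell
# Hypothetical helpers, not part of Tesseract.

# Print the text after the last '.' in the file name (empty if there is none).
component_suffix() {
  name=${1##*/}                 # drop any leading directories
  case "$name" in
    *.*) printf '%s\n' "${name##*.}" ;;
    *)   printf '\n' ;;
  esac
}

# Succeed only if the suffix is one of the component names seen in the
# listing above. The list is illustrative, not exhaustive.
has_known_suffix() {
  case "$(component_suffix "$1")" in
    config|lstm|lstm-punc-dawg|lstm-word-dawg|lstm-number-dawg|lstm-unicharset|lstm-recoder|version)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

With these helpers, has_known_suffix ../tesstutorial/chi_simeval/chi_sim. fails because the suffix is empty, while has_known_suffix ../tesstutorial/chi_sim.lstm succeeds.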

To unpack the whole traineddata file, use -u with an output prefix ending in a dot; no component name is required.

combine_tessdata -u ../tessdata_best/chi_sim.traineddata ../tesstutorial/chi_sim.

Extracting tessdata components from ../tessdata_best/chi_sim.traineddata
Wrote ../tesstutorial/chi_sim.config
Wrote ../tesstutorial/chi_sim.lstm
Wrote ../tesstutorial/chi_sim.lstm-punc-dawg
Wrote ../tesstutorial/chi_sim.lstm-word-dawg
Wrote ../tesstutorial/chi_sim.lstm-number-dawg
Wrote ../tesstutorial/chi_sim.lstm-unicharset
Wrote ../tesstutorial/chi_sim.lstm-recoder
Wrote ../tesstutorial/chi_sim.version
Version string:4.00.00alpha:chi_sim:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=1966, offset=192
17:lstm:size=12152851, offset=2158
18:lstm-punc-dawg:size=282, offset=12155009
19:lstm-word-dawg:size=590634, offset=12155291
20:lstm-number-dawg:size=82, offset=12745925
21:lstm-unicharset:size=258834, offset=12746007
22:lstm-recoder:size=72494, offset=13004841
23:version:size=84, offset=13077335
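
The -u output above suggests that each component is written to the given prefix with the component name appended, which would explain why the prefix ends in a dot. A minimal sketch of that naming follows, using a hypothetical helper (not a Tesseract tool) and only the component names shown in this listing:

```shell
# Hypothetical helper: print the file names that unpacking with the given
# prefix would be expected to produce, based on the listing above.
expected_unpacked_files() {
  prefix=$1
  for component in config lstm lstm-punc-dawg lstm-word-dawg \
                   lstm-number-dawg lstm-unicharset lstm-recoder version; do
    printf '%s%s\n' "$prefix" "$component"
  done
}
```

Calling expected_unpacked_files ../tesstutorial/chi_sim. prints the same eight paths as the "Wrote" lines above, which makes it easy to confirm that an unpack produced every file.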

Shreeshrii commented 5 years ago

The error message: tesseract::TessdataManager::TessdataTypeFromFileName(filename, &type):Error:Assert failed:in file tessdatamanager.cpp, line 298

@stweil Regarding your suggestion about more meaningful logging and error codes, this Assert could be changed to a more descriptive error message.