Open wosiu opened 6 years ago
If you want a digits-only model, you should try the ± (plus-minus) fine-tuning approach, starting from the English model in tessdata_best. See the wiki page regarding training.
I tried with eng.traineddata from tessdata_best. Same story: segmentation fault during lstmtraining. Command:
lstmtraining --model_output liczniki_model --traineddata ../tessdata/tessdata_best/tmp/eng.traineddata --old_traineddata ../tessdata/tessdata_best/eng.traineddata --train_listfile liczniki.training_files.txt --max_iterations 4000 --target_error_rate 0.1 --continue_from ../tessdata/tessdata_best/tmp/eng.lstm
Here tessdata/tessdata_best/tmp/eng.traineddata was combined with an edited eng.lstm-unicharset, so that only the digit entries are left.
============ CASE 2 - DOES NOT WORK ============== Before training, I edit digitsbest2.lstm-unicharset: I remove a few lines and update the counter at the beginning,
The unicharset, the lstmf files, and --train_listfile liczniki.training_files.txt - ALL of these need to be in sync. You are removing lines from the unicharset, but if the lstmf files contain those characters, it will NOT work.
Please follow the proper training procedure as mentioned in the wiki.
I am sure the lstmf files contain only characters which are in the new unicharset file. Moreover, when I use the unicharset generated during lstmtraining, I get the same result: a segfault.

Note that the training procedure described in the wiki for ± character fine-tuning assumes that the image/box pairs are generated from text using text2image, which is not my case. I have my own images with box files, which I feed into the process of generating the lstmf files.

By the way, maybe that is where the problem is? Should a traineddata file be produced while generating the lstmf files? Currently that step gives me only one lstmf file per image/box pair, plus one unicharset. What I do next is take the old traineddata, replace its unicharset with the new one, and pass that traineddata via the --old_traineddata flag.
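Since the crash points at an inconsistent unicharset, one cheap sanity check after hand-editing is that the count on the first line matches the number of entry lines that follow; lstmtraining can crash if they disagree. A minimal sketch (the entries below are a made-up digits-only example, not the real digitsbest unicharset format):

```shell
# Write a toy unicharset: first line is the declared entry count,
# then one line per unichar (fields simplified for illustration).
cat > /tmp/digits.unicharset <<'EOF'
11
NULL 0 Common 0
0 8 Common 1
1 8 Common 2
2 8 Common 3
3 8 Common 4
4 8 Common 5
5 8 Common 6
6 8 Common 7
7 8 Common 8
8 8 Common 9
9 8 Common 10
EOF

# Compare the declared count against the actual number of entry lines.
declared=$(head -n 1 /tmp/digits.unicharset)
actual=$(tail -n +2 /tmp/digits.unicharset | wc -l)
if [ "$declared" -eq "$actual" ]; then
    echo "count OK: $declared entries"
else
    echo "count MISMATCH: header says $declared, file has $actual entries"
fi
```

If you removed lines without updating the header (or vice versa), the mismatch branch fires, which is exactly the kind of edit that can make the trainer read past the end of the table.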
Should a traineddata file be produced while generating the lstmf files? Currently that step gives me only lstmf files for each image/box pair and one unicharset.
The LSTM training process now requires a starter traineddata.
The tesstrain.sh process creates it after the lstmf files are created, by using combine_lang_model.
If you want to follow a custom path for training, you should make sure your process has all the required steps (check tesstrain.sh and related batch files).
Currently tesseract does not support training from your own box/tif pairs.
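The combine_lang_model step that tesstrain.sh runs to build the starter traineddata looks roughly like this (the paths, script directory, and language name below are illustrative assumptions, not values from this thread):

```shell
# Build a starter traineddata from a unicharset, as tesstrain.sh does.
# langdata/ must contain the script unicharsets (e.g. Latin.unicharset);
# all paths and the language name "digits" are assumptions.
combine_lang_model \
  --input_unicharset digits/digits.unicharset \
  --script_dir langdata \
  --output_dir output \
  --lang digits
```

This produces output/digits/digits.traineddata, which can then be passed to lstmtraining via --traineddata.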
Tesseract version: current master (c9169e5a), also tested on ebbfc3ae8df85c. Mode: LSTM.
TL;DR: It seems something goes wrong during lstmtraining if any line is removed from the original unicharset.
I'm trying to fine-tune the digitsbest.traineddata model. I use the following flow of commands:
============ CASE 1 - it works ==============
For extracting digitsbest.lstm:
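(The original commands did not survive the copy. Extracting components from a best-model traineddata is normally done with combine_tessdata -e; the filenames below are assumed from the rest of this post:)

```shell
# Extract the LSTM model (and its unicharset) from the traineddata archive.
# combine_tessdata picks the component from the filename suffix.
combine_tessdata -e digitsbest.traineddata digitsbest.lstm
combine_tessdata -e digitsbest.traineddata digitsbest.lstm-unicharset
```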
For training:
All good.
============ CASE 2 - DOES NOT WORK ============== Before training, I edit digitsbest2.lstm-unicharset: I remove a few lines and update the counter at the beginning, so FROM:
TO:
Then I combine back to traineddata using:
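(The exact command was lost in the copy. Replacing a single component inside an existing traineddata is typically done with combine_tessdata -o; the filenames below are assumed from context:)

```shell
# Overwrite the lstm-unicharset component of the traineddata in place;
# -o replaces whichever components match the given filename suffixes.
combine_tessdata -o digitsbest2.traineddata digitsbest2.lstm-unicharset
```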
Then I run lstmtraining similarly to CASE 1 but with new traineddata file:
And get:
"Naruszenie ochrony pamięci (zrzut pamięci)" - means "Segmentation fault (core dumped)"