Open Bipin8469 opened 4 years ago
Try the following:
rm -rf output/*
lstmtraining \
--debug_interval -1 \
--continue_from eng.lstm \
--model_output output/times \
--traineddata tesseract/tessdata/eng.traineddata \
--train_listfile train/eng.training_files.txt \
--max_iterations 400
Make sure that you are using eng.traineddata from the tessdata_best repo.
Thanks for your response. I replaced my fine-tuning command with yours, but I am still not getting correct OCR output. Again, my question is: how can I use my tif/box files for training?
how can I use my tif/box files for training?
What do you mean by my tif/box file?
It seems to me that you are generating box/tif files using tesstrain.sh with fonts and training_text.
Sir, I took the correct text of the PDF file and placed it in langdata_lstm/eng/eng.training_text. Then I followed the steps above and ran OCR with the new traineddata, but I still did not get correct output. So I created box files for the text that was recognized incorrectly, and I want to use these box files in my training. Can you please help me with this?
OPTIONAL flag for specifying directory with user specified box/tiff pairs.
Files should be named similar to ${LANG_CODE}.${fontname}.exp${EXPOSURE}.box/tif
--my_boxtiff_dir MY_BOXTIFF_DIR # Location of user specified box/tiff files.
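Before pointing --my_boxtiff_dir at a folder, it may be worth checking that every .tif has a matching .box and that the names follow the pattern above. The function below is only a rough sketch: check_boxtiff_dir is a hypothetical helper name, not part of tesstrain.sh, and the name check is deliberately loose (it only looks for two dots followed by "exp" and a digit).

```shell
#!/bin/sh
# Hypothetical helper (not part of tesstrain.sh): report .tif files that
# lack a matching .box file or do not look like LANG.FONTNAME.expN.
check_boxtiff_dir() {
    dir=$1
    rc=0
    for tif in "$dir"/*.tif; do
        # If the glob matched nothing, the literal pattern is returned.
        [ -e "$tif" ] || { echo "no .tif files in $dir"; return 1; }
        base=${tif%.tif}      # path without the .tif extension
        name=${base##*/}      # file name without directory
        case $name in
            *.*.exp[0-9]*) ;; # loose match for LANG.FONTNAME.expN
            *) echo "bad name: $name.tif"; rc=1 ;;
        esac
        [ -f "$base.box" ] || { echo "missing box: $name.box"; rc=1; }
    done
    return $rc
}
```

Calling check_boxtiff_dir MY_BOXTIFF_DIR prints one line per problem and returns non-zero if anything looks wrong.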
2.1. Generate the .lstmf file:
tesseract eng.Times_New_Roman.exp0.tif eng.Times_New_Roman.exp0 -l eng --psm 6 lstm.train
2.2. Generate the .lstm file from the eng.traineddata.
combine_tessdata -e tesseract/tessdata/eng.traineddata eng.lstm
2.3. Create a text file named eng.training_files.txt. Its content is the path of the .lstmf file on your machine:
THE_PATH_OF_.lstmf_FILE/eng.Times_New_Roman.exp0.lstmf
2.4. Start training:
lstmtraining \
--model_output output/times \
--continue_from THE_PATH_OF_.lstm_FILE/eng.lstm \
--train_listfile THE_PATH_OF_.txt_FILE/eng.training_files.txt \
--traineddata tesseract/tessdata/eng.traineddata \
--debug_interval -1 \
--max_iterations 400
2.5. Generate the times.traineddata
lstmtraining \
--stop_training \
--continue_from THE_PATH_OF_CHECKPOINT_FILE/times_checkpoint \
--traineddata tesseract/tessdata/eng.traineddata \
--model_output output/times.traineddata
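For step 2.3, if you end up with several .lstmf files, writing the list file by hand gets tedious. Here is a small sketch that collects them automatically. write_training_list is a hypothetical helper name, and it assumes all .lstmf files sit in a single directory.

```shell
#!/bin/sh
# Hypothetical helper: write the absolute path of every .lstmf file found
# in a directory to a list file, one path per line.
write_training_list() {
    lstmf_dir=$1
    list_file=$2
    # Resolve the directory to an absolute path; fail if it does not exist.
    abs_dir=$(cd "$lstmf_dir" && pwd) || return 1
    : > "$list_file"          # create or truncate the list file
    for f in "$abs_dir"/*.lstmf; do
        [ -e "$f" ] || continue   # skip the literal pattern if no match
        printf '%s\n' "$f" >> "$list_file"
    done
    # Succeed only if at least one path was written.
    [ -s "$list_file" ]
}
```

For example, write_training_list train train/eng.training_files.txt would list every .lstmf file under train, and the resulting file can then be passed to --train_listfile.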
Hope this will help you!
Environment
Tesseract Version:
Platform:
Current Behavior:
tesstrain.sh --fonts_dir fonts \
--fontlist 'Times New Roman ' \
--lang eng \
--linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tesseract/tessdata \
--save_box_tiff \
--maxpages 10 \
--output_dir train
rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
--continue_from eng.lstm \
--model_output output/times \
--traineddata tesseract/tessdata/eng.traineddata \
--train_listfile train/eng.training_files.txt \
--max_iterations 100
lstmeval --model output/times_checkpoint \
--traineddata tesseract/tessdata/eng.traineddata \
--eval_listfile train/eng.training_files.txt
Now with the following command I used my traineddata to OCR a file.
I followed the steps above, but I am still not able to get accurate OCR output. I then created box files for the text that was recognized incorrectly, and now I want to know how I can use my files for training. Can somebody help me or tell me how to solve this problem? Thank you.