tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Have to train tesseract 4 with my box files #2858

Open Bipin8469 opened 4 years ago

Bipin8469 commented 4 years ago

Environment

Tesseract Version:

 tesseract 5.0.0-alpha-554-g9ed3
 leptonica-1.75.3
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0

 Found AVX2
 Found AVX
 Found FMA
 Found SSE

Platform:

Ubuntu 16.04

Current Behavior:

folder structure :
    main_folder : test
           *note: copy eng.lstm from the following path   (main_folder/tesseract/tessdata/eng.lstm) and paste in the main_folder.
    sub_folders : fonts
                    |-> times-new-roman.ttf
              langdata_lstm (taken from official tesseract )
              output (empty file)
              tesseract-(taken from official tesseract )
              train (empty file)
files :
    generate_training_data.sh
    finetune.sh
    combine.sh
    eval.sh
    extract_lstm.sh

command_1 : sh generate_training_data.sh
    rm -rf train/*

tesstrain.sh --fonts_dir fonts \
--fontlist 'Times New Roman ' \
--lang eng \
--linedata_only \
--langdata_dir langdata_lstm \
--tessdata_dir tesseract/tessdata \
--save_box_tiff \
--maxpages 10 \
--output_dir train

output : Created starter traineddata for LSTM training of language 'eng'.

files i got : main_folder -> "train" folder

    "eng" folder : eng.charset_size=110.txt
                     eng.traineddata
                     eng.unicharset
    files : eng.Times_New_Roman.exp0.box
        eng.Times_New_Roman.exp0.lstmf
        eng.Times_New_Roman.exp0.tif
        eng.training_files.txt

command_2 : sh finetune.sh

rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
--continue_from eng.lstm \
--model_output output/times \
--traineddata tesseract/tessdata/eng.traineddata \
--train_listfile train/eng.training_files.txt \
--max_iterations 100

files i got : wrote best model:output/times_0.249_9_100.checkpoint wrote checkpoint.

    "output" folder: times_checkpoint
                       times_0.249_9_100.checkpoint

command_3 :  sh combine.sh 
    lstmtraining --stop_training \
    --continue_from output/times_checkpoint \
    --traineddata tesseract/tessdata/eng.traineddata \
    --model_output output/times.traineddata

outcome:-Loaded file output/times_checkpoint, unpacking...

command_4 : sh eval.sh                    

lstmeval --model output/times_checkpoint \
--traineddata tesseract/tessdata/eng.traineddata \
--eval_listfile train/eng.training_files.txt

files i got :   
       "output" folder : times.traineddata

Now with the following command i used my traineddata to ocr a file.

command : tesseract tif_file  output_filename  -l traineddata_name
output:

I followed the steps above, but I am still not getting accurate OCR. I then created box files for the text that was recognized incorrectly. Now I want to know how I can use my box files for training. Can somebody help me or tell me how to solve my problem? Thank you.

Shreeshrii commented 4 years ago

Try the following:

rm -rf output/*
lstmtraining \
--debug_interval -1 \
--continue_from eng.lstm \
--model_output output/times \
--traineddata tesseract/tessdata/eng.traineddata \
--train_listfile train/eng.training_files.txt \
--max_iterations 400

Shreeshrii commented 4 years ago

Make sure that you are using eng.traineddata from tessdata_best repo.

Bipin8469 commented 4 years ago

Thanks for your response. I replaced my finetune command with yours, but I am still not getting correct OCR. Again, my question is: how can I use my tif/box files for training?

Shreeshrii commented 4 years ago

how can i use my tif\box file for training.

What do you mean by my tif/box file?

It seems to me that you are generating box/tif files using tesstrain.sh with fonts and training_text.

Bipin8469 commented 4 years ago

Sir, I took the correct text of the PDF file and placed it in langdata_lstm/eng/eng.training_text. Then I followed the steps above and ran OCR with the new traineddata, but I still did not get correct output. So I created a box file for the text that was recognized incorrectly, and I want to use this box file in my training. Can you please help me with this?

livezingy commented 4 years ago
  1. Maybe you could try the "my_boxtiff_dir" parameter of tesstrain.sh.
    OPTIONAL flag for specifying directory with user specified box/tiff pairs.
    Files should be named similar to ${LANG_CODE}.${fontname}.exp${EXPOSURE}.box/tif
    --my_boxtiff_dir MY_BOXTIFF_DIR # Location of user specified box/tiff files.
  2. If you have both box files and the corresponding tif files, you can train with your box files by running the following commands step by step:

2.1. Generate the .lstmf file:

tesseract eng.Times_New_Roman.exp0.tif eng.Times_New_Roman.exp0 -l eng --psm 6 lstm.train
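
If you have several box/tif pairs, step 2.1 has to be repeated for each of them. Here is a minimal sketch, assuming all pairs sit in one directory and the file names contain no spaces. The function name print_lstmf_commands and its two arguments are made up for illustration. It only prints the commands, so you can review them before running anything.

```shell
# Print one "tesseract ... lstm.train" command (same form as step 2.1)
# for every .tif in a directory that has a matching .box file.
# print_lstmf_commands, DIR and LANG are illustrative names, not part of tesseract.
# Usage: print_lstmf_commands DIR [LANG]
print_lstmf_commands() (
    dir=$1
    lang=${2:-eng}
    for tif in "$dir"/*.tif; do
        [ -e "$tif" ] || continue        # the glob matched nothing
        base=${tif%.tif}                 # strip the .tif extension
        if [ -f "$base.box" ]; then
            printf 'tesseract %s %s -l %s --psm 6 lstm.train\n' "$tif" "$base" "$lang"
        else
            # A .tif without a .box cannot be used for training; report it.
            printf 'skipping %s: no matching .box file\n' "$tif" >&2
        fi
    done
)
```

Once the printed commands look right, they can be executed with something like print_lstmf_commands train eng | sh.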

2.2. Generate the .lstm file from the eng.traineddata.

combine_tessdata -e tesseract/tessdata/eng.traineddata eng.lstm

2.3. Create a text file named eng.training_files.txt. Its content is the path of each .lstmf file on your machine, one path per line.

THE_PATH_OF_.lstmf_FILE/eng.Times_New_Roman.exp0.lstmf
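
The list file from step 2.3 can also be generated instead of typed by hand. This is only a sketch: write_training_list is a made-up helper name, and it assumes all .lstmf files are in a single directory. Paths are written exactly as the directory is given, so pass an absolute directory if you run lstmtraining from somewhere else.

```shell
# Write the path of every .lstmf file in DIR to LISTFILE, one path per line.
# write_training_list is an illustrative helper, not a tesseract tool.
# Usage: write_training_list DIR LISTFILE
write_training_list() (
    dir=$1
    out=$2
    : > "$out"                           # create or empty the list file
    for f in "$dir"/*.lstmf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        printf '%s\n' "$f" >> "$out"
    done
)
```

For example, write_training_list train train/eng.training_files.txt would produce the file that step 2.4 passes to --train_listfile.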

2.4. Start training:

lstmtraining \
--model_output output/times \
--continue_from THE_PATH_OF_.lstm_FILE/eng.lstm \
--train_listfile THE_PATH_OF_.txt_FILE/eng.training_files.txt \
--traineddata tesseract/tessdata/eng.traineddata \
--debug_interval -1 \
--max_iterations 400

2.5. Generate the times.traineddata

lstmtraining \
--stop_training \
--continue_from THE_PATH_OF_CHECKPOINT_FILE/times_checkpoint \
--traineddata tesseract/tessdata/eng.traineddata \
--model_output output/times.traineddata

Hope this will help you!