tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Training Tesseract for Persian (fas) language #315

Closed · m-kafiyan closed this issue 1 year ago

m-kafiyan commented 1 year ago

I am training Tesseract 5.2.0 from scratch with about 40,000 samples and 3 new fonts for the Farsi language. Although every step seems correct, I get a high error rate: BCER only drops from 99.76 to 91.93 after 10,000 iterations.

First question: since I didn't get high accuracy, I decided to fine-tune the fas model by setting the START_MODEL variable. But when I checked lstmtraining --help, I found the --continue_from option, and now I am confused about which of the two I should use for fine-tuning.
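If I read the generated command right (see the log below), START_MODEL is just the Makefile-level interface and ends up as --continue_from internally. Here is a simplified sketch of what I assume tesstrain does, with paths taken from my setup; I have not verified this against the Makefile source:

```shell
# My assumption of what tesstrain does when START_MODEL=fas is set,
# inferred from the lstmtraining call in the log below.

# 1. Extract the LSTM weights from the existing fas model:
combine_tessdata -e /usr/share/tesseract-ocr/5/tessdata/fas.traineddata \
    ../data/fas/dori.lstm

# 2. Fine-tune starting from those weights, i.e. START_MODEL becomes --continue_from:
lstmtraining \
  --continue_from ../data/fas/dori.lstm \
  --old_traineddata /usr/share/tesseract-ocr/5/tessdata/fas.traineddata \
  --traineddata ../data/dori/dori.traineddata \
  --model_output ../data/dori/checkpoints/dori \
  --train_listfile ../data/dori/list.train \
  --eval_listfile ../data/dori/list.eval
```

So my confusion is whether calling lstmtraining --continue_from directly would behave any differently from going through START_MODEL.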

Second question: what is the reason behind the poor CER I got?

```shell
!export OMP_THREAD_LIMIT=16

!make training \
    START_MODEL=fas \
    MODEL_NAME=dori \
    LANG_TYPE=RTL \
    LANG_CODE=fas \
    TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
    DATA_DIR=../data \
    MAX_ITERATIONS=10000
```
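(As a sanity check, the full command that these variables expand to can be previewed without actually training, using GNU make's dry-run flag; a sketch with the same variables as above:)

```shell
# Print the commands make would run, without executing them (GNU make -n)
make -n training \
    START_MODEL=fas \
    MODEL_NAME=dori \
    LANG_TYPE=RTL \
    LANG_CODE=fas \
    TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
    DATA_DIR=../data \
    MAX_ITERATIONS=10000
```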

Here is the log:

```
lstmtraining \
  --debug_interval 0 \
  --traineddata ../data/dori/dori.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/5/tessdata/fas.traineddata \
  --continue_from ../data/fas/dori.lstm \
  --learning_rate 0.0001 \
  --model_output ../data/dori/checkpoints/dori \
  --train_listfile ../data/dori/list.train \
  --eval_listfile ../data/dori/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Loaded file ../data/dori/checkpoints/dori_checkpoint, unpacking...
Successfully restored trainer from ../data/dori/checkpoints/dori_checkpoint
2 Percent improvement time=2100, best error was 100 @ 0
At iteration 2100/2100/2100, Mean rms=10.389000%, delta=55.606000%, BCER train=99.275000%, BWER train=99.974000%, skip ratio=0.000000%,  New best BCER = 99.275000 wrote checkpoint.

2 Percent improvement time=2200, best error was 100 @ 0
At iteration 2200/2200/2200, Mean rms=10.375000%, delta=55.467000%, BCER train=99.110000%, BWER train=99.974000%, skip ratio=0.000000%,  New best BCER = 99.110000 wrote checkpoint.

2 Percent improvement time=2300, best error was 100 @ 0
At iteration 2300/2300/2300, Mean rms=10.367000%, delta=55.260000%, BCER train=98.969000%, BWER train=99.981000%, skip ratio=0.000000%,  New best BCER = 98.969000 wrote checkpoint.

2 Percent improvement time=2400, best error was 100 @ 0
At iteration 2400/2400/2400, Mean rms=10.327000%, delta=54.593000%, BCER train=98.839000%, BWER train=99.963000%, skip ratio=0.000000%,  New best BCER = 98.839000 wrote checkpoint.

2 Percent improvement time=2500, best error was 100 @ 0
At iteration 2500/2500/2500, Mean rms=10.334000%, delta=54.687000%, BCER train=98.810000%, BWER train=99.963000%, skip ratio=0.000000%,  New best BCER = 98.810000 wrote checkpoint.

2 Percent improvement time=2600, best error was 100 @ 0
At iteration 2600/2600/2600, Mean rms=10.316000%, delta=54.450000%, BCER train=98.760000%, BWER train=99.963000%, skip ratio=0.000000%,  New best BCER = 98.760000 wrote checkpoint.
 .
 .
 .

2 Percent improvement time=1300, best error was 95.008 @ 4900
At iteration 6200/6200/6200, Mean rms=8.224000%, delta=28.314000%, BCER train=92.917000%, BWER train=99.782000%, skip ratio=0.000000%,  New best BCER = 92.917000 wrote checkpoint.

2 Percent improvement time=1400, best error was 95.008 @ 4900
At iteration 6300/6300/6300, Mean rms=8.084000%, delta=27.189000%, BCER train=92.844000%, BWER train=99.792000%, skip ratio=0.000000%,  New best BCER = 92.844000 wrote checkpoint.

2 Percent improvement time=1300, best error was 94.822 @ 5100
At iteration 6400/6400/6400, Mean rms=7.941000%, delta=25.781000%, BCER train=92.732000%, BWER train=99.764000%, skip ratio=0.000000%,  New best BCER = 92.732000 wrote checkpoint.

2 Percent improvement time=1300, best error was 94.706 @ 5200
At iteration 6500/6500/6500, Mean rms=7.678000%, delta=23.634000%, BCER train=92.572000%, BWER train=99.788000%, skip ratio=0.000000%,  New best BCER = 92.572000 wrote checkpoint.

2 Percent improvement time=1400, best error was 94.706 @ 5200
At iteration 6600/6600/6600, Mean rms=7.737000%, delta=23.973000%, BCER train=92.541000%, BWER train=99.764000%, skip ratio=0.000000%,  New best BCER = 92.541000 wrote checkpoint.

2 Percent improvement time=1200, best error was 94.279 @ 5500
At iteration 6700/6700/6700, Mean rms=7.427000%, delta=21.285000%, BCER train=92.173000%, BWER train=99.768000%, skip ratio=0.000000%,  New best BCER = 92.173000 wrote checkpoint.

2 Percent improvement time=1300, best error was 94.279 @ 5500
At iteration 6800/6800/6800, Mean rms=7.496000%, delta=21.825000%, BCER train=92.031000%, BWER train=99.734000%, skip ratio=0.000000%,  New best BCER = 92.031000 wrote checkpoint.

At iteration 7700/7700/7700, Mean rms=7.165000%, delta=18.840000%, BCER train=92.014000%, BWER train=99.716000%, skip ratio=0.000000%,  New worst BCER = 92.014000 wrote checkpoint.

At iteration 7800/7800/7800, Mean rms=7.118000%, delta=18.520000%, BCER train=92.234000%, BWER train=99.723000%, skip ratio=0.000000%,  New worst BCER = 92.234000 wrote checkpoint.

At iteration 7900/7900/7900, Mean rms=7.251000%, delta=19.487000%, BCER train=92.045000%, BWER train=99.722000%, skip ratio=0.000000%,  New worst BCER = 92.045000 wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8000/8000/8000, Mean rms=7.093000%, delta=18.310000%, BCER train=92.423000%, BWER train=99.640000%, skip ratio=0.000000%,  New worst BCER = 92.423000At iteration 3900, stage 0, BCER eval=96.652818, BWER eval=100.000000 wrote checkpoint.

At iteration 8100/8100/8100, Mean rms=6.990000%, delta=17.517000%, BCER train=92.264000%, BWER train=99.640000%, skip ratio=0.000000%,  wrote checkpoint.

Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 8200/8200/8200, Mean rms=6.898000%, delta=16.840000%, BCER train=92.487000%, BWER train=99.535000%, skip ratio=0.000000%,  New worst BCER = 92.487000At iteration 7000, stage 0, BCER eval=96.652818, BWER eval=100.000000 wrote checkpoint.

At iteration 8300/8300/8300, Mean rms=6.837000%, delta=16.331000%, BCER train=92.338000%, BWER train=99.526000%, skip ratio=0.000000%,  wrote checkpoint.
.
.
.
At iteration 9900/9900/9900, Mean rms=6.816000%, delta=16.097000%, BCER train=91.903000%, BWER train=99.468000%, skip ratio=0.000000%,  wrote checkpoint.

At iteration 10000/10000/10000, Mean rms=6.888000%, delta=16.671000%, BCER train=91.934000%, BWER train=99.423000%, skip ratio=0.000000%,  wrote checkpoint.

Finished! Selected model with minimal training error rate (BCER) = 91.691
lstmtraining \
  --stop_training \
```
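(For completeness: the finishing step that converts the best checkpoint into a usable .traineddata — my command was cut off above, but it follows the standard pattern; paths assumed from the variables earlier:)

```shell
# Convert the best checkpoint into a final traineddata file.
# Paths are assumed from the Makefile variables above (a sketch,
# not a verbatim copy of the truncated command in the log).
lstmtraining \
  --stop_training \
  --continue_from ../data/dori/checkpoints/dori_checkpoint \
  --traineddata ../data/dori/dori.traineddata \
  --model_output ../data/dori/dori.traineddata
```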

Third question: I want to know more about the characteristics of the data that Tesseract itself was trained on. Is there any difference between the training data used for Tesseract 5 and Tesseract 4? Is it just single text lines? Does it contain noise? Is there any connection or dependency between the words within each line?
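(For comparison, this is the ground-truth layout I am using for my own data, following the tesstrain convention of one line image plus one transcription per sample; the file names here are only illustrative:)

```shell
# tesstrain ground-truth layout (file names are examples only):
# each training sample is a single-line image with a matching .gt.txt file.
ls ../data/dori-ground-truth/
# line_0001.tif      -> image of one text line
# line_0001.gt.txt   -> its UTF-8 transcription
# line_0002.tif
# line_0002.gt.txt
```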

Fourth question: after searching, I found that the default batch size is 1. Does this mean that Tesseract 5 was trained with batch size 1? How can I change it? Note: I have already read this link providing information about the data.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.