tesseract-ocr / tesstrain

Train Tesseract LSTM with make

Question: Training from Startmodel - output layer? #246

Closed M3ssman closed 3 years ago

M3ssman commented 3 years ago

Hello there,

Currently I'm wondering about the relation between the original netspec and the resulting network when using a start model with the Makefile approach. AFAIU, when using an existing model as the starting point, we cut off the output layer and continue training. But what exactly are we training? Only the last output layer?

Further, how can I affect the char categories of the output layer? Given that my start model knows 120 chars (which I can deduce from its contained unicharset) but I have 300 chars in my training data, is this somehow recognized automatically?

wrznr commented 3 years ago

Hi @M3ssman, with respect to your second question: This is indeed handled automatically. You should see a line like

Code range changed from 302 to 305!

in the log.

Concerning your first question I have to admit that I do not know. Maybe @bertsky can help?

bertsky commented 3 years ago

AFAIU, when using an existing model as the starting point, we cut off the output layer and continue training.

That's not true IIUC.

With START_MODEL, tesstrain will run lstmtraining in the finetuning regime, which does not change the network topology at all; it just adapts the weights, and does so very slowly (with a smaller learning rate).

You can do what you describe with lstmtraining; that is the cutoff regime. But tesstrain does not directly support that yet (because there's no combination of --old_traineddata --continue_from with --net_spec, and especially no --append_index).
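For reference, outside of tesstrain the cutoff regime looks roughly like this with plain lstmtraining (a minimal sketch following the upstream training docs, not a command from this thread; the paths, the append index, the VGSL expression and the learning rate are illustrative, and data/new/new.traineddata is assumed to be a starter traineddata already built from the new unicharset):

combine_tessdata -e data/start/start.traineddata data/start/start.lstm
lstmtraining \
  --continue_from data/start/start.lstm \
  --old_traineddata data/start/start.traineddata \
  --traineddata data/new/new.traineddata \
  --append_index 5 \
  --net_spec '[Lfx192 O1c105]' \
  --model_output data/new/checkpoints/new \
  --train_listfile data/new/list.train \
  --learning_rate 0.002 \
  --max_iterations 3000

IIUC, --append_index cuts off all layers above the given index and appends the layers from --net_spec instead; the class count in O1c... has to match the size of the new unicharset.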

But what exactly are we training? Only the last output layer?

In the finetuning regime, all layers; in the cutoff regime, only the appended layers (whatever you choose them to be; you need to come up with a useful VGSL expression).

Further, how can I affect the char categories of the output layer? Given that my start model knows 120 chars (which I can deduce from its contained unicharset) but I have 300 chars in my training data, is this somehow recognized automatically?

As @wrznr pointed out, this is (thankfully) taken care of by tesstrain completely. It analyses all GT texts, extracts their character set, and unifies that with the unicharset of START_MODEL. (If unexpected strings appear, they will be warned about loudly with Can't encode transcription: ... in language ... during training.)
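Under the hood this boils down to roughly the following steps (a sketch with illustrative paths; the actual Makefile recipes differ in detail, and the START_MODEL unicharset is first extracted from its traineddata):

unicharset_extractor --output_unicharset data/foo/my.unicharset --norm_mode 2 data/foo-ground-truth/*.gt.txt
merge_unicharsets data/start/start.lstm-unicharset data/foo/my.unicharset data/foo/unicharset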

However, I think the current default --norm_mode 2 is wrong for many Western scripts (including historical variants). IIUC it encodes combining characters as an extra symbol, not in combination with their base characters, i.e. not as glyphs. But that's another issue.

bertsky commented 3 years ago

However, I think the current default --norm_mode 2 is wrong for many Western scripts (including historical variants). IIUC it encodes combining characters as an extra symbol, not in combination with their base characters, i.e. not as glyphs. But that's another issue.

Cf. #254

You can do what you describe with lstmtraining; that is the cutoff regime. But tesstrain does not directly support that yet (because there's no combination of --old_traineddata --continue_from with --net_spec, and especially no --append_index).

Cf. #255

M3ssman commented 3 years ago

@wrznr The Code range changed from ... message is printed, but I'm not convinced that this is really respected.

I'm trying to do some fine-tuning on the existing UB-Mannheim gt4hist model (which is stored as gt4hist_5000k) with a special, brand-new historical newspaper data set (4,000+) that only includes German Fraktur letters, punctuation and digits and nothing else (not even Antiqua/Art Deco fonts), using the tesstrain Makefile workflow.
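(For context, the corresponding make call would look roughly like this; a hedged reconstruction from the log below, not the literal command used, and further variables such as DATA_DIR are omitted:)

make training \
  MODEL_NAME=frk_ulbzd1 \
  START_MODEL=gt4hist_5000k \
  TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata \
  LEARNING_RATE=0.0001 \
  MAX_ITERATIONS=20000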

The log says:

Wrote unicharset file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/my.unicharset
merge_unicharsets /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm-unicharset /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/my.unicharset  "/home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/unicharset"
Loaded unicharset of size 300 from file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm-unicharset
Loaded unicharset of size 106 from file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/my.unicharset
Wrote unicharset file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/unicharset.

...

Loaded unicharset of size 301 from file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/unicharset

...

lstmtraining \
  --debug_interval 0 \
  --traineddata /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/frk_ulbzd1.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/gt4hist_5000k.traineddata \
  --continue_from /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm \
  --learning_rate 0.0001 \
  --model_output /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/checkpoints/frk_ulbzd1 \
  --train_listfile /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/list.train \
  --eval_listfile /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/list.eval \
  --max_iterations 20000 \
  --target_error_rate 0.01 \
  --max_image_MB 12000
Loaded file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 299 to 300!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx384:384, 738816
  Fc300:300, 115500

The final output layer is slightly adapted, but it still maps about 300 chars and not only the 106 chars from the new training set. Is there something wrong or missing in my approach?

I'm using lstmtraining from Tesseract 4.1.1.

bertsky commented 3 years ago

@M3ssman if your goal is to throw out the characters that are not in your finetuning dataset, then you need to change how the new unicharset is built. tesstrain always merges my.unicharset with the START_MODEL's unicharset, but you could simply replace the resulting merged unicharset with my.unicharset IINM (just before make training).
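A minimal sketch of that, assuming MODEL_NAME=frk_ulbzd1, START_MODEL=gt4hist_5000k and the data/ layout from your log (the unicharset target and the variable names are the usual tesstrain ones; other variables like TESSDATA are omitted):

make unicharset MODEL_NAME=frk_ulbzd1 START_MODEL=gt4hist_5000k
cp data/frk_ulbzd1/my.unicharset data/frk_ulbzd1/unicharset
make training MODEL_NAME=frk_ulbzd1 START_MODEL=gt4hist_5000k

The merged file keeps its path, so the later targets pick it up, but it now contains only the characters actually present in your GT.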

M3ssman commented 3 years ago

@bertsky Yes, thanks, that's what I was looking for. Just do not combine the charsets, and then it behaves like @wrznr pointed out (Code range changed from 299 to 107!). Now I can do some experiments with fine-tuning on specific charsets!