Closed: M3ssman closed this issue 3 years ago.
Hi @M3ssman, with respect to your second question: This is indeed handled automatically. You should see a line like
Code range changed from 302 to 305!
in the log.
Concerning your first question I have to admit that I do not know. Maybe @bertsky can help?
AFAIU, when using an existing model as start point, we cut off the output layer and train on.
That's not true IIUC.
With START_MODEL, tesstrain will run lstmtraining in the finetuning regime, which does not change the network topology at all; it just adapts the weights, and very slowly so (with a smaller learning rate).
You can do what you describe with lstmtraining, it's the cutoff regime. But tesstrain does not directly support that yet (because there's no combination of --old_traineddata --continue_from with --net_spec, and especially no --append_index).
But what exactly are we training? Only the last output layer?
In the finetuning regime, all layers, in the cutoff regime, the appended layers only (whatever you choose them to be; you need to come up with a useful VGSL expression).
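For illustration only, a cutoff-regime run with raw lstmtraining might look roughly like the following. All paths, the append index, and the VGSL spec are placeholder assumptions, not a tested recipe; check Tesseract's training documentation for the exact flag semantics in your version:

```shell
# Hypothetical paths and values -- adjust everything to your own setup.
# --append_index cuts the network above the given layer index;
# --net_spec defines the replacement layers in VGSL notation.
lstmtraining \
  --continue_from data/start/start.lstm \
  --old_traineddata data/start/start.traineddata \
  --traineddata data/new/new.traineddata \
  --append_index 5 \
  --net_spec '[Lfx192 O1c1]' \
  --model_output data/new/checkpoints/new \
  --train_listfile data/new/list.train
```

This is exactly the combination of flags tesstrain's Makefile does not currently wire together, which is why the cutoff regime requires a manual invocation.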
Further, how can I affect the char categories of the output layer? Given my start model knows 120 chars (which I can deduce from its contained unicharset), but I have 300 chars in my training data. Is this somehow automatically recognized?
As @wrznr pointed out, this is (thankfully) taken care of by tesstrain completely. It analyses all GT texts, extracts their character set, and unifies that with the unicharset of START_MODEL. (If unexpected strings appear, they will be warned about loudly with Can't encode transcription: ... in language ... during training.)
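As a rough sketch of that first step, collecting the distinct characters of the GT texts can be approximated as below. Note that tesstrain actually uses unicharset_extractor; the file names here are made up, and a real unicharset also carries per-character properties, so this only illustrates where the GT-derived character inventory comes from:

```shell
# Made-up GT files standing in for data/MODEL_NAME-ground-truth/*.gt.txt
dir=$(mktemp -d)
printf 'Beispiel\n' > "$dir/0001.gt.txt"
printf 'Zeitung\n'  > "$dir/0002.gt.txt"
# One character per line, deduplicated -- a crude stand-in for unicharset_extractor
cat "$dir"/*.gt.txt | fold -w1 | LC_ALL=C sort -u | tr -d '\n'
# -> BZegilnpstu
```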
However, I think the current default --norm_mode 2 is wrong for many Western scripts (including historical variants). IIUC it encodes combining characters as an extra symbol, not in combination with the base characters, i.e. not as glyphs. But that's another issue.
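To see the distinction at the byte level, compare the precomposed and decomposed encodings of 'ä' (a plain UTF-8 illustration, unrelated to any tesstrain tooling):

```shell
# U+00E4 (precomposed) vs U+0061 + U+0308 (base letter + combining diaeresis)
printf '\xc3\xa4'  | od -An -tx1   # bytes: c3 a4
printf 'a\xcc\x88' | od -An -tx1   # bytes: 61 cc 88
```

With --norm_mode 2, IIUC, the combining U+0308 would become its own output symbol instead of being fused with the base character into one glyph.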
However, I think the current default --norm_mode 2 is wrong for many Western scripts (including historical variants). IIUC it encodes combining characters as an extra symbol, not in combination with the base characters, i.e. not as glyphs. But that's another issue.
Cf. #254
You can do what you describe with lstmtraining, it's the cutoff regime. But tesstrain does not directly support that yet (because there's no combination of --old_traineddata --continue_from with --net_spec, and especially no --append_index).
Cf. #255
@wrznr The Code range changed from ... message is printed, but I'm not convinced that this is really respected.
I am trying to do some fine-tuning on the existing UB-Mannheim gt4hist model (stored as gt4hist_5000k) with a special, brand-new historical newspaper dataset (4,000+) that contains only German Fraktur letters, punctuation, and numerals, nothing else (not even Antiqua/Art Deco fonts), using the tesstrain Makefile workflow.
It says:
Wrote unicharset file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/my.unicharset
merge_unicharsets /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm-unicharset /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/my.unicharset "/home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/unicharset"
Loaded unicharset of size 300 from file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm-unicharset
Loaded unicharset of size 106 from file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/my.unicharset
Wrote unicharset file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/unicharset.
...
Loaded unicharset of size 301 from file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/unicharset
...
lstmtraining \
--debug_interval 0 \
--traineddata /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/frk_ulbzd1.traineddata \
--old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/gt4hist_5000k.traineddata \
--continue_from /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm \
--learning_rate 0.0001 \
--model_output /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/checkpoints/frk_ulbzd1 \
--train_listfile /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/list.train \
--eval_listfile /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/frk_ulbzd1/list.eval \
--max_iterations 20000 \
--target_error_rate 0.01 \
--max_image_MB 12000
Loaded file /home/gitlab-runner/builds/Zsfedxu6/0/ulb/ulb-ocr-training-zd1/data/gt4hist_5000k/frk_ulbzd1.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 299 to 300!
Num (Extended) outputs,weights in Series:
1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys64:64, 20736
Lfx96:96, 61824
Lrx96:96, 74112
Lfx384:384, 738816
Fc300:300, 115500
The final output layer is slightly adapted, but it still maps 300 characters, not only the 106 characters from the new training set. Is there something wrong or missing in my approach?
I am using lstmtraining from Tesseract 4.1.1.
@M3ssman if your goal is to throw out the characters that are not in your finetuning dataset, then you need to change the rules for the new unicharset. tesstrain always merges my.unicharset and the START_MODEL's unicharset, but you could simply replace the resulting unicharset with my.unicharset IINM (just before make training).
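A minimal sketch of that replacement, assuming tesstrain's DATA_DIR/MODEL_NAME layout (the directory names and file contents below are placeholders; adjust to your setup):

```shell
# Placeholder layout mimicking tesstrain's data directory
DATA_DIR=$(mktemp -d)
MODEL_NAME=frk_ulbzd1
mkdir -p "$DATA_DIR/$MODEL_NAME"
printf 'merged-301\n'  > "$DATA_DIR/$MODEL_NAME/unicharset"    # merged charset (stand-in)
printf 'gt-only-106\n' > "$DATA_DIR/$MODEL_NAME/my.unicharset" # GT-derived charset (stand-in)
# Overwrite the merged unicharset with the GT-only one, just before `make training`
cp "$DATA_DIR/$MODEL_NAME/my.unicharset" "$DATA_DIR/$MODEL_NAME/unicharset"
cat "$DATA_DIR/$MODEL_NAME/unicharset"
# -> gt-only-106
```

After this, lstmtraining should report the smaller code range, since the output layer is sized from the replaced unicharset.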
@bertsky Yes, thanks, that's what I was looking for. Just do not combine the charsets, and then it behaves as @wrznr pointed out (Code range changed from 299 to 107!). Now I can do some experiments with fine-tuning on specific charsets!
Hello there,
I am currently wondering what happens to the original netspec when using a start model with the Makefile approach. AFAIU, when using an existing model as start point, we cut off the output layer and train on. But what exactly are we training? Only the last output layer?
Further, how can I affect the char categories of the output layer? Given my start model knows 120 chars (which I can deduce from its contained unicharset), but I have 300 chars in my training data. Is this somehow automatically recognized?