mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

[Not an Issue] How to keep the trained model as close as possible to groundtruth #646

Open johnlockejrr opened 1 month ago

johnlockejrr commented 1 month ago

I am trying to train a segmentation model for a modern printed Judaeo-Arabic dataset. The problem I face is that the trained model mostly loses the vowel signs below the line. What can be done? I have tried both training from scratch and fine-tuning.

ketos segtrain --line-width 10 -mr Main:textzone --precision 16 -d cuda:0 -f page -t output.txt --resize both -tl -i /home/incognito/kraken-train/teyman_print/biblialong02_se3_2_tl.mlmodel -q early --min-epochs 80 -o /home/incognito/kraken-train/teyman_print/teyman_print_scr_cl/teyman_print_tl_v3

Manual segmentation as ground truth: [image: manual]

Segmentation with the newly trained model (small dataset, preliminary): [image: trained]

johnlockejrr commented 1 month ago

Should I try training it with a centerline instead of the topline that is standard for Hebrew?
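
If so, the change would be made at training time. A sketch of the adjusted invocation, assuming ketos segtrain exposes a --centerline switch as the counterpart of the -tl/--topline flag used above (please verify with ketos segtrain --help), with the same paths as the earlier command:

ketos segtrain --line-width 10 -mr Main:textzone --precision 16 -d cuda:0 -f page -t output.txt --resize both --centerline -i /home/incognito/kraken-train/teyman_print/biblialong02_se3_2_tl.mlmodel -q early --min-epochs 80 -o /home/incognito/kraken-train/teyman_print/teyman_print_scr_cl/teyman_print_tl_v3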

dstoekl commented 1 month ago

I don't think it will help. Use the API to improve the polygons by calculating the average line distance and extrapolating from there.
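
Concretely, the idea would be something like the following rough sketch against the kraken Python API: segment the page, measure the typical inter-baseline distance, and grow each line polygon downwards by a fraction of it. It assumes kraken 4.x, where blla.segment() returns a dict whose 'lines' entries carry 'baseline' and 'boundary' point lists (in kraken 5.x the same data lives on a Segmentation dataclass); the model path, image path, and padding fraction are placeholders, not values from this thread.

```python
import statistics

from PIL import Image
from kraken import blla
from kraken.lib import vgsl

# Placeholder paths, not from this thread.
model = vgsl.TorchVGSLModel.load_model('teyman_print_tl_v3_best.mlmodel')
im = Image.open('page.png')
seg = blla.segment(im, model=model)

# Median vertical gap between consecutive baselines, used as the line distance.
baseline_ys = sorted(statistics.mean(p[1] for p in line['baseline'])
                     for line in seg['lines'])
gaps = [b - a for a, b in zip(baseline_ys, baseline_ys[1:])]
line_dist = statistics.median(gaps) if gaps else 0

# Extrapolate: push boundary points that lie below the baseline further down
# by a fraction of the line distance, so vowel signs under the line stay
# inside the polygon. The fraction is a knob to tune against the ground truth.
PAD = 0.25
for line in seg['lines']:
    base_y = statistics.mean(p[1] for p in line['baseline'])
    line['boundary'] = [[x, y + PAD * line_dist if y > base_y else y]
                        for x, y in line['boundary']]
```

The below-baseline test and the fixed fraction are crude; the point is only that the polygons can be corrected after segmentation instead of expecting the model to capture every diacritic on its own.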

johnlockejrr commented 1 month ago

> I don't think it will help. Use the API to improve the polygons by calculating the average line distance and extrapolating from there.

It's not a problem with the dataset but with the model output. Use the API to do what? The model should perform better.

Are you perhaps aware of a Hebrew segmentation model that can properly handle nikkud and cantillation?