tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Training does not result good on RTL languages #355

Closed faizan1041 closed 8 months ago

faizan1041 commented 8 months ago

Hi @Shreeshrii and others,

I'm trying to train on an Arabic dataset, and the results get worse after training:


make training GROUND_TRUTH_DIR=data/gt/arabic-dataset/ \
START_MODEL=ckbLayer MODEL_NAME=ckbLayer2 MAX_ITERATIONS=1000 \
BUILD_TYPE=Layer 

I also tried other start models, such as ara.traineddata from the best and fast repos, but had no luck.

[Screenshot from 2023-10-11 14-59-38]

The gt.txt for this image contains:

محمد عبد الله ريشم خان

This is what the box file contains:


م 0 0 1728 220 0
ح 0 0 1728 220 0
م 0 0 1728 220 0
د 0 0 1728 220 0
  0 0 1728 220 0
ع 0 0 1728 220 0
ب 0 0 1728 220 0
د 0 0 1728 220 0
  0 0 1728 220 0
ا 0 0 1728 220 0
ل 0 0 1728 220 0
ل 0 0 1728 220 0
ه 0 0 1728 220 0
  0 0 1728 220 0
ر 0 0 1728 220 0
ي 0 0 1728 220 0
ش 0 0 1728 220 0
م 0 0 1728 220 0
  0 0 1728 220 0
خ 0 0 1728 220 0
ا 0 0 1728 220 0
ن 0 0 1728 220 0
     0 0 1728 220 0

Result from the original model: محمد عبد اللە ۔ ر یشم خان

Result from the model trained for 1000 steps: يا لي ن دع ر

Note that when I train an English model on an English dataset, I see improvements within 1000 steps or even fewer. Can you advise what the issue is?

faizan1041 commented 8 months ago

I fixed the issue myself. The problem is that the box files are reversed for RTL languages, so you need to reverse the box files again to match LTR. Here is the script I made for this: https://github.com/faizan1041/tesstrain_helpers/blob/main/reverse_box_files.py

faizan1041 commented 8 months ago

Anyone facing the same issue can generate the box files and then run the above script, which replaces each box file's content with the same content in reversed order.
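
For readers who cannot open the link, the reversal described above might look roughly like the sketch below. It is a minimal illustration, not the linked script, and it rests on two assumptions: each box entry has the form `<symbol> <left> <bottom> <right> <top> <page>`, and each text line ends with an entry whose symbol is a tab character. The function name `reverse_box_lines` is hypothetical. The sketch reverses the entries within each text line and keeps the tab end-of-line entry last.

```python
def reverse_box_lines(box_text: str) -> str:
    """Reverse the box entries of each text line.

    Assumes (not confirmed by the thread) that a text line is terminated by
    an entry whose symbol is a tab, and keeps that entry in last position.
    """
    output = []
    current_line = []
    for entry in box_text.split("\n"):
        if not entry:
            continue  # skip blank lines, e.g. after the final newline
        # The symbol is whatever precedes the last five numeric fields,
        # so a space or tab symbol is preserved correctly.
        symbol = entry.rsplit(" ", 5)[0]
        if symbol == "\t":
            output.extend(reversed(current_line))
            output.append(entry)
            current_line = []
        else:
            current_line.append(entry)
    # Entries that were not followed by an end-of-line marker.
    output.extend(reversed(current_line))
    return "\n".join(output) + "\n"
```

To apply it, one could read each `.box` file in the ground-truth directory as UTF-8, pass its text through the function, and write the result back. Keep a backup of the generated box files first, since the rewrite is done in place.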