tesseract-ocr / tesstrain

Train Tesseract LSTM with make

Fine-tuning Arabic traineddata to solve extended words issue #362

Open sifdinNh opened 7 months ago

sifdinNh commented 7 months ago

I want to fine-tune ara.traineddata from the tessdata_best repo so that it handles extended words like this:

sample_9

To do that, I made a list of lines in the same format, like this:

.............
الســــــــيد العضـــــو د. عــــلي العتيبــــــي:
الســــــــيد العضـــــو جــــمال الحــــربي:
الســــــــيد العضـــــو د. خالــــد الفيصـــــل:
الســـــــــيد العضـــــو تركـــــي المطيــــري:
..............
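For readers unfamiliar with the term, the "extension" in lines like these is normally produced by repeating the Arabic tatweel (kashida) character, U+0640, between letters. The snippet below is a minimal sketch for inspecting such lines; the helper names `count_tatweel` and `strip_tatweel` are hypothetical and are not part of tesstrain or Tesseract.

```python
# Minimal sketch: inspect tatweel (kashida) elongation in Arabic text.
# The helper names are hypothetical, not part of tesstrain or Tesseract.

TATWEEL = "\u0640"  # ARABIC TATWEEL, the character used to stretch words


def count_tatweel(text: str) -> int:
    """Return how many tatweel characters the text contains."""
    return text.count(TATWEEL)


def strip_tatweel(text: str) -> str:
    """Return the text with every tatweel character removed."""
    return text.replace(TATWEEL, "")


if __name__ == "__main__":
    # The letters alef, lam, seen, then eight tatweels, then yeh, dal.
    stretched = "\u0627\u0644\u0633" + TATWEEL * 8 + "\u064a\u062f"
    print(count_tatweel(stretched))   # number of elongation characters
    print(strip_tatweel(stretched))   # the same word without elongation
```

Counting tatweels per line is one way to check that the ground-truth text and the rendered images really contain the elongation the model is meant to learn.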

I started by generating ground-truth files: .tif images and .box files.

Then I started training with this command:

make training MODEL_NAME=ara_new TESSDATA=../tesseract/tessdata START_MODEL=ara MAX_ITERATIONS=10000 LANG_TYPE=RTL

Training started at 99% BCER, and I stopped when it reached 24% BCER.

When I tested the new traineddata file and compared it with the best ara.traineddata, I got a poor result.

This is the result with the best traineddata for Arabic: sample_5. It gives me almost 90% accuracy.

But when I tested the newly trained file, this is the result: sample_5

It is as if it doesn't recognize anything, and the main reason I started this was to fine-tune the model for better accuracy.
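To put a number on the comparison between the two models, one option is a character error rate (CER) computed against a hand-checked transcription of the test image. The sketch below uses the common definition, Levenshtein edit distance divided by reference length. It is an assumption that this is a useful stand-in here; it is not necessarily how lstmtraining computes the BCER it reports, and the function names are hypothetical.

```python
# Minimal sketch: character error rate between a reference transcription
# and OCR output. Standard definition, not Tesseract's own BCER code.


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

Running `cer(truth, output_of_ara)` and `cer(truth, output_of_ara_new)` on the same image gives two directly comparable figures, where `truth` and the two output strings are whatever text you have on hand for that image.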

sifdinNh commented 7 months ago

@zdenop

AhmadHakami commented 6 months ago

I'm not sure whether the issue arises because the model was trained on multi-line TIFF images, but have you tried fine-tuning with single-line text images? If not, give it a try and share the results with us.
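The single-line suggestion could be set up along these lines. tesstrain's README describes ground truth as one line image per sample with a matching `.gt.txt` transcription, so this sketch writes one `.gt.txt` file per text line. The function name, the `line_0001` naming scheme, and the `data/ara_new-ground-truth` directory in the usage note are assumptions for illustration. The matching single-line images (for example `line_0001.tif`) still have to be rendered separately.

```python
# Minimal sketch: write one .gt.txt transcription file per text line.
# Function name and file-name pattern are hypothetical choices.
from pathlib import Path


def write_line_ground_truth(lines, out_dir, prefix="line"):
    """Write each non-empty line to its own <prefix>_NNNN.gt.txt file.

    Returns the list of paths written, in order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    index = 0
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines so image/text numbering stays dense
        index += 1
        path = out / f"{prefix}_{index:04d}.gt.txt"
        path.write_text(text + "\n", encoding="utf-8")
        written.append(path)
    return written
```

A call such as `write_line_ground_truth(open("lines.txt", encoding="utf-8"), "data/ara_new-ground-truth")` would produce the text half of each pair, with `lines.txt` standing in for wherever the line list is stored.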