Open sifdinNh opened 7 months ago
I want to fine-tune ara.traineddata from the tessdata_best repo to handle elongated words like this:
To do that, I made a list of lines in the same format, like this:
............. الســــــــيد العضـــــو د. عــــلي العتيبــــــي: الســــــــيد العضـــــو جــــمال الحــــربي: الســــــــيد العضـــــو د. خالــــد الفيصـــــل: الســـــــــيد العضـــــو تركـــــي المطيــــري: ..............
I started by generating ground truth files: .tif images and .box files.
Then I started training with this:
make training MODEL_NAME=ara_new TESSDATA=../tesseract/tessdata START_MODEL=ara MAX_ITERATIONS=10000 LANG_TYPE=RTL
Training started at 99% BCER, and I stopped when it reached 24% BCER.
When I evaluated the new traineddata file against the best ara.traineddata, I got a poor result.
This is the result of the best traineddata for Arabic; it gives me almost 90% accuracy:
But when I tested the newly trained file, this is the result:
![sample_5](https://github.com/tesseract-ocr/tesstrain/assets/36017867/1e614c44-3080-4b21-8b41-c09a4a49a868)
It's as if it doesn't recognize anything, and the main reason I started this was to fine-tune the model for better accuracy.
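To compare the two models on equal terms, it may help to score both outputs against the same reference text with one metric. Below is a minimal sketch of a standard edit-distance character error rate in Python. It is not necessarily the same calculation as the BCER printed in the training log. The `strip_tatweel` option is an assumption added here for illustration: it removes the elongation character (U+0640) from both strings before scoring, which would show whether the errors come only from the number of tatweel characters or from the letters themselves.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida), the elongation character


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str, strip_tatweel: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if strip_tatweel:
        reference = reference.replace(TATWEEL, "")
        hypothesis = hypothesis.replace(TATWEEL, "")
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Running `cer(truth, ocr_text)` once with the output of the original ara model and once with the output of ara_new, on the same line image, gives two numbers that can be compared directly.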
@zdenop
I'm not sure whether the issue arises because the model was trained on multi-line TIFF images, but have you tried fine-tuning with single-line text images? Give it a try if you haven't yet, and share the results with us.
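For the single-line approach, tesstrain expects each line image to have a matching `.gt.txt` transcription file with the same base name in `data/MODEL_NAME-ground-truth`. Here is a rough Python sketch that splits a multi-line text into one `.gt.txt` file per line. The function name and the `line_0000` naming scheme are made up for illustration, and the matching single-line images still have to be produced separately.

```python
from pathlib import Path
from typing import List


def write_line_ground_truth(text: str, out_dir: Path, prefix: str = "line") -> List[Path]:
    """Write one <prefix>_NNNN.gt.txt file per non-empty line of `text`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop blank lines and surrounding whitespace so every file holds one line.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    written = []
    for index, line in enumerate(lines):
        path = out_dir / f"{prefix}_{index:04d}.gt.txt"
        path.write_text(line + "\n", encoding="utf-8")
        written.append(path)
    return written
```

For the command in this issue the target directory would be something like `Path("data/ara_new-ground-truth")`, with each image saved next to its transcription, for example `line_0000.tif` beside `line_0000.gt.txt`.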