Closed peterbence3 closed 5 years ago
Besides, I think that 80 lines for training Arabic models is very small, isn't it?!!!
@amitdo is there any tutorial or documentation on how to generate a new langdata? I can contribute by making the Arabic version.
This is a duplicate of https://github.com/tesseract-ocr/tesseract/issues/2695.
Unable to fine-tune Arabic model for font 'Andalus', getting this error:
Please note that the line causing the error is the second-to-last line in the ara.training_txt file, which contains:
&& التسجيل التوقيع ؟؟ المواضيع قد إلا منتدى المنتدى و
I'm using langdata_lstm for generating my training data and the ara.traineddata to continue from.
generating data:
extracting old lstm:
combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm
fine-tuning:
I checked the generated training data and everything seems fine: the tiff files include all the train_text lines, including the line causing the error. I also tried generating training data and fine-tuning for different fonts such as 'Arial' and 'Tahoma', but I still get the same error.
I was thinking about removing the line that triggers the error from the train_text file, but I don't know whether that is safe. Besides, I think that 80 lines is very little for training Arabic models, isn't it? If I decide to train on more lines of data, what should I do, and which files would be affected?
Regards
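
For the question above about removing the line that triggers the error, one low-risk way to experiment is to work on a copy of the training text rather than editing the original. The following is only a minimal sketch: the helper names (pre_last_line, drop_line) and the demo file names are hypothetical, and it does not claim that dropping the line resolves the error or that the result is safe for training. It only shows how to produce a trimmed copy whose second-to-last line has been removed.

```shell
#!/bin/sh
# Sketch only: helper names and file names below are hypothetical.

# Print the number of the second-to-last line of a text file.
# awk counts the final line even when it lacks a trailing newline.
pre_last_line() {
    awk 'END { print NR - 1 }' "$1"
}

# Write a copy of file $1 without line number $2 to file $3.
# The original file is left untouched.
drop_line() {
    sed "${2}d" "$1" > "$3"
}

# Demo on a throwaway four-line file in a temporary directory.
demo_dir=$(mktemp -d)
printf 'line one\nline two\nsuspect line\nlast line\n' > "$demo_dir/demo.training_text"

n=$(pre_last_line "$demo_dir/demo.training_text")
drop_line "$demo_dir/demo.training_text" "$n" "$demo_dir/demo.trimmed"

# Show the line counts before and after trimming.
wc -l < "$demo_dir/demo.training_text"
wc -l < "$demo_dir/demo.trimmed"
```

The same two helpers could be pointed at a copy of the real training text, and the training data regenerated from the trimmed copy to see whether the error still appears.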