tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Can't encode transcription #147

Closed peterbence3 closed 4 years ago

peterbence3 commented 4 years ago

I am unable to fine-tune the Arabic model for the font 'Andalus'. I get this error:

Encoding of string failed! Failure bytes: 26 26
Can't encode transcription: 'و ىدتنملا ىدتنم الإ دق عيضاوملا ؟؟ عيقوتلا ليجستلا &&' in language ''
Encoding of string failed! Failure bytes: 3d 3d 20 ffffffd9 ffffff89 ffffffd9 ffffff81 20 ffffffd9 ffffff88 ffffffd8 ffffffa3 20 ffffffd9 ffffff84 ffffffd8 ffffffa8 ffffffd9 ffffff82 20 ffffffd9 ffffff89 ffffffd8 ffffffaf ffffffd8 ffffffaa ffffffd9 ffffff86 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd9 ffffff86 ffffffd9 ffffff85 20 ffffffd9 ffffff86 ffffffd9 ffffff88 ffffffd9 ffffff83 ffffffd8 ffffffaa 20 ffffffd8 ffffffa9 ffffffd8 ffffffad ffffffd9 ffffff81 ffffffd8 ffffffb5 ffffffd9 ffffff84 ffffffd8 ffffffa7 20 ffffffd8 ffffffa9 ffffffd9 ffffff83 ffffffd8 ffffffb1 ffffffd8 ffffffa7 ffffffd8 ffffffb4 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7

Please note that the line causing the error is the second-to-last line of the ara.training_text file, which contains: && التسجيل التوقيع ؟؟ المواضيع قد إلا منتدى المنتدى و
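To see which characters the error refers to, the "Failure bytes" dumps can be turned back into text. The following is a minimal Python sketch, not part of Tesseract; the function name decode_failure_bytes is hypothetical, and it assumes that each token in the dump is one UTF-8 byte printed in hex, with bytes of 0x80 and above sign-extended (for example ffffffd9).

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes:' hex dump back into the text it represents."""
    # Each token is assumed to be one byte printed in hex. Bytes >= 0x80 show
    # up sign-extended (e.g. ffffffd9), so keep only the low 8 bits.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # errors="replace" avoids an exception if the dump is cut off mid-character.
    return raw.decode("utf-8", errors="replace")


# The first dump from the log, and the first few bytes of the second one.
print(decode_failure_bytes("26 26"))
print(decode_failure_bytes("3d 3d 20 ffffffd9 ffffff89"))
```

The first dump decodes to "&&" and the second starts with "==", so the bytes reported at the point of failure appear to be ASCII symbols rather than Arabic letters. A plausible reading, which this thread does not confirm, is that these symbols are missing from the unicharset of ara.traineddata.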

I'm using langdata_lstm for generating my training data and the ara.traineddata to continue from.

generating data:

../tesseract/src/training/tesstrain.sh --fonts_dir fonts/win7df \
         --fontlist 'Andalus' \
         --lang ara \
         --linedata_only \
         --langdata_dir ../langdata_lstm \
         --tessdata_dir ../tesseract/tessdata \
         --save_box_tiff \
         --maxpages 10 \
         --output_dir train

extracting old lstm: combine_tessdata -e ../tesseract/tessdata/ara.traineddata ara.lstm

fine-tuning:

rm -rf output/*
OMP_THREAD_LIMIT=8 lstmtraining \
    --continue_from ara.lstm \
    --model_output output/araNewModel \
    --traineddata ../tesseract/tessdata/ara.traineddata \
    --train_listfile train/ara.training_files.txt \
    --max_iterations 400

I checked the generated training data and everything looks fine: the TIFF files include all the training-text lines, including the line that causes the error. I also tried generating training data and fine-tuning with other fonts, such as 'Arial' and 'Tahoma', but I still get the same error.

I was thinking about removing the offending line from the training-text file, but I don't know whether that is safe. Besides, I think that 80 lines for training Arabic models is very small, isn't it?!!! If I decide to train on more lines of data, what should I do, and which files would be affected?
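One way to find out in advance which training-text lines would be rejected is to compare each line against the characters the model knows. The Python sketch below is only an approximation under stated assumptions: that the error arises from characters absent from the model's unicharset, that the unicharset file has an entry count on its first line followed by one entry per line starting with the unichar, and that NULL, Joined and |Broken|0|1 are special placeholder entries. Both function names are hypothetical, and the demo data uses ASCII placeholders instead of a real Arabic unicharset (which could, for instance, be unpacked from ara.traineddata with combine_tessdata -u).

```python
# Entries assumed to be placeholders rather than literal text.
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def load_unicharset_chars(unicharset_lines):
    """Collect every character that occurs in any unicharset entry."""
    chars = set()
    for line in unicharset_lines[1:]:  # assumed: line 0 is the entry count
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]  # assumed: unichar comes first
        if unichar not in SPECIAL_ENTRIES:
            chars.update(unichar)
    return chars


def split_encodable(text_lines, allowed_chars):
    """Partition lines into (kept, rejected) by character coverage."""
    kept, rejected = [], []
    for line in text_lines:
        # Whitespace is assumed to be always encodable.
        missing = {c for c in line if not c.isspace() and c not in allowed_chars}
        (rejected if missing else kept).append(line)
    return kept, rejected


# Synthetic demo data: a tiny made-up unicharset and two text lines.
demo_unicharset = ["3", "NULL 0", "a 3", "b 3"]
allowed = load_unicharset_chars(demo_unicharset)
kept, rejected = split_encodable(["ab ba", "a && b"], allowed)
print("kept:", kept)
print("rejected:", rejected)
```

The check is per character, so it ignores any normalization Tesseract applies and treats a character as covered even if it only occurs inside a multi-character entry. It is meant as a quick screen before regenerating the training data, not as a replacement for running the training tools.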

Regards

amitdo commented 4 years ago

> Besides, I think that 80 lines for training Arabic models is very small, isn't it?!!!

https://github.com/tesseract-ocr/langdata_lstm/issues/6

peterbence3 commented 4 years ago

@amitdo Is there any tutorial or documentation on how to generate new langdata? I can contribute by making the Arabic version.

stweil commented 4 years ago

This is a duplicate of https://github.com/tesseract-ocr/tesseract/issues/2695.