tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

compute ctc target failed #2395

Open nijanthan0 opened 5 years ago

nijanthan0 commented 5 years ago

surasystem@surasystem:~$ lstmtraining --traineddata data/tamtrain/tamtrain.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --net_spec '[Lfx256 O1c111]' --model_output data/checkpoints --learning_rate 20e-4 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 3000
Loaded file data/tam/tam.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 99 to 104!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys48:48, 12480
  Lfx96:96, 55680
  Lrx96:96, 74112
  Lfx192:192, 221952
  Fc104:104, 20072
Total weights = 384456
Previous null char=2 mapped to 103
Continuing from data/tam/tam.lstm
Loaded 54/54 pages (1-54) of document data/ground-truth/out8.lstmf
Loaded 57/57 pages (1-57) of document data/ground-truth/tam.TAMu_Kadambri.exp0.lstmf
Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf
Loaded 8/8 pages (1-8) of document data/ground-truth/out5.lstmf
Loaded 28/28 pages (1-28) of document data/ground-truth/out2.lstmf
Loaded 57/57 pages (1-57) of document data/ground-truth/out3.lstmf
Loaded 56/56 pages (1-56) of document data/ground-truth/out4.lstmf
Loaded 55/55 pages (1-55) of document data/ground-truth/out9.lstmf
Loaded 58/58 pages (1-58) of document data/ground-truth/out6.lstmf
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!

What should I do to overcome this issue?

Shreeshrii commented 5 years ago

--old_traineddata tesseract/tessdata/tam.traineddata

Is this file taken from tessdata_best repo?

lstmtraining --traineddata data/tamtrain/tamtrain.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --net_spec '[Lfx256 O1c111]' --model_output data/checkpoints --learning_rate 20e-4 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 3000

Run your command with --debug_level -1 and share console output and also the training_text used.

nijanthan0 commented 5 years ago

Yes, I am using the tessdata_best files.

nijanthan0 commented 5 years ago

lstmtraining --traineddata data/tamtrain/tamtrain.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --net_spec '[Lfx256 O1c111]' --model_output data/checkpoints --debug_level -1 --learning_rate 20e-4 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 3000
Loaded file data/checkpoints_checkpoint, unpacking...
Successfully restored trainer from data/checkpoints_checkpoint
Loaded 54/54 pages (1-54) of document data/ground-truth/out8.lstmf
Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf
Loaded 8/8 pages (1-8) of document data/ground-truth/out5.lstmf
Loaded 28/28 pages (1-28) of document data/ground-truth/out2.lstmf
Loaded 58/58 pages (1-58) of document data/ground-truth/out6.lstmf
Loaded 57/57 pages (1-57) of document data/ground-truth/out3.lstmf
Loaded 55/55 pages (1-55) of document data/ground-truth/out9.lstmf
Loaded 56/56 pages (1-56) of document data/ground-truth/out4.lstmf
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Encoding of string failed!
Failure bytes: 23 20 2d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff88 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d 20 ffffffe0 ffffffae ffffff89 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb1 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8a ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffff99 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d 20 32 36 20 2d 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 35 Can't encode transcription: 'வயது : 01.01. 2019 ல் # -துணை பட்டியலில் உள்ளவாறு திருத்தப்பட்டுள்ளது மொத்த பக்கங்கள் 26 - பக்கம் 5' in language '' Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! 
Compute CTC targets failed! Compute CTC targets failed! Encoding of string failed! Failure bytes: 23 ffffffc2 ffffffa3 34 30 31 30 20 31 36 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 33 36 20 31 30 31 30 20 31 36 Can't encode transcription: '- 7010 16 வீட்டு எண். % #£4010 16 வீட்டு எண். 36 1010 16' in language '' Compute CTC targets failed! Encoding of string failed! Failure bytes: ffffffe0 ffffffaf ffffff8c ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffae ffffffbf 20 2d Can't encode transcription: 'பெயர்: கீதா - பெயர். கௌரி -' in language '' Encoding of string failed! Failure bytes: 23 30 30 34 30 20 31 38 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 39 20 23 31 30 34 30 20 31 38 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 39 20 31 30 34 30 20 31 38 Can't encode transcription: 'வீட்டு எண். 18 #0040 18 வீட்டு எண். 19 #1040 18 வீட்டு எண். 19 1040 18' in language '' Encoding of string failed! 
Failure bytes: 5c ffffffe0 ffffffaf ffffffa8 20 7c 20 7c 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 3a 20 32 32 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffa9 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 3a ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff86 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 20 26 ffffffe0 ffffffae ffffffb5 20 7c 20 7c 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 3a 20 32 35 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffa9 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 3a ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff86 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 20 32 2f 32 31 31 25 30 31 Can't encode transcription: 'வயது: 37 பாலினம் :ஆண் ஃலிஸ்\௨ | | வயது: 22 பாலினம் :பெண் &வ | | வயது: 25 பாலினம் :பெண் 2/211%01' in language '' Encoding of string failed! Failure bytes: 23 30 34 30 20 31 35 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 37 20 23 30 30 34 30 20 31 36 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 33 38 20 37 30 34 30 20 31 35 Can't encode transcription: 'வீட்டு எண். 2 £#040 15 வீட்டு எண். 17 #0040 16 வீட்டு எண். 138 7040 15' in language '' Encoding of string failed! 
Failure bytes: 23 36 34 30 20 3d 20 7c 20 7c 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 34 2d 26 20 31 ffffffc2 ffffffa3 31 30 34 30 20 31 32 Can't encode transcription: 'குப்புசாமிநாயக்கர் - £1640 | | | வீட்டு எண். 14-& #640 = | | வீட்டு எண். 14-& 1£1040 12' in language '' Compute CTC targets failed! Compute CTC targets failed! Encoding of string failed! Failure bytes: 23 20 2d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff88 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d 20 ffffffe0 ffffffae ffffff89 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb1 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8a ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffff99 ffffffe0 
ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d 20 32 36 20 2d 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 36 Can't encode transcription: 'வயது : 01.01. 2019 ல் # -துணை பட்டியலில் உள்ளவாறு திருத்தப்பட்டுள்ளது மொத்த பக்கங்கள் 26 - பக்கம் 6' in language '' Compute CTC targets failed! Encoding of string failed! Failure bytes: 23 20 2d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff88 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d 20 ffffffe0 ffffffae ffffff89 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb1 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8a ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffff99 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf 
ffffff8d 20 32 36 20 2d 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 39
Can't encode transcription: 'வயது : 01.01. 2019 ல் # -துணை பட்டியலில் உள்ளவாறு திருத்தப்பட்டுள்ளது மொத்த பக்கங்கள் 26 - பக்கம் 9' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Encoding of string failed!
Failure bytes: 23 31 30 34 30 20 31 38 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 33 34 20 31 30 34 30 20 31 36
Can't encode transcription: 'வீட்டு எண். 34 2040 16 வீட்டு எண். 34 #1040 18 வீட்டு எண். 34 1040 16' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!

This is the output after adding the debug level option. Attached logs: out2.txt, out3.txt, out4.txt, out5.txt, out6.txt, out7.txt
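The `Failure bytes` dumps in the log above print every byte as sign-extended hexadecimal (for example `ffffffe0` for the byte `0xe0`), which makes them hard to read. A minimal sketch, assuming the values are the UTF-8 bytes of the transcription and that each dump begins at the position where encoding failed, turns a dump back into text. The function name `decode_failure_bytes` is made up for illustration and is not part of Tesseract.

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes' dump from the training log into readable text.

    Assumption: each whitespace-separated token is one byte printed in hex,
    with bytes >= 0x80 sign-extended to 32 bits ('ffffffe0' means 0xe0).
    """
    # Keep only the low 8 bits of each token to undo the sign extension.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # errors="replace" keeps the sketch usable if a dump is cut mid-character.
    return raw.decode("utf-8", errors="replace")


# The first few bytes of two dumps copied from the log above.
print(decode_failure_bytes("23 20 2d ffffffe0 ffffffae ffffffa4"))
print(decode_failure_bytes("5c ffffffe0 ffffffaf ffffffa8 20 7c"))
```

Decoded this way, every dump in this log starts with `#`, `\` or `ௌ`. That suggests these characters cannot be encoded with the unicharset in use, but this is an inference from the log rather than something confirmed in the thread.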

Shreeshrii commented 5 years ago

How did you create the box files and lstmf files?

Loaded 54/54 pages (1-54) of document data/ground-truth/out8.lstmf
Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf
Loaded 8/8 pages (1-8) of document data/ground-truth/out5.lstmf
Loaded 28/28 pages (1-28) of document data/ground-truth/out2.lstmf
Loaded 58/58 pages (1-58) of document data/ground-truth/out6.lstmf
Loaded 57/57 pages (1-57) of document data/ground-truth/out3.lstmf
Loaded 55/55 pages (1-55) of document data/ground-truth/out9.lstmf
Loaded 56/56 pages (1-56) of document data/ground-truth/out4.lstmf


nijanthan0 commented 5 years ago

I created the box files using the lstmbox command and the lstmf files using lstm.train.

Shreeshrii commented 5 years ago

What about Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf?

Does the Impact_Condensed font support Tamil?

The problem is related to your input files. Please share training text or image and box pair.


nijanthan0 commented 5 years ago

Ma'am, I did not know what to give as input for --eval_listfile, so I manually created one file with the Impact_Condensed font and put it in the eval list file.

These are my files: ground-truth.zip

Shreeshrii commented 5 years ago

Ma'am, I did not know what to give as input for --eval_listfile, so I manually created one file with the Impact_Condensed font and put it in the eval list file.

You have a large number of training files; use one of them for eval (e.g. ocr2).

I am wondering whether Compute CTC targets failed! is related to the impact_condensed eval file.

I will test further with all the files you sent and get back.
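The suggestion to hold out one of the existing training files for evaluation could be sketched as follows. This assumes the files passed to `--train_listfile` and `--eval_listfile` are plain text with one `.lstmf` path per line. The helper names `split_listfile` and `write_listfile`, and the choice to hold out the last entries after sorting, are illustrative only.

```python
from typing import List, Tuple


def split_listfile(paths: List[str], eval_count: int = 1) -> Tuple[List[str], List[str]]:
    """Split lstmf paths into non-overlapping (train, eval) lists."""
    if not 0 < eval_count < len(paths):
        raise ValueError("eval_count must leave at least one training file")
    ordered = sorted(paths)  # sorting makes the split reproducible
    return ordered[:-eval_count], ordered[-eval_count:]


def write_listfile(filename: str, paths: List[str]) -> None:
    """Write one path per line (assumed list-file layout)."""
    with open(filename, "w", encoding="utf-8") as handle:
        handle.write("\n".join(paths) + "\n")


files = [
    "data/ground-truth/out2.lstmf",
    "data/ground-truth/out3.lstmf",
    "data/ground-truth/out4.lstmf",
]
train_files, eval_files = split_listfile(files, eval_count=1)
print(train_files)
print(eval_files)
```

Keeping the two lists disjoint means the evaluation error is measured on lines the model did not train on.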

nijanthan0 commented 5 years ago

No, ma'am, "Compute CTC targets failed!" is not related to the Impact_Condensed eval file.

Shreeshrii commented 5 years ago

These are my files: ground-truth.zip

The zip file has the OCRed text for the images. The ground truth needs to be the correct transcription for the images.

nijanthan0 commented 5 years ago

But I am not using text file in the training process.

Shreeshrii commented 5 years ago

Training uses box/tiff pairs to create the lstmf files. If you give the wrong text for an image, all of the training will be wrong. Your box files contain the same incorrect text.

Shreeshrii commented 5 years ago

I tested by using the wordstrbox (without correcting the text).
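To review or correct the text before training, the transcriptions can be pulled out of the box files. This is a minimal sketch, assuming the WordStr box layout in which each text line is stored as `WordStr <left> <bottom> <right> <top> <page> #<text>` followed by a separate end-of-line row. The function name `wordstr_transcriptions` and the sample coordinates are hypothetical.

```python
from typing import List


def wordstr_transcriptions(box_text: str) -> List[str]:
    """Return the transcription of every WordStr row in a box file's text."""
    transcriptions = []
    for row in box_text.splitlines():
        if not row.startswith("WordStr "):
            continue  # skip end-of-line marker rows and anything else
        # The text follows the first '#'; the coordinates before it never
        # contain '#', so a '#' inside the transcription itself is preserved.
        _, separator, text = row.partition("#")
        if separator:
            transcriptions.append(text)
    return transcriptions


sample = (
    "WordStr 10 20 300 60 0 #வீட்டு எண். 19 #1040 18\n"
    "\t 300 20 304 60 0\n"
)
for line in wordstr_transcriptions(sample):
    print(line)
```

Each extracted line can then be compared against its image and corrected by hand where the OCRed text is wrong.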

lstmtraining \
  --model_output build/poll \
  --continue_from ~/tessdata_best/script/Tamil.lstm \
  --traineddata ~/tessdata_best/script/Tamil.traineddata \
  --train_listfile build/tam.poll.training_files.txt \
  --debug_interval -1
Loaded file /home/ubuntu/tessdata_best/script/Tamil.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/ubuntu/tessdata_best/script/Tamil.lstm
Loaded 54/54 lines (1-54) of document input/out10.lstmf
Loaded 53/53 lines (1-53) of document input/out11.lstmf
Loaded 8/8 lines (1-8) of document input/out5.lstmf
Loaded 57/57 lines (1-57) of document input/out7.lstmf
Loaded 57/57 lines (1-57) of document input/out3.lstmf
Loaded 54/54 lines (1-54) of document input/out8.lstmf
Loaded 56/56 lines (1-56) of document input/out4.lstmf
Loaded 58/58 lines (1-58) of document input/out6.lstmf
Loaded 55/55 lines (1-55) of document input/out9.lstmf
Iteration 0: GROUND TRUTH : பெயர்‌: மோகனா - பெயர்‌: மாதவன்‌ - பெயர்‌: தமிழ்ச்செல்வி -
Iteration 0: ALIGNED TRUTH : பெயர்‌: மோகனா - பெயர்‌: மாதவன்‌ - பெயர்‌: தமிழ்ச்செல்வி -
Iteration 0: BEST OCR TEXT : பெயர்‌: மோகனா - F|பெயர்‌: மாதவன்‌- . |-|பெயர்‌: தமிழ்ச்செல்வி -
File input/out10.lstmf line 0 : Mean rms=2.079%, delta=6.897%, train=11.111%(44.444%), skip ratio=0%
Iteration 1: GROUND TRUTH : சட்டமன்றத்‌ தொகுதி எண்‌ மற்றும்‌ பெயர்‌ : 36-உத்திரமேரூர்‌ பாகம்‌ எண்‌: 1
Iteration 1: ALIGNED TRUTH : சட்டமன்றத்‌ தொகுதி எண்‌ மற்றும்‌ பெயர்‌ : 36-உத்திரமேரூர்‌ பாகம்‌ எண்‌:
Iteration 1: BEST OCR TEXT : சட்டமன்றத்‌ தொகுதி எண்‌ மற்றும்‌ பெயர்‌ : %-உத்திரமேரர்‌ .......எபாகம்‌ எண்‌: 1
File input/out11.lstmf line 0 : Mean rms=2.03%, delta=4.742%, train=15.257%(32.222%), skip ratio=0%
Iteration 2: GROUND TRUTH : பெயர்‌: கிருட்டினன்‌ - பெயர்‌: வேதவல்லி - பெயர்‌: குப்பன்‌ -
Iteration 2: ALIGNED TRUTH : பெயர்‌: கிருட்டினன்‌ - பெயர்‌: வேதவல்லி - பெயர்‌: குப்பன்‌ -
Iteration 2: BEST OCR TEXT : பெயர்‌: கிருட்டினன்‌- [பெயர்‌: வேதவல்லி- [பெயர்‌: குப்பன்‌ -
File input/out3.lstmf line 0 : Mean rms=1.848%, delta=4.556%, train=13.148%(43.704%), skip ratio=0%
Iteration 3: GROUND TRUTH : கணவர்‌ பெயர்‌: முருகன்‌ - தந்தை பெயர்‌: காசி - தந்தை பெயர்‌: இராமன்‌ -
Iteration 3: ALIGNED TRUTH : கணவர்‌ பெயர்‌: முருகன்‌ - தந்தை பெயர்‌: காசி - தந்தை பெயர்‌: இராமன்‌ -
Iteration 3: BEST OCR TEXT : கணவர்‌ பெயர்‌: முருகன்‌- | ந|[தந்தைபெயர்‌ காச- [ ந|தந்தை பெயர்‌: இராமன்‌ -
File input/out4.lstmf line 0 : Mean rms=1.916%, delta=4.98%, train=13.707%(47.361%), skip ratio=0%
Iteration 4: GROUND TRUTH : பெயர்‌: கீதா - பெயர்‌: கெளரி -
Iteration 4: ALIGNED TRUTH : பெயர்‌: கீதா - பெயர்‌: கெளரி -
Iteration 4: BEST OCR TEXT : பெயர்‌. கதா - |[பெயர்‌: கெளரி -
File input/out5.lstmf line 0 : Mean rms=1.894%, delta=4.984%, train=15.103%(47.889%), skip ratio=0%
Iteration 5: GROUND TRUTH : தந்த பெயர்‌: குட்டியப்பன்‌ - கணவர்‌ பெயர்‌: முனுசாமி - தந்த பெயர்‌: கன்னியப்பன்‌ -
Iteration 5: ALIGNED TRUTH : தந்த பெயர்‌: குட்டியப்பன்‌ - கணவர்‌ பெயர்‌: முனுசாமி - தந்த பெயர்‌: கன்னியப்பன்‌ -
Iteration 5: BEST OCR TEXT : தந்த பெயர்‌: குட்டியப்பன்‌- | |[|கணவர்‌ பெயர்‌: முனுசாமி- | |[[தந்தை பெயர்‌: கன்னியப்பன்‌ -
File input/out6.lstmf line 0 : Mean rms=1.904%, delta=5.48%, train=14.751%(48.241%), skip ratio=0%
Iteration 6: GROUND TRUTH : வயது: ஏ பாலினம்‌ :ஆண்‌ வயது: % பாலினம்‌ ஆண்‌ வயது: 4 பாலினம்‌ :பெண்‌
Iteration 6: BEST OCR TEXT : வயது: ஏ பாலினம்‌ :ஆண்‌ | Aிleble ||வயது: % பாலினம்‌ :ஆண்‌ | rileble [வயது: & பாலினம்‌ :-பெண்‌
File input/out7.lstmf line 0 : Mean rms=2.04%, delta=7.057%, train=18.539%(47.302%), skip ratio=0%
Iteration 7: GROUND TRUTH : வயது: ஏ பாலினம்‌ :ஆண்‌ Available ||வயது: 22 பாலினம்‌ பெண்‌ Available ||வயது: 25 பாலினம்‌ பெண்‌ Available
Iteration 7: ALIGNED TRUTH : வயது: ஏ பாலினம்‌ :ஆண்‌ Available ||வயது: 22 பாலினம்‌ பெண்‌ Avllable ||வயது: 25 பாலினம்‌ பெண்‌ Available
Iteration 7: BEST OCR TEXT : வயது: ஏ பாலினம்‌ :ஆண்‌ | இவிஷ்ீ |[வயது: 2 பாலினம்‌ பெண்‌ | விஷ்ச [[வயது: 25 பாலினம்‌ -பெண்‌ | vailable
File input/out8.lstmf line 0 : Mean rms=2.089%, delta=7.843%, train=20.893%(47.222%), skip ratio=0%
Iteration 8: GROUND TRUTH : TRQO0226621 TN/O5/026/0393067 TN/O5/026/0393295
Iteration 8: ALIGNED TRUTH : TRQO0226621 TN/O5/026/0393067 TN/O5/026/0393295
Iteration 8: BEST OCR TEXT : TRQ0226621|[ V TNOSIO260393067 [ TNO5I026/0393295
File input/out9.lstmf line 0 : Mean rms=2.099%, delta=8.025%, train=22.507%(53.086%), skip ratio=0%
Iteration 9: GROUND TRUTH : வீட்டு எண்‌: 41 Photo is வீட்டு எண்‌: 41 Photo is வீட்டு எண்‌: 41 Photo is
Iteration 9: ALIGNED TRUTH : வீட்டு எண்‌: 41 Photo is வீட்டு எண்‌: 441 Photo is வீட்டு எண்‌: 41 Photo is
Iteration 9: BEST OCR TEXT : வீட்டுஎண்‌4 | Photois |வீட்டுஎண்‌்4 | Photois
|வீட்டுஎண்‌41 | Photos File input/out10.lstmf line 1 : Mean rms=2.091%, delta=8.12%, train=22.618%(57.778%), skip ratio=0% Iteration 10: GROUND TRUTH : தந்த பெயர்‌: சின்னபையன்‌ - தந்த பெயர்‌: சின்னபையன்‌ - கணவர்‌ பெயர்‌: சங்கர்‌ - Iteration 10: ALIGNED TRUTH : தந்த பெயர்‌: சின்னபையன்‌ - தந்த பெயர்‌: சின்னபையன்‌ - கணவர்‌ பெயர்‌: சங்கர்‌ - Iteration 10: BEST OCR TEXT : தந்த பெயர்‌: சின்னபையன்‌- | |தந்தை பெயர்‌: சின்னபையன்‌ - | [கணவர்‌ பெயர்‌: சங்கர்‌ - File input/out11.lstmf line 1 : Mean rms=2.046%, delta=7.87%, train=21.193%(55.556%), skip ratio=0% Iteration 11: GROUND TRUTH : - Photo is வீட்டு எண்‌: 4 Photo is வீட்டு எண்‌: 4 Photo is Iteration 11: ALIGNED TRUTH : ------------- Photo is வீட்டு எண்‌: 4 Photo is வீட்டு எண்‌: 4 444 Photo is Iteration 11: BEST OCR TEXT : D [ Photois வீட்டுஎண்‌:4 Photois |[வீட்டுிஎண்‌:4: | Photois

Shreeshrii commented 5 years ago

tamil.zip

This zip file has box files for your images in wordstr format. The text for each line needs to be corrected to match the image. Then you can use these box files with your images to create the lstmf files and then use them for lstmtraining.

However, some errors may be caused by incorrect layout analysis, and more training will not fix those.

You need to use some other method (OpenCV, uzn files, etc.) to mark the areas and then recognize them separately.

Shreeshrii commented 5 years ago

Anyway, in all my testing, I didn't get the error "Compute CTC targets failed!".

nijanthan0 commented 5 years ago

Can we directly use the wordstr box file for training?

Shreeshrii commented 5 years ago

The text for each line needs to be corrected to match the image

The wordstr box file can be used for training AFTER you review and correct the text for each line. Currently it has been generated using the existing Tamil traineddata so it will have all errors that you see in recognition. For training you need to correct that text so that it matches the image.

Test with one file, use debug_level -1 to make sure it looks ok. Then apply to all images.

nijanthan0 commented 5 years ago

Ma'am, thank you for your help. But I have one problem.

lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_level -1 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 10000 Loaded file data/checkpoints_checkpoint, unpacking... Successfully restored trainer from data/checkpoints_checkpoint Loaded 46/46 pages (1-46) of document data/ground-truth/out25.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out26.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out20.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out22.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out23.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out27.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out18.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out21.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out29.lstmf Loaded 56/56 pages (1-56) of document data/ground-truth/out8.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out14.lstmf At iteration 10000/10000/10000, Mean rms=5.03%, delta=48.282%, char train=98.223%, word train=97.784%, skip ratio=0%, New worst char error = 98.223 wrote checkpoint.

Finished! Error rate = 98.136

nijanthan0 commented 5 years ago

How do I reduce the error rate?

Shreeshrii commented 5 years ago

If you are running with --debug_level -1 you will have details of every iteration. Usually the error rate will keep going down.

It seems to me that you are training with about 500 lines of text.

Are you getting any errors during training? Run for --max_iterations 200 and look at the console log.
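To check whether the error rate is actually going down, the summary lines in the console log can be pulled out and compared. The following is only a minimal sketch: the regular expression is written against the "At iteration ..." lines as they appear in the logs pasted in this thread, and the function names `char_error_trend` and `is_improving` are made up for illustration.

```python
import re

# Matches the periodic summary lines printed during training, in the form
# seen in this thread's logs (other versions may format them differently).
PROGRESS = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), .*?"
    r"char train=([\d.]+)%, word train=([\d.]+)%"
)


def char_error_trend(log_text):
    """Return (iteration, char_error, word_error) tuples in log order."""
    return [
        (int(m.group(1)), float(m.group(4)), float(m.group(5)))
        for m in PROGRESS.finditer(log_text)
    ]


def is_improving(trend):
    """True when the last char error is lower than the first one."""
    return len(trend) >= 2 and trend[-1][1] < trend[0][1]
```

A character error that stays near or above 100% from the first checkpoint onward usually points at a data or setup problem rather than at too few iterations.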


nijanthan0 commented 5 years ago

lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata data/tam/tam.traineddata --continue_from data/tam/tam.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_level -1 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 200 Loaded file data/tam/tam.lstm, unpacking... Warning: LSTMTrainer deserialized an LSTMRecognizer! Code range changed from 99 to 145! Num (Extended) outputs,weights in Series: 1,36,0,1:1, 0 Num (Extended) outputs,weights in Series: C3,3:9, 0 Ft16:16, 160 Total weights = 160 [C3,3Ft16]:16, 160 Mp3,3:16, 0 Lfys48:48, 12480 Lfx96:96, 55680 Lrx96:96, 74112 Lfx192:192, 221952 Fc145:145, 27985 Total weights = 392369 Previous null char=2 mapped to 144 Continuing from data/tam/tam.lstm Loaded 46/46 pages (1-46) of document data/ground-truth/out25.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out26.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out27.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out22.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out21.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out23.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out18.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out20.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out29.lstmf Loaded 56/56 pages (1-56) of document data/ground-truth/out8.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out14.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out1.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out24.lstmf Loaded 55/55 pages (1-55) of document data/ground-truth/out13.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out16.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out6.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out9.lstmf Loaded 34/34 pages (1-34) of document 
data/ground-truth/out30.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out11.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out12.lstmf Loaded 45/45 pages (1-45) of document data/ground-truth/out10.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out7.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out19.lstmf Loaded 47/47 pages (1-47) of document data/ground-truth/out28.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out17.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out5.lstmf Loaded 57/57 pages (1-57) of document data/ground-truth/out15.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out4.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out3.lstmf At iteration 100/100/100, Mean rms=5.965%, delta=66.111%, char train=154.374%, word train=99.521%, skip ratio=0%, New worst char error = 154.374 wrote checkpoint.

At iteration 200/200/200, Mean rms=6.777%, delta=86.539%, char train=165.383%, word train=99.594%, skip ratio=0%, New worst char error = 165.383 wrote checkpoint.

Finished! Error rate = 100

nijanthan0 commented 5 years ago

I didn't get any error during training.

nijanthan0 commented 5 years ago

I first extracted the text from the images using the "Tamil" tessdata and then corrected the text file. Then, using the text file, I created the box and tif files with text2image. After that I used the "tam" tessdata for the other training steps (such as the unicharset and LSTM training). Is this the cause of the high error rate?

Shreeshrii commented 5 years ago

Your images have English in them. If you want that to be recognized it needs to be in your unicharset.

The tam.traineddata has a limited unicharset. By using that, a larger number of characters have to be added.

Try using Tamil.traineddata for further training and see if that is better.

I am not sure why you are not getting debug msgs on screen.


Shreeshrii commented 5 years ago

Code range changed from 99 to 145!

tam.unicharset has 99 entries; your text has 145 unichars.
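One way to see which characters are responsible for the larger code range is to compare the ground-truth text against the entries of the old unicharset. This is a rough sketch, not a Tesseract tool: it assumes the unicharset is a text file whose first line is the entry count and whose following lines each start with the unichar followed by a space, and it assumes the special entries are named `NULL`, `Joined` and `|Broken|0|1`. The function names are hypothetical.

```python
# Special entries assumed to appear in LSTM unicharsets; they are not
# real characters, so they are skipped when collecting code points.
SPECIAL_ENTRIES = {"NULL", "Joined", "|Broken|0|1"}


def unicharset_codepoints(unicharset_text):
    """Collect every code point that occurs in any unicharset entry."""
    chars = set()
    for line in unicharset_text.splitlines()[1:]:  # first line is the count
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        if token in SPECIAL_ENTRIES:
            continue
        chars.update(token)
    return chars


def missing_chars(unicharset_text, ground_truth):
    """Return ground-truth code points absent from the unicharset."""
    known = unicharset_codepoints(unicharset_text)
    used = {c for c in ground_truth if not c.isspace()}
    return sorted(used - known)
```

The check works on single code points, so it is conservative: anything it reports is certainly missing, but it will not notice a grapheme cluster whose parts each occur somewhere in the unicharset.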


Shreeshrii commented 5 years ago

--debug_interval -1

It is interval, not level. -1 is minus one.


nijanthan0 commented 5 years ago

lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata data/tam/Tamil.traineddata --continue_from data/tam/tam.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_interval -1 --train_listfile data/list.train --max_iterations 200 Loaded file data/checkpoints_checkpoint, unpacking... Code range changed from 117 to 173! Must supply the old traineddata for code conversion! Loaded file data/tam/tam.lstm, unpacking... Warning: LSTMTrainer deserialized an LSTMRecognizer! Code range changed from 99 to 173! Num (Extended) outputs,weights in Series: 1,36,0,1:1, 0 Num (Extended) outputs,weights in Series: C3,3:9, 0 Ft16:16, 160 Total weights = 160 [C3,3Ft16]:16, 160 Mp3,3:16, 0 Lfys48:48, 12480 Lfx96:96, 55680 Lrx96:96, 74112 Lfx192:192, 221952 Fc99:99, 19107 Total weights = 383491 Previous null char=2 mapped to 172 Continuing from data/tam/tam.lstm Loaded 25/25 pages (1-25) of document data/ground-truth/out34.lstmf Loaded 23/23 pages (1-23) of document data/ground-truth/out32.lstmf Loaded 20/20 pages (1-20) of document data/ground-truth/out31.lstmf Loaded 23/23 pages (1-23) of document data/ground-truth/out35.lstmf lstmtraining: ../../src/ccutil/genericvector.h:724: T& GenericVector::operator const [with T = int]: Assertion `index >= 0 && index < sizeused' failed. Aborted (core dumped)

If I use Tamil.traineddata as the old traineddata I get an error. I also used the Tamil.lstm unicharset.

nijanthan0 commented 5 years ago


Yes, thank you. I have only just changed it... 😊

Shreeshrii commented 5 years ago

--old_traineddata data/tam/Tamil.traineddata --continue_from data/tam/tam.lstm

Both need to be in sync.

Use Tamil.traineddata together with Tamil.lstm.
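A small helper can make it harder to mix up the two files by deriving both paths from one base name. This is an illustrative sketch only: `build_finetune_command` is a made-up name, the flags are the ones already used in the commands in this thread, and it assumes the .lstm file was extracted from the traineddata with the same name and sits in the same directory.

```python
from pathlib import Path


def build_finetune_command(base_dir, base_name, new_traineddata,
                           model_output, train_listfile, max_iterations=400):
    """Build an lstmtraining argument list with matching base files.

    --old_traineddata and --continue_from are both derived from
    base_name, so they cannot point at different models by accident.
    """
    base = Path(base_dir)
    return [
        "lstmtraining",
        "--traineddata", str(new_traineddata),
        "--old_traineddata", str(base / f"{base_name}.traineddata"),
        "--continue_from", str(base / f"{base_name}.lstm"),
        "--model_output", str(model_output),
        "--train_listfile", str(train_listfile),
        "--debug_interval", "-1",
        "--max_iterations", str(max_iterations),
    ]
```

The returned list can be passed to `subprocess.run` as is; the paths in any call are placeholders for your own layout.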

nijanthan0 commented 5 years ago

Sorry, I used the fast tessdata version of Tamil.traineddata. I am now using the best tessdata.

nijanthan0 commented 5 years ago

lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata tesseract/tessdata/Tamil.traineddata --continue_from data/Tamil/Tamil.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_interval -1 --train_listfile data/list.train --max_iterations 200 Loaded file data/checkpoints_checkpoint, unpacking... Successfully restored trainer from data/checkpoints_checkpoint Loaded 20/20 pages (1-20) of document data/ground-truth/out31.lstmf Loaded 23/23 pages (1-23) of document data/ground-truth/out35.lstmf Loaded 23/23 pages (1-23) of document data/ground-truth/out32.lstmf Loaded 24/24 pages (1-24) of document data/ground-truth/out33.lstmf At iteration 200/200/200, Mean rms=5.646%, delta=70.988%, char train=138.671%, word train=99.23%, skip ratio=0%, New worst char error = 138.671 wrote checkpoint.

Finished! Error rate = 100

This run also ends with an error rate of 100.

Shreeshrii commented 5 years ago

Try a command similar to the one I used; see https://github.com/tesseract-ocr/tesseract/issues/2395#issuecomment-485419395

nijanthan0 commented 5 years ago

Same error

Shreeshrii commented 5 years ago

What does same error mean?

200 iterations was to test what was going wrong. Now you can train for more iterations.

For impact style fine tuning try 400-600 iterations.

For plus type fine tuning try 3000-3600.

YuTingLiu commented 4 years ago

Is this caused by the x_size parameter not being the same in data generation and training?

stweil commented 3 years ago

Pull request #3251 improves the error message for "Compute CTC target failed" and now shows the lstmf file which is triggering that error. One possible reason for that error is a rotated text line.

wolfassi123 commented 2 years ago

@Shreeshrii I always get this error when I try to train for Arabic. I am adding my own data to the "training_text" file, and it consists of a lot of Arabic numbers and dates. I need to train the model to recognize such dates and numbers, and I'd rather not use different trained data, one for numbers and one for words. Any idea how to solve this issue?

drdmitry commented 2 years ago

I had a similar issue ("Compute CTC targets failed!") when I generated two lstmf files from different types of .box files. One box file was generated with boxes as full-width horizontal lines of text. The other was generated with a box for each individual letter of the text. I had to regenerate the box files (train and eval) using the same --psm parameters, and after that the training went smoothly.
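A quick way to spot this kind of mismatch before generating the lstmf files is to classify each box file by the shape of its lines. The sketch below rests on an assumption about the formats: that line-level (WordStr) box files start each text line with the literal token `WordStr` and mark line ends with a row beginning with a tab, while per-character box files start each row with the character itself. The function names are hypothetical.

```python
def box_file_kind(box_text):
    """Classify box file content as 'wordstr', 'char', 'mixed' or 'empty'."""
    kinds = set()
    for line in box_text.splitlines():
        # Skip blank rows and the tab rows assumed to mark end of line.
        if not line.strip() or line.startswith("\t"):
            continue
        kinds.add("wordstr" if line.startswith("WordStr ") else "char")
    if not kinds:
        return "empty"
    return kinds.pop() if len(kinds) == 1 else "mixed"


def group_by_kind(box_texts):
    """Map each kind to the names of the box files that have it.

    box_texts is a dict of file name to file content.
    """
    groups = {}
    for name, text in box_texts.items():
        groups.setdefault(box_file_kind(text), []).append(name)
    return groups
```

If `group_by_kind` returns more than one key for the train and eval sets together, the box files were probably produced in different ways and are worth regenerating consistently.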

karan00713 commented 1 year ago

Hi, I'm trying lstmtraining for Tamil text and I'm getting the Compute CTC error:

Compute CTC targets failed for /home/user/Aadhar/data/Aadhar-ground-truth/1.lstmf!
(=57 On [0, 2), scores= 1.12(:=59=1.11) 1.13(ஏ=64=1.12), Mean=1.12237, max=1.12645
ஏ=64 On [2, 4), scores= 1.13((=57=1.13) 1.13((=57=1.13), Mean=1.12687, max=1.1281
(=57 On [4, 6), scores= 1.13(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.1351, max=1.13527
ஏ=64 On [6, 8), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12844, max=1.12863
(=57 On [8, 11), scores= 1.14(ஏ=64=1.13) 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13), Mean=1.13606, max=1.13633
Compute CTC targets failed for /home/user/Aadhar/data/Aadhar-ground-truth/2.lstmf!
(=57 On [0, 2), scores= 1.12(:=59=1.11) 1.13(ஏ=64=1.12), Mean=1.12268, max=1.12686
ஏ=64 On [2, 4), scores= 1.12((=57=1.13) 1.13((=57=1.13), Mean=1.12615, max=1.12734
(=57 On [4, 6), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13574, max=1.13588
ஏ=64 On [6, 8), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12798, max=1.12808
(=57 On [8, 10), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13546, max=1.13548
ஹ=66 On [10, 12), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12787, max=1.12802
(=57 On [12, 14), scores= 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13), Mean=1.13633, max=1.13645
Compute CTC targets failed for /home/user/Aadhar/data/Aadhar-ground-truth/3.lstmf!
(=57 On [0, 2), scores= 1.12(:=59=1.11) 1.13(ஏ=64=1.12), Mean=1.12269, max=1.12686
ஏ=64 On [2, 4), scores= 1.13((=57=1.13) 1.13((=57=1.13), Mean=1.12622, max=1.12741
(=57 On [4, 6), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13572, max=1.13585
ஏ=64 On [6, 8), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12804, max=1.12813
(=57 On [8, 10), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13534, max=1.13538
ஹ=66 On [10, 12), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12775, max=1.12785
(=57 On [12, 15), scores= 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13), Mean=1.13621, max=1.13649

MikhailesU commented 1 year ago

Training uses box/tiff pairs to create the lstmf files. If you give incorrect text for an image, all of the training will be wrong. Your box files also contain only incorrect text.

That is, if I want to train Tesseract on text that it cannot see in the image, will it throw this error?

stweil commented 1 year ago

@karan00713, @kiberchert, recent software versions report the line image which caused the message. I suggest visually inspecting such images to check that they are reasonable (no more than a single line, not rotated) and that the line image and the line transcription match.

DesBw commented 1 year ago

This error occurs on my PC when text2image creates empty tif files. The .lstmf files generated from those empty tif files trigger the error. It looks like text2image has a lot of bugs: it created empty box files as well as empty image files.
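Such files can be filtered out before training with a short check. This is a minimal sketch under one assumption: "empty" is taken to mean a file of zero bytes, so a valid but blank image would not be caught. The name `find_empty_files` and the default suffix list are made up for illustration.

```python
from pathlib import Path


def find_empty_files(directory, suffixes=(".tif", ".box", ".lstmf")):
    """Return zero-byte files with the given suffixes, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in suffixes and p.stat().st_size == 0
    )
```

Any path it returns, along with the files generated from it, is a candidate for removal from the training list before lstmtraining is run.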