Open nijanthan0 opened 5 years ago
--old_traineddata tesseract/tessdata/tam.traineddata
Is this file taken from the tessdata_best repo?
lstmtraining --traineddata data/tamtrain/tamtrain.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --net_spec '[Lfx256 O1c111]' --model_output data/checkpoints --learning_rate 20e-4 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 3000
Run your command with --debug_level -1 and share the console output and also the training_text used.
Yes, I am using tessdata_best.
lstmtraining --traineddata data/tamtrain/tamtrain.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --net_spec '[Lfx256 O1c111]' --model_output data/checkpoints --debug_level -1 --learning_rate 20e-4 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 3000
Loaded file data/checkpoints_checkpoint, unpacking...
Successfully restored trainer from data/checkpoints_checkpoint
Loaded 54/54 pages (1-54) of document data/ground-truth/out8.lstmf
Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf
Loaded 8/8 pages (1-8) of document data/ground-truth/out5.lstmf
Loaded 28/28 pages (1-28) of document data/ground-truth/out2.lstmf
Loaded 58/58 pages (1-58) of document data/ground-truth/out6.lstmf
Loaded 57/57 pages (1-57) of document data/ground-truth/out3.lstmf
Loaded 55/55 pages (1-55) of document data/ground-truth/out9.lstmf
Loaded 56/56 pages (1-56) of document data/ground-truth/out4.lstmf
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Encoding of string failed!
Failure bytes: 23 20 2d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff88 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d 20 ffffffe0 ffffffae ffffff89 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb1 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8a ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffff99 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d 20 32 36 20 2d 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 35 Can't encode transcription: 'வயது : 01.01. 2019 ல் # -துணை பட்டியலில் உள்ளவாறு திருத்தப்பட்டுள்ளது மொத்த பக்கங்கள் 26 - பக்கம் 5' in language '' Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! 
Compute CTC targets failed! Compute CTC targets failed! Encoding of string failed! Failure bytes: 23 ffffffc2 ffffffa3 34 30 31 30 20 31 36 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 33 36 20 31 30 31 30 20 31 36 Can't encode transcription: '- 7010 16 வீட்டு எண். % #£4010 16 வீட்டு எண். 36 1010 16' in language '' Compute CTC targets failed! Encoding of string failed! Failure bytes: ffffffe0 ffffffaf ffffff8c ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffae ffffffbf 20 2d Can't encode transcription: 'பெயர்: கீதா - பெயர். கௌரி -' in language '' Encoding of string failed! Failure bytes: 23 30 30 34 30 20 31 38 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 39 20 23 31 30 34 30 20 31 38 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 39 20 31 30 34 30 20 31 38 Can't encode transcription: 'வீட்டு எண். 18 #0040 18 வீட்டு எண். 19 #1040 18 வீட்டு எண். 19 1040 18' in language '' Encoding of string failed! 
Failure bytes: 5c ffffffe0 ffffffaf ffffffa8 20 7c 20 7c 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 3a 20 32 32 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffa9 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 3a ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff86 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 20 26 ffffffe0 ffffffae ffffffb5 20 7c 20 7c 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 3a 20 32 35 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffa9 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 3a ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff86 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 20 32 2f 32 31 31 25 30 31 Can't encode transcription: 'வயது: 37 பாலினம் :ஆண் ஃலிஸ்\௨ | | வயது: 22 பாலினம் :பெண் &வ | | வயது: 25 பாலினம் :பெண் 2/211%01' in language '' Encoding of string failed! Failure bytes: 23 30 34 30 20 31 35 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 37 20 23 30 30 34 30 20 31 36 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 33 38 20 37 30 34 30 20 31 35 Can't encode transcription: 'வீட்டு எண். 2 £#040 15 வீட்டு எண். 17 #0040 16 வீட்டு எண். 138 7040 15' in language '' Encoding of string failed! 
Failure bytes: 23 36 34 30 20 3d 20 7c 20 7c 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 31 34 2d 26 20 31 ffffffc2 ffffffa3 31 30 34 30 20 31 32 Can't encode transcription: 'குப்புசாமிநாயக்கர் - £1640 | | | வீட்டு எண். 14-& #640 = | | வீட்டு எண். 14-& 1£1040 12' in language '' Compute CTC targets failed! Compute CTC targets failed! Encoding of string failed! Failure bytes: 23 20 2d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff88 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d 20 ffffffe0 ffffffae ffffff89 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb1 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8a ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffff99 ffffffe0 
ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d 20 32 36 20 2d 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 36 Can't encode transcription: 'வயது : 01.01. 2019 ல் # -துணை பட்டியலில் உள்ளவாறு திருத்தப்பட்டுள்ளது மொத்த பக்கங்கள் 26 - பக்கம் 6' in language '' Compute CTC targets failed! Encoding of string failed! Failure bytes: 23 20 2d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff88 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffaf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb2 ffffffe0 ffffffaf ffffff8d 20 ffffffe0 ffffffae ffffff89 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffae ffffffbe ffffffe0 ffffffae ffffffb1 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffbf ffffffe0 ffffffae ffffffb0 ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8a ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffffa4 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffff99 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffb3 ffffffe0 ffffffaf 
ffffff8d 20 32 36 20 2d 20 ffffffe0 ffffffae ffffffaa ffffffe0 ffffffae ffffff95 ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff95 ffffffe0 ffffffae ffffffae ffffffe0 ffffffaf ffffff8d 20 39
Can't encode transcription: 'வயது : 01.01. 2019 ல் # -துணை பட்டியலில் உள்ளவாறு திருத்தப்பட்டுள்ளது மொத்த பக்கங்கள் 26 - பக்கம் 9' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Encoding of string failed!
Failure bytes: 23 31 30 34 30 20 31 38 20 ffffffe0 ffffffae ffffffb5 ffffffe0 ffffffaf ffffff80 ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff8d ffffffe0 ffffffae ffffff9f ffffffe0 ffffffaf ffffff81 20 ffffffe0 ffffffae ffffff8e ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff8d 2e 20 33 34 20 31 30 34 30 20 31 36
Can't encode transcription: 'வீட்டு எண். 34 2040 16 வீட்டு எண். 34 #1040 18 வீட்டு எண். 34 1040 16' in language ''
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
This is the output after setting the debug level.
out2.txt out3.txt out4.txt out5.txt out6.txt out7.txt
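The "Failure bytes" dumps in this output are hard to read because each byte above 0x7f is printed sign-extended (0xe0 appears as ffffffe0). A minimal Python sketch for turning such a dump back into text is given here; `decode_failure_bytes` is a hypothetical helper name, not part of Tesseract. Decoding the start of the first dump gives `# -துணை`, which matches the tail of the reported transcription, so the dump appears to begin at the point where encoding stopped.

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn the hex tokens of a 'Failure bytes:' dump back into text.

    Hypothetical helper for reading the log; not a Tesseract API.
    """
    # Tokens such as 'ffffffe0' are sign-extended bytes; masking with 0xFF
    # recovers the original byte value (0xe0).
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # errors="replace" avoids an exception if a dump starts or ends
    # in the middle of a multi-byte UTF-8 character.
    return raw.decode("utf-8", errors="replace")


# First few tokens of the first dump in the log above.
sample = (
    "23 20 2d ffffffe0 ffffffae ffffffa4 ffffffe0 ffffffaf ffffff81 "
    "ffffffe0 ffffffae ffffffa3 ffffffe0 ffffffaf ffffff88"
)
print(decode_failure_bytes(sample))
```

Pasting each dump through this function shows which character the encoder stopped at, which can then be compared with the character set of the traineddata in use.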
How did you create the box files and lstmf files?
Loaded 54/54 pages (1-54) of document data/ground-truth/out8.lstmf
Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf
Loaded 8/8 pages (1-8) of document data/ground-truth/out5.lstmf
Loaded 28/28 pages (1-28) of document data/ground-truth/out2.lstmf
Loaded 58/58 pages (1-58) of document data/ground-truth/out6.lstmf
Loaded 57/57 pages (1-57) of document data/ground-truth/out3.lstmf
Loaded 55/55 pages (1-55) of document data/ground-truth/out9.lstmf
Loaded 56/56 pages (1-56) of document data/ground-truth/out4.lstmf
I created the box files using the lstmbox command and the lstmf files using lstm.train.
What about Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf?
The Impact_Condensed font does not support Tamil, does it?
The problem is related to your input files. Please share the training text, or an image and box file pair.
Ma'am, I did not know what to give as input for --eval_listfile, so I manually created one file in the Impact_Condensed font and put it in the eval list file.
These are my files: ground-truth.zip
Ma'am, I did not know what to give as input for --eval_listfile, so I manually created one file in the Impact_Condensed font and put it in the eval list file.
You have a large number of training files; use one of them for eval (e.g. ocr2).
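One way to act on this suggestion is to hold out one of the existing .lstmf files when writing the two list files. The sketch below is only an illustration in Python: `split_listfile` is a hypothetical helper, and the hold-out count and random seed are assumptions, not values from this thread. The resulting lists would be written one path per line to the files passed as --train_listfile and --eval_listfile (data/list.train and data/list.eval in the command above).

```python
import random


def split_listfile(lstmf_paths, eval_count=1, seed=0):
    """Split .lstmf paths into (train, eval) lists with no overlap.

    Hypothetical helper; eval_count and seed are illustrative choices.
    """
    files = sorted(lstmf_paths)
    if not 0 < eval_count < len(files):
        raise ValueError("eval_count must leave at least one training file")
    # A fixed seed keeps the split reproducible between runs.
    held_out = set(random.Random(seed).sample(files, eval_count))
    train = [path for path in files if path not in held_out]
    evaluation = [path for path in files if path in held_out]
    return train, evaluation


# File names taken from the "Loaded ..." lines in the log.
names = [f"data/ground-truth/out{i}.lstmf" for i in (2, 3, 4, 5, 6, 8, 9)]
train_files, eval_files = split_listfile(names)
print("train:", *train_files, sep="\n  ")
print("eval:", *eval_files, sep="\n  ")
```

Holding out a file from the same source as the training pages means the eval error is measured on the kind of image the model is actually being tuned for.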
I am wondering whether "Compute CTC targets failed!" is related to the Impact_Condensed eval file.
I will test further with all the files you sent and get back.
No ma'am, "Compute CTC targets failed!" is not related to the Impact_Condensed eval file.
These are my files: ground-truth.zip
The zip file has the OCRed text for the images. The ground truth needs to be the correct transcription for the images.
But I am not using the text files in the training process.
Training uses box/tiff pairs for creating the lstmf files. If you give the wrong text for an image, then all of the training will be wrong. Your box files contain the same incorrect text.
I tested by using the wordstrbox (without correcting the text).
lstmtraining \
--model_output build/poll \
--continue_from ~/tessdata_best/script/Tamil.lstm \
--traineddata ~/tessdata_best/script/Tamil.traineddata \
--train_listfile build/tam.poll.training_files.txt \
--debug_interval -1
Loaded file /home/ubuntu/tessdata_best/script/Tamil.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from /home/ubuntu/tessdata_best/script/Tamil.lstm
Loaded 54/54 lines (1-54) of document input/out10.lstmf
Loaded 53/53 lines (1-53) of document input/out11.lstmf
Loaded 8/8 lines (1-8) of document input/out5.lstmf
Loaded 57/57 lines (1-57) of document input/out7.lstmf
Loaded 57/57 lines (1-57) of document input/out3.lstmf
Loaded 54/54 lines (1-54) of document input/out8.lstmf
Loaded 56/56 lines (1-56) of document input/out4.lstmf
Loaded 58/58 lines (1-58) of document input/out6.lstmf
Loaded 55/55 lines (1-55) of document input/out9.lstmf
Iteration 0: GROUND TRUTH : பெயர்: மோகனா - பெயர்: மாதவன் - பெயர்: தமிழ்ச்செல்வி -
Iteration 0: ALIGNED TRUTH : பெயர்: மோகனா - பெயர்: மாதவன் - பெயர்: தமிழ்ச்செல்வி -
Iteration 0: BEST OCR TEXT : பெயர்: மோகனா - F|பெயர்: மாதவன்- .
|-|பெயர்: தமிழ்ச்செல்வி -
File input/out10.lstmf line 0 : Mean rms=2.079%, delta=6.897%, train=11.111%(44.444%), skip ratio=0%
Iteration 1: GROUND TRUTH : சட்டமன்றத் தொகுதி எண் மற்றும் பெயர் : 36-உத்திரமேரூர் பாகம் எண்: 1
Iteration 1: ALIGNED TRUTH : சட்டமன்றத் தொகுதி எண் மற்றும் பெயர் : 36-உத்திரமேரூர் பாகம் எண்:
Iteration 1: BEST OCR TEXT : சட்டமன்றத் தொகுதி எண் மற்றும் பெயர் : %-உத்திரமேரர் .......எபாகம் எண்: 1
File input/out11.lstmf line 0 : Mean rms=2.03%, delta=4.742%, train=15.257%(32.222%), skip ratio=0%
Iteration 2: GROUND TRUTH : பெயர்: கிருட்டினன் - பெயர்: வேதவல்லி - பெயர்: குப்பன் -
Iteration 2: ALIGNED TRUTH : பெயர்: கிருட்டினன் - பெயர்: வேதவல்லி - பெயர்: குப்பன் -
Iteration 2: BEST OCR TEXT : பெயர்: கிருட்டினன்- [பெயர்: வேதவல்லி- [பெயர்: குப்பன் -
File input/out3.lstmf line 0 : Mean rms=1.848%, delta=4.556%, train=13.148%(43.704%), skip ratio=0%
Iteration 3: GROUND TRUTH : கணவர் பெயர்: முருகன் - தந்தை பெயர்: காசி - தந்தை பெயர்: இராமன் -
Iteration 3: ALIGNED TRUTH : கணவர் பெயர்: முருகன் - தந்தை பெயர்: காசி - தந்தை பெயர்: இராமன் -
Iteration 3: BEST OCR TEXT : கணவர் பெயர்: முருகன்- | ந|[தந்தைபெயர் காச- [ ந|தந்தை பெயர்: இராமன் -
File input/out4.lstmf line 0 : Mean rms=1.916%, delta=4.98%, train=13.707%(47.361%), skip ratio=0%
Iteration 4: GROUND TRUTH : பெயர்: கீதா - பெயர்: கெளரி -
Iteration 4: ALIGNED TRUTH : பெயர்: கீதா - பெயர்: கெளரி -
Iteration 4: BEST OCR TEXT : பெயர்.
கதா - |[பெயர்: கெளரி -
File input/out5.lstmf line 0 : Mean rms=1.894%, delta=4.984%, train=15.103%(47.889%), skip ratio=0%
Iteration 5: GROUND TRUTH : தந்த பெயர்: குட்டியப்பன் - கணவர் பெயர்: முனுசாமி - தந்த பெயர்: கன்னியப்பன் -
Iteration 5: ALIGNED TRUTH : தந்த பெயர்: குட்டியப்பன் - கணவர் பெயர்: முனுசாமி - தந்த பெயர்: கன்னியப்பன் -
Iteration 5: BEST OCR TEXT : தந்த பெயர்: குட்டியப்பன்- | |[|கணவர் பெயர்: முனுசாமி- | |[[தந்தை பெயர்: கன்னியப்பன் -
File input/out6.lstmf line 0 : Mean rms=1.904%, delta=5.48%, train=14.751%(48.241%), skip ratio=0%
Iteration 6: GROUND TRUTH : வயது: ஏ பாலினம் :ஆண் வயது: % பாலினம் ஆண் வயது: 4 பாலினம் :பெண்
Iteration 6: BEST OCR TEXT : வயது: ஏ பாலினம் :ஆண் | Aிleble ||வயது: % பாலினம் :ஆண் | rileble [வயது: & பாலினம் :-பெண்
File input/out7.lstmf line 0 : Mean rms=2.04%, delta=7.057%, train=18.539%(47.302%), skip ratio=0%
Iteration 7: GROUND TRUTH : வயது: ஏ பாலினம் :ஆண் Available ||வயது: 22 பாலினம் பெண் Available ||வயது: 25 பாலினம் பெண் Available
Iteration 7: ALIGNED TRUTH : வயது: ஏ பாலினம் :ஆண் Available ||வயது: 22 பாலினம் பெண் Avllable ||வயது: 25 பாலினம் பெண் Available
Iteration 7: BEST OCR TEXT : வயது: ஏ பாலினம் :ஆண் | இவிஷ்ீ |[வயது: 2 பாலினம் பெண் | விஷ்ச [[வயது: 25 பாலினம் -பெண் | vailable
File input/out8.lstmf line 0 : Mean rms=2.089%, delta=7.843%, train=20.893%(47.222%), skip ratio=0%
Iteration 8: GROUND TRUTH : TRQO0226621 TN/O5/026/0393067 TN/O5/026/0393295
Iteration 8: ALIGNED TRUTH : TRQO0226621 TN/O5/026/0393067 TN/O5/026/0393295
Iteration 8: BEST OCR TEXT : TRQ0226621|[ V TNOSIO260393067 [ TNO5I026/0393295
File input/out9.lstmf line 0 : Mean rms=2.099%, delta=8.025%, train=22.507%(53.086%), skip ratio=0%
Iteration 9: GROUND TRUTH : வீட்டு எண்: 41 Photo is வீட்டு எண்: 41 Photo is வீட்டு எண்: 41 Photo is
Iteration 9: ALIGNED TRUTH : வீட்டு எண்: 41 Photo is வீட்டு எண்: 441 Photo is வீட்டு எண்: 41 Photo is
Iteration 9: BEST OCR TEXT : வீட்டுஎண்4 | Photois |வீட்டுஎண்்4 | Photois |வீட்டுஎண்41 | Photos
File input/out10.lstmf line 1 : Mean rms=2.091%, delta=8.12%, train=22.618%(57.778%), skip ratio=0%
Iteration 10: GROUND TRUTH : தந்த பெயர்: சின்னபையன் - தந்த பெயர்: சின்னபையன் - கணவர் பெயர்: சங்கர் -
Iteration 10: ALIGNED TRUTH : தந்த பெயர்: சின்னபையன் - தந்த பெயர்: சின்னபையன் - கணவர் பெயர்: சங்கர் -
Iteration 10: BEST OCR TEXT : தந்த பெயர்: சின்னபையன்- | |தந்தை பெயர்: சின்னபையன் - | [கணவர் பெயர்: சங்கர் -
File input/out11.lstmf line 1 : Mean rms=2.046%, delta=7.87%, train=21.193%(55.556%), skip ratio=0%
Iteration 11: GROUND TRUTH : - Photo is வீட்டு எண்: 4 Photo is வீட்டு எண்: 4 Photo is
Iteration 11: ALIGNED TRUTH : ------------- Photo is வீட்டு எண்: 4 Photo is வீட்டு எண்: 4 444 Photo is
Iteration 11: BEST OCR TEXT : D [ Photois வீட்டுஎண்:4 Photois |[வீட்டுிஎண்:4: | Photois
This zip file has box files for your images in wordstr format. The text for each line needs to be corrected to match the image. Then you can use these box files with your images to create the lstmf files and then use them for lstmtraining.
However, some errors may be due to incorrect layout analysis, and more training will not fix those.
You need to use some other method (OpenCV, uzn files, etc.) to mark the areas and then recognize them separately.
Anyway, in all my testing I didn't get the error Compute CTC targets failed!
Can we directly use the wordstr box file for training?
The text for each line needs to be corrected to match the image
The wordstr box file can be used for training AFTER you review and correct the text for each line. Currently it has been generated using the existing Tamil traineddata so it will have all errors that you see in recognition. For training you need to correct that text so that it matches the image.
Test with one file, use debug_level -1 to make sure it looks ok. Then apply to all images.
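For the review step, it can help to print only the transcription part of a wordstr box file and read it next to the image. This is a minimal sketch, assuming each text line in the box file starts with `WordStr`, followed by the coordinates and then `#` and the text; the helper name `wordstr_text` and the path `data/ground-truth/out1.box` are hypothetical.

```shell
# Print the transcription of every WordStr line in a box file.
# Assumes the layout "WordStr <left> <bottom> <right> <top> <page> #<text>".
wordstr_text() {
  # Strip everything up to and including the first '#'; other lines are skipped.
  sed -n 's/^WordStr [^#]*#//p' "$1"
}

box=data/ground-truth/out1.box   # hypothetical path
if [ -f "$box" ]; then
  wordstr_text "$box"
fi
```

Each printed line should match one text line of the image exactly; anything that differs needs to be fixed in the box file before the lstmf file is created.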
Ma'am, thank you for your help. But I have one problem.
lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_level -1 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 10000 Loaded file data/checkpoints_checkpoint, unpacking... Successfully restored trainer from data/checkpoints_checkpoint Loaded 46/46 pages (1-46) of document data/ground-truth/out25.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out26.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out20.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out22.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out23.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out27.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out18.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out21.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out29.lstmf Loaded 56/56 pages (1-56) of document data/ground-truth/out8.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out14.lstmf At iteration 10000/10000/10000, Mean rms=5.03%, delta=48.282%, char train=98.223%, word train=97.784%, skip ratio=0%, New worst char error = 98.223 wrote checkpoint.
Finished! Error rate = 98.136
How do I reduce the error rate?
If you are running with --debug_level -1 you will have details of every iteration. Usually the error rate will keep going down.
It seems to me that you are training with about 500 lines of text.
Are you getting any errors during training? Run for --max_iterations 200 and look at the console log.
--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata data/tam/tam.traineddata --continue_from data/tam/tam.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_level -1 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 200 Loaded file data/tam/tam.lstm, unpacking... Warning: LSTMTrainer deserialized an LSTMRecognizer! Code range changed from 99 to 145! Num (Extended) outputs,weights in Series: 1,36,0,1:1, 0 Num (Extended) outputs,weights in Series: C3,3:9, 0 Ft16:16, 160 Total weights = 160 [C3,3Ft16]:16, 160 Mp3,3:16, 0 Lfys48:48, 12480 Lfx96:96, 55680 Lrx96:96, 74112 Lfx192:192, 221952 Fc145:145, 27985 Total weights = 392369 Previous null char=2 mapped to 144 Continuing from data/tam/tam.lstm Loaded 46/46 pages (1-46) of document data/ground-truth/out25.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out26.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out27.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out22.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out21.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out23.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out18.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out20.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out29.lstmf Loaded 56/56 pages (1-56) of document data/ground-truth/out8.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out14.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out1.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out24.lstmf Loaded 55/55 pages (1-55) of document data/ground-truth/out13.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out16.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out6.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out9.lstmf Loaded 34/34 pages (1-34) of document 
data/ground-truth/out30.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out11.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out12.lstmf Loaded 45/45 pages (1-45) of document data/ground-truth/out10.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out7.lstmf Loaded 35/35 pages (1-35) of document data/ground-truth/out19.lstmf Loaded 47/47 pages (1-47) of document data/ground-truth/out28.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out17.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out5.lstmf Loaded 57/57 pages (1-57) of document data/ground-truth/out15.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out4.lstmf Loaded 34/34 pages (1-34) of document data/ground-truth/out3.lstmf At iteration 100/100/100, Mean rms=5.965%, delta=66.111%, char train=154.374%, word train=99.521%, skip ratio=0%, New worst char error = 154.374 wrote checkpoint.
At iteration 200/200/200, Mean rms=6.777%, delta=86.539%, char train=165.383%, word train=99.594%, skip ratio=0%, New worst char error = 165.383 wrote checkpoint.
Finished! Error rate = 100
I didn't get any error during training.
I first extracted the text from the images using the "Tamil" tessdata and then corrected the text file. Then, using the text file, I created the box and tif files with text2image. After that I used the "tam" tessdata for the rest of the training (unicharset, lstm training). Is this the cause of the high error rate?
Your images have English in them. If you want that to be recognized it needs to be in your unicharset.
The tam.traineddata has a limited unicharset. By using that, a larger number of characters have to be added.
Try using Tamil.traineddata for further training and see if that is better.
I am not sure why you are not getting debug msgs on screen.
Code range changed from 99 to 145!
tam.unicharset has 99 entries; your text has 145 unichars.
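One way to see the two sizes side by side is to read the count stored on the first line of each unicharset file. This is only a sketch: it assumes the components were unpacked first (for example with `combine_tessdata -u`), the helper name `unicharset_size` and both paths are hypothetical, and the code range printed by lstmtraining may not be exactly this number.

```shell
# Print the entry count stored on the first line of a unicharset file.
unicharset_size() {
  head -n 1 "$1"
}

# Hypothetical paths, assuming something like
#   combine_tessdata -u tesseract/tessdata/tam.traineddata data/tam/tam.
# was run for both the old and the new traineddata.
old=data/tam/tam.lstm-unicharset
new=data/tamiltest/tamiltest.lstm-unicharset
if [ -f "$old" ] && [ -f "$new" ]; then
  echo "old: $(unicharset_size "$old")  new: $(unicharset_size "$new")"
fi
```

A large gap between the two numbers means many characters have to be added to the starting model, which is the situation described above.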
--debug_interval -1
It is interval, not level. -1 is minus one.
lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata data/tam/Tamil.traineddata --continue_from data/tam/tam.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_interval -1 --train_listfile data/list.train --max_iterations 200
Loaded file data/checkpoints_checkpoint, unpacking...
Code range changed from 117 to 173!
Must supply the old traineddata for code conversion!
Loaded file data/tam/tam.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 99 to 173!
Num (Extended) outputs,weights in Series:
1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
C3,3:9, 0
Ft16:16, 160
Total weights = 160
[C3,3Ft16]:16, 160
Mp3,3:16, 0
Lfys48:48, 12480
Lfx96:96, 55680
Lrx96:96, 74112
Lfx192:192, 221952
Fc99:99, 19107
Total weights = 383491
Previous null char=2 mapped to 172
Continuing from data/tam/tam.lstm
Loaded 25/25 pages (1-25) of document data/ground-truth/out34.lstmf
Loaded 23/23 pages (1-23) of document data/ground-truth/out32.lstmf
Loaded 20/20 pages (1-20) of document data/ground-truth/out31.lstmf
Loaded 23/23 pages (1-23) of document data/ground-truth/out35.lstmf
lstmtraining: ../../src/ccutil/genericvector.h:724: T& GenericVector
"If I use Tamil.traineddata as the old traineddata I get an error; I also used the Tamil.lstm unicharset."
Yes, thank you. I have only now changed it 😊
--old_traineddata data/tam/Tamil.traineddata --continue_from data/tam/tam.lstm
Both need to be in sync.
Tamil.traineddata Tamil.lstm
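A sketch of keeping the two in sync: extract the LSTM model from the same traineddata file that is passed as --old_traineddata, so both flags point at one model. The helper name `matching_lstm` is hypothetical, and writing the .lstm file next to the traineddata is just an assumption for illustration.

```shell
# Derive the .lstm file name that belongs to a given .traineddata file.
matching_lstm() {
  printf '%s\n' "${1%.traineddata}.lstm"
}

old=tesseract/tessdata/Tamil.traineddata   # best model, path as used in this thread
lstm=$(matching_lstm "$old")

# Extract the LSTM component only if the tool and the file are available.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f "$old" ]; then
  combine_tessdata -e "$old" "$lstm"
fi

# These two flags should then always be used as a pair.
echo "--old_traineddata $old --continue_from $lstm"
```

With this, mixing Tamil.traineddata and tam.lstm cannot happen by accident, because the second path is always derived from the first.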
Sorry, I had used the fast tessdata version of Tamil.traineddata. Only now am I using the best tessdata.
lstmtraining --traineddata data/tamiltest/tamiltest.traineddata --old_traineddata tesseract/tessdata/Tamil.traineddata --continue_from data/Tamil/Tamil.lstm --perfect_sample_delay 0 --target_error_rate 0.01 --model_output data/checkpoints --debug_interval -1 --train_listfile data/list.train --max_iterations 200 Loaded file data/checkpoints_checkpoint, unpacking... Successfully restored trainer from data/checkpoints_checkpoint Loaded 20/20 pages (1-20) of document data/ground-truth/out31.lstmf Loaded 23/23 pages (1-23) of document data/ground-truth/out35.lstmf Loaded 23/23 pages (1-23) of document data/ground-truth/out32.lstmf Loaded 24/24 pages (1-24) of document data/ground-truth/out33.lstmf At iteration 200/200/200, Mean rms=5.646%, delta=70.988%, char train=138.671%, word train=99.23%, skip ratio=0%, New worst char error = 138.671 wrote checkpoint.
Finished! Error rate = 100
"This also gives a 100 error rate."
Try with a command similar to what I used, see https://github.com/tesseract-ocr/tesseract/issues/2395#issuecomment-485419395
Same error
What does same error mean?
200 iterations was to test what was going wrong. Now you can train for more iterations.
For impact style fine tuning try 400-600 iterations.
For plus type fine tuning try 3000-3600.
Is this caused by the x_size parameter not being the same in data generation and training?
Pull request #3251 improves the error message for "Compute CTC target failed" and now shows the lstmf file which is triggering that error. One possible reason for that error is a rotated text line.
@Shreeshrii I always get this error when I try to train for Arabic. I am adding my own data to the "training_text" file, and it consists of a lot of Arabic numbers and dates. I need to train the model to recognize such dates and numbers, and I'd rather not use different trained data, one for numbers and one for words. Any idea how to solve this issue?
I had a similar issue ("Compute CTC targets failed!") when I generated two lstmf files from a different type of .box files. One box file was generated with boxes as full-width horizontal lines of text. Another box file was generated with boxes for each particular letter of the text. I had to regenerate box files (train and eval) using the same type of --psm parameters, and after that, the training went smoothly.
Hi i'm trying lstmtraining for tamil text, i'm facing compute ctc error Compute CTC targets failed for /home/user/Aadhar/data/Aadhar-ground-truth/1.lstmf! (=57 On [0, 2), scores= 1.12(:=59=1.11) 1.13(ஏ=64=1.12), Mean=1.12237, max=1.12645 ஏ=64 On [2, 4), scores= 1.13((=57=1.13) 1.13((=57=1.13), Mean=1.12687, max=1.1281 (=57 On [4, 6), scores= 1.13(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.1351, max=1.13527 ஏ=64 On [6, 8), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12844, max=1.12863 (=57 On [8, 11), scores= 1.14(ஏ=64=1.13) 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13), Mean=1.13606, max=1.13633 Compute CTC targets failed for /home/user/Aadhar/data/Aadhar-ground-truth/2.lstmf! (=57 On [0, 2), scores= 1.12(:=59=1.11) 1.13(ஏ=64=1.12), Mean=1.12268, max=1.12686 ஏ=64 On [2, 4), scores= 1.12((=57=1.13) 1.13((=57=1.13), Mean=1.12615, max=1.12734 (=57 On [4, 6), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13574, max=1.13588 ஏ=64 On [6, 8), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12798, max=1.12808 (=57 On [8, 10), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13546, max=1.13548 ஹ=66 On [10, 12), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12787, max=1.12802 (=57 On [12, 14), scores= 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13), Mean=1.13633, max=1.13645 Compute CTC targets failed for /home/user/Aadhar/data/Aadhar-ground-truth/3.lstmf! (=57 On [0, 2), scores= 1.12(:=59=1.11) 1.13(ஏ=64=1.12), Mean=1.12269, max=1.12686 ஏ=64 On [2, 4), scores= 1.13((=57=1.13) 1.13((=57=1.13), Mean=1.12622, max=1.12741 (=57 On [4, 6), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13572, max=1.13585 ஏ=64 On [6, 8), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12804, max=1.12813 (=57 On [8, 10), scores= 1.14(ஏ=64=1.13) 1.14(ஏ=64=1.13), Mean=1.13534, max=1.13538 ஹ=66 On [10, 12), scores= 1.13((=57=1.14) 1.13((=57=1.14), Mean=1.12775, max=1.12785 (=57 On [12, 15), scores= 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13) 1.14(ஹ=66=1.13), Mean=1.13621, max=1.13649
Training uses box/tiff pairs to create the lstmf files. If you give the wrong text for an image, the whole training will be wrong. Your box files also contain only incorrect text.
That is, if I want to train Tesseract on text that it cannot see in the image, it will throw this error?
@karan00713, @kiberchert, recent software versions report the line image which caused the message. I suggest visually inspecting such images to check that they are reasonable (no more than a single line, not rotated) and that the line image and the line transcription match.
This error occurs on my pc when text2image created empty tif files. The .lstmf files created out of those empty tif files trigger the error. It looks like text2image has a lot of bugs. It created empty box files, as well as empty image files.
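A quick check for this case, as a sketch: list zero-byte tif, box and lstmf files before starting lstmtraining. The helper name `find_empty_inputs` is hypothetical, and it only catches files with no bytes at all; a tif that is a valid but blank image would not be found this way.

```shell
# List zero-byte .tif, .box and .lstmf files below a directory.
find_empty_inputs() {
  find "$1" -type f \( -name '*.tif' -o -name '*.box' -o -name '*.lstmf' \) -size 0
}

dir=data/ground-truth   # directory name as used in the logs of this thread
if [ -d "$dir" ]; then
  find_empty_inputs "$dir"
fi
```

Any file it prints should be regenerated, or removed from list.train and list.eval, before training.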
Environment
Current Behavior:
Expected Behavior:
Suggested Fix:
surasystem@surasystem:~$ lstmtraining --traineddata data/tamtrain/tamtrain.traineddata --old_traineddata tesseract/tessdata/tam.traineddata --continue_from data/tam/tam.lstm --net_spec '[Lfx256 O1c111]' --model_output data/checkpoints --learning_rate 20e-4 --train_listfile data/list.train --eval_listfile data/list.eval --max_iterations 3000 Loaded file data/tam/tam.lstm, unpacking... Warning: LSTMTrainer deserialized an LSTMRecognizer! Code range changed from 99 to 104! Num (Extended) outputs,weights in Series: 1,36,0,1:1, 0 Num (Extended) outputs,weights in Series: C3,3:9, 0 Ft16:16, 160 Total weights = 160 [C3,3Ft16]:16, 160 Mp3,3:16, 0 Lfys48:48, 12480 Lfx96:96, 55680 Lrx96:96, 74112 Lfx192:192, 221952 Fc104:104, 20072 Total weights = 384456 Previous null char=2 mapped to 103 Continuing from data/tam/tam.lstm Loaded 54/54 pages (1-54) of document data/ground-truth/out8.lstmf Loaded 57/57 pages (1-57) of document data/ground-truth/tam.TAMu_Kadambri.exp0.lstmf Loaded 20/20 pages (1-20) of document data/ground-truth/tam.Impact_Condensed.exp0.lstmf Loaded 8/8 pages (1-8) of document data/ground-truth/out5.lstmf Loaded 28/28 pages (1-28) of document data/ground-truth/out2.lstmf Loaded 57/57 pages (1-57) of document data/ground-truth/out3.lstmf Loaded 56/56 pages (1-56) of document data/ground-truth/out4.lstmf Loaded 55/55 pages (1-55) of document data/ground-truth/out9.lstmf Loaded 58/58 pages (1-58) of document data/ground-truth/out6.lstmf Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed! Compute CTC targets failed!
What do I need to do to overcome this issue?