Open: Shreeshrii opened this issue 7 years ago
Still getting the errors with the following version -
tesseract -v
tesseract 4.00.00alpha-219-gc124f87
leptonica-1.74
libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
Can't encode transcription: सगुनल उठैलका देउता नेउता लवरना लोहमान कुदार
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb9 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff85 ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: बिसहरी सड़िया हड़िया लादना अधसेरी सुबुकना
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbf 20 ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb6 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff87 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 20 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffff81
Can't encode transcription: चूड़ियन बुद्धि गुप्ता शासनमे सुद्धा जँतसार निगुनियाँ
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff87 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff82 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9b ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9f ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: दौड़इलूँ पोथा बोथा मोथा स्वेच्छासँ पार्टी लड़कियन
Also seen in a finetune of Arabic:
lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm --train_listfile ~/tesstutorial/ara/ara.training_files.txt --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt --target_error_rate 0.0001
Loaded file /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 4/4 pages (1-4) of document /home/shree/tesstutorial/aratest/ara.Times_New_Roman.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff8a ffffffd9 ffffff82 ffffffd9 ffffff90 ffffffd8 ffffffaf ffffffd9 ffffff90 ffffffd8 ffffffa7 ffffffd8 ffffffb5 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffaa ffffffd9 ffffff8f ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd8 ffffffa5 ffffffd9 ffffff90 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff84 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff90 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd8 ffffffaf ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff85 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f ffffffd8 ffffffa1 ffffffd9 ffffff8e ffffffd8 ffffffa7 ffffffd8 ffffffaf ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff8e ffffffd8 ffffffb4 ffffffd9 ffffff8f
Can't encode transcription: نَيقِدِاصَ مْتُنْكُ نْإِ اللَّهِ نِودُ نْمِ مْكُءَادَهَشُ
Loaded 231/231 pages (1-231) of document /home/shree/tesstutorial/ara/ara.Arial_Unicode_MS.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd8 ffffffb1 ffffffd9 ffffff8f ffffffd8 ffffffb5 ffffffd9 ffffff90 ffffffd8 ffffffa8 ffffffd9 ffffff92 ffffffd9 ffffff8a ffffffd9 ffffff8f 20 ffffffd9 ffffff84 ffffffd9 ffffff8e ffffffd8 ffffffa7 20 ffffffd8 ffffffaa ffffffd9 ffffff8d ffffffd8 ffffffa7 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff8f ffffffd8 ffffffb8 ffffffd9 ffffff8f 20 ffffffd9 ffffff8a ffffffd9 ffffff81 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff8f ffffffd9 ffffff83 ffffffd9 ffffff8e ffffffd8 ffffffb1 ffffffd9 ffffff8e ffffffd8 ffffffaa ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd9 ffffff86 ffffffd9 ffffff8f ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: نَورُصِبْيُ لَا تٍامَلُظُ يفِ مْهُكَرَتَوَ مْهِرِونُبِ
Encoding of string failed! Failure bytes: ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90
See new section in trainingtesseract-4.00
The wiki does not seem to have this section:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
TrainingTesseract 4.00 Stefan Weil edited this page 28 days ago · 9 revisions
We are having a GitHub outage in India just now; I am not sure whether this is related to that or the wiki update is still on the todo list.
It is working correctly in Spain. Thank you all for the incredible amount of work that you have done.
I don't see the changes either.
The wiki can be cloned as a git repo. Ray probably did some edits locally, but didn't 'push' them yet.
Changes are pushed now. I got called away yesterday before I was able to do it.
Encoding of string failed! Failure bytes: 9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e
Can't encode transcription: Møller. 1200Emilie,hansKone.
when trying to train frk
The tab character (9) at the beginning of the list of failure bytes is a dead giveaway.
@Shreeshrii Is this issue resolved? I'm getting the same error when training with the Telugu language.
Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training
"Encoding of string failed!" results when the text string for a training image cannot be encoded using the given unicharset.
Possible causes are:
- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.
- A stray unprintable character (like tab or a control character) in the text.
- There is an un-represented Indic grapheme/aksara in the text.
In any case it will result in that training image being ignored by the trainer.
If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
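The "Failure bytes" in these messages are the raw UTF-8 bytes of the part of the line that could not be encoded, printed as sign-extended integers (hence the ffffff prefix on every byte above 0x7f). A small sketch, not part of Tesseract, that turns such a dump back into readable text so you can see exactly which characters the unicharset refused:

# Illustration only: decode a "Failure bytes" dump back into the offending text.
def decode_failure_bytes(dump: str) -> str:
    # Each token is one byte; bytes >= 0x80 are printed sign-extended ("ffffffe0" -> 0xe0).
    data = bytes(int(tok, 16) & 0xFF for tok in dump.split())
    return data.decode("utf-8", errors="replace")

# Example, taken from the first Devanagari report above:
print(decode_failure_bytes("ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf"))
# prints "ड़ि", i.e. the failure starts at the precomposed ड़ (U+095C)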
@harinath141 If you are getting a lot of these errors during finetuning, try replace-top-layer training instead. You can use the box/tiff pairs generated for the finetune. The commands will be similar to the following:
mkdir -p ~/tesstutorial/tellayer_from_tel
combine_tessdata -e ../tessdata/tel.traineddata \
~/tesstutorial/tellayer_from_tel/tel.lstm
lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
--script_dir ../langdata --debug_interval 0 \
--continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
--append_index 5 --net_spec '[Lfx256 O1c105]' \
--model_output ~/tesstutorial/tellayer_from_tel/tellayer \
--train_listfile ~/tesstutorial/tel/tel.training_files.txt \
--target_error_rate 0.01
~/tesstutorial/tel/ should have your .lstmf files.
Thank you @Shreeshrii, I'll try replacing the top layer.
@harinath141
When you use --debug_interval 0, you will see messages like the following every 100 iterations:
At iteration 45909/58500/58569, Mean rms=0.639%, delta=0.621%, char train=1.861%, word train=13.302%, skip ratio=0%, wrote checkpoint.
At iteration 45960/58600/58669, Mean rms=0.64%, delta=0.616%, char train=1.844%, word train=12.933%, skip ratio=0%, wrote checkpoint.
2 Percent improvement time=14052, best error was 3.697 @ 31958
At iteration 46010/58700/58769, Mean rms=0.634%, delta=0.561%, char train=1.686%, word train=12.343%, skip ratio=0%, New best char error = 1.686 wrote best model:/home/shree/tesstutorial/khmlayer1_from_khm/khm1.686_46010.lstm wrote checkpoint.
When you use --debug_interval -1, messages such as the following will be shown for every iteration:
Iteration 59400: ALIGNED TRUTH : មានរូបឆ្មាំ អេស៊ីលីដា
Iteration 59400: BEST OCR TEXT : មានរូបឆ្មាំ អេស៊ីលីដា
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer_Bold.exp0.lstmf page 53 (Perfect):
Mean rms=0.646%, delta=0.553%, train=1.878%(13.168%), skip ratio=0.1%
Iteration 59401: ALIGNED TRUTH : ឆ្កៀលយកភ្នែក ជួនឆ្លងវគ្គ ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
Iteration 59401: BEST OCR TEXT : ឆ្លៀលយកភ្នែក ជួនឆ្លងវគត ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer.exp0.lstmf page 1 :
Mean rms=0.647%, delta=0.555%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59402: ALIGNED TRUTH : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
Iteration 59402: BEST OCR TEXT : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI_Bold.exp0.lstmf page 56 :
Mean rms=0.647%, delta=0.556%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59403: ALIGNED TRUTH : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍់៖តំបន់ខ្លះ ផ្ទះសម្បែង
Iteration 59403: BEST OCR TEXT : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍៖តំបន់ខ្លះ ផ្ទះសម្បែង
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI.exp0.lstmf page 51 :
Intermediate checkpoint and .lstm files will be written to the output directory, e.g. ~/tesstutorial/tellayer_from_tel. You can also see visual debugging output with ScrollView.
@theraysmith
I am still getting this error for a new replace-top-layer training for Devanagari script, where the eval_listfile is based on a different training text, e.g.:
Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff88 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa5 ffffff8b 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa5 ffffff80 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: वैशाख साल देखि साथै यो साँच्चैको जीवन
Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: रूपांतरित जैबुन्निसा केंद्रित छँदा
While each Unicode character (स ा ँ) is present in the Devanagari unicharset, the combined akshara (साँ, छँ) is not part of the training text/unicharset, but it does occur in the eval text/unicharset.
The training unicharset is of the following format:
3784
NULL 0 NULL 0
Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,69,186,255,892,2138,0,80,892,2058 Common 3625 10 3625 |Broken|0|1 # Broken
र्ध्रु 1 0,64,61,197,280,356,0,0,280,356 Devanagari 18 0 18 र्ध्रु # र्ध्रु [930 94d 927 94d 930 941 ]x
र्बृ 1 3,64,61,197,181,236,0,0,181,236 Devanagari 18 0 18 र्बृ # र्बृ [930 94d 92c 943 ]x
श्चु 1 0,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्चु # श्चु [936 94d 91a 941 ]x
श्चौ 1 3,65,61,255,294,367,0,12,294,355 Devanagari 240 0 240 श्चौ # श्चौ [936 94d 91a 94c ]x
श्च् 1 3,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्च् # श्च् [936 94d 91a 94d ]x
य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 8 0 8 य # य [92f ]x
श्रीः 1 3,74,61,253,295,412,0,12,295,400 Devanagari 240 0 240 श्रीः # श्रीः [936 94d 930 940 903 ]x
ष्ठु 1 0,75,61,197,204,243,0,0,204,243 Devanagari 241 0 241 ष्ठु # ष्ठु [937 94d 920 941 ]x
ष्ठौ 1 3,75,61,255,247,307,0,0,247,307 Devanagari 241 0 241 ष्ठौ # ष्ठौ [937 94d 920 94c ]x
स्रैः 1 3,76,61,255,243,449,0,0,243,449 Devanagari 280 0 280 स्रैः # स्रैः [938 94d 930 948 903 ]x
...
Does this mean that the training text needs to be expanded to include all possible akshara combinations?
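Since an un-represented akshara only shows up once training is running, it can help to compare the grapheme inventory of the eval text against the training text beforehand. A rough sketch, assuming the third-party regex module (its \X matches extended grapheme clusters, which only approximate Tesseract's akshara recoding) and hypothetical file names:

# Sketch: report grapheme clusters that occur in the eval text but never in the training text.
import regex  # third-party "regex" module; its \X matches extended grapheme clusters

def graphemes(path):
    with open(path, encoding="utf-8") as f:
        return set(regex.findall(r"\X", f.read()))

train = graphemes("deva.training_text")   # hypothetical file names
evalg = graphemes("deva.eval_text")
missing = sorted(g for g in evalg - train if not g.isspace())
print("in eval text but not in training text:", " ".join(missing))

For the case above, साँ (स + ा + ँ) forms a single cluster, so it would be reported if it never occurs in the training text.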
@Shreeshrii Thanks for your help yesterday.
I encountered the same error (Encoding of string failed! Failure bytes: ffffffe0...) when training langdata/bod (Tibetan). It seems most of the Unicode characters are mis-decoded. I tried replacing the top layers but still encountered the same error.
Since I'm already using the latest langdata, is there anything I can do to correct the encoding? Could you help me?
Thanks very much!
As per @theraysmith
- There is an un-represented Indic grapheme/aksara in the text. In any case it will result in that training image being ignored by the trainer. If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
@zc813
tesstrain.sh has a limit of max_pages 3; you should change that so that the complete training_text is used.
You can review the training_text to check that it is a correct representation of bod (Tibetan).
Also test OCR with the 'Tibetan' script traineddata from both the 'tessdata_best' and 'tessdata_fast' repos.
An authoritative answer can only be provided by @theraysmith.
@Shreeshrii Thanks a lot for the reply! I'll try the solution.
By the way, I tried to decode the error messages and found that most of them start with
ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d
i.e. ༌། (0xf0c 0xf0d). The ༌ (0xf0c) and ། (0xf0d) are already stored separately in my Tibetan.unicharset, so I am confused why they cannot be encoded when they appear together.
Same problem as I mentioned in one of my earlier comments:
While each Unicode character (स ा ँ) is in the Devanagari unicharset, the combined akshara (साँ, छँ) is not.
No answer from @theraysmith yet. He has also marked this as a closed issue.
@zdenop Ray had closed this, so I cannot reopen it.
Please reopen this issue, because the problem is still there. It is related to utf-8/utf-16/utf-32 conversion.
Example:
Encoding of string failed! Failure bytes: cc 84 67 6e 65 Can't encode transcription: 'mamāgne' in language '' utf8 6D 61 6D 61 CC 84 67 6E 65 utf16 006D 0061 006D 0061 0304 0067 006E 0065 hex 006D 0061 006D 0061 0304 0067 006E 0065
The error is related to 'CC 84' in UTF-8, which is '0304' in UTF-16. The string was converted using the converter at https://r12a.github.io/app-conversion/.
@ivanzz1001 Any ideas?
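For reference, 'CC 84' is the UTF-8 encoding of U+0304 COMBINING MACRON, so the 'ā' in 'mamāgne' is stored in decomposed form, while a unicharset built from composed (NFC) text would only contain the single code point U+0101. A quick sketch showing that NFC normalization, the same fix applied to tesstrain later in this thread, composes the sequence:

import unicodedata

s = "mama\u0304gne"                     # 'a' + COMBINING MACRON (UTF-8 ... 61 cc 84 ...), as in the failing line
print([hex(ord(c)) for c in s])         # ... '0x61', '0x304' ...
nfc = unicodedata.normalize("NFC", s)   # composes a + U+0304 into the precomposed 'ā'
print(nfc, [hex(ord(c)) for c in nfc])  # mamāgne ... '0x101' ...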
Can't encode transcription: 'ঢাকা মেটো-গ' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff97
Can't encode transcription: '|ঢাকা মেটেগ' in language ''
^Cmake: *** Deleting file 'data/checkpoints/banglaLPRNew_checkpoint'
Makefile:129: recipe for target 'data/checkpoints/banglaLPRNew_checkpoint' failed
It looks like this was the first report of the encoding problem, so I am re-opening it until it is (hopefully soon) solved.
@stweil After this initial error report, Ray changed the LSTM training process, so some of the comments above will not be applicable to the current code. Regardless, the issue is still there.
On Wed, Oct 9, 2019, Stefan Weil wrote:
See also later errors with "Encoding of string failed": https://github.com/tesseract-ocr/tesseract/issues?utf8=%E2%9C%93&q=%22Encoding+of+string+failed%22
I could fix the encoding errors for tesstrain by normalizing the ground truth texts, see https://github.com/tesseract-ocr/tesstrain/pull/111.
@stweil If I understand the change correctly, it normalizes the ground-truth text within the box file, so the errors are avoided during LSTM training. However, any comparisons against the original ground-truth files using diff, wdiff or other evaluation tools may still show errors for the normalized characters.
Also, this does not address the case when training is done using training_text and fonts.
I suggest adding a new script, normalize.py, which can be used to normalize any training text before beginning the training process, and also adding normalization as part of the 'creating the training text' step in the wiki.
Also, it may be helpful to normalize all existing training_text files in the langdata_lstm and langdata repos.
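A minimal sketch of the kind of script being suggested here; this is not the normalize.py that was later added to tesstrain, and the in-place rewrite and the choice of NFC are assumptions:

#!/usr/bin/env python3
# Minimal sketch: rewrite each training-text file given on the command line as NFC-normalized UTF-8.
import sys
import unicodedata

for path in sys.argv[1:]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    normalized = unicodedata.normalize("NFC", text)
    if normalized != text:
        with open(path, "w", encoding="utf-8") as f:
            f.write(normalized)
        print("normalized", path)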
See https://github.com/tesseract-ocr/tesstrain/pull/111. I just added a normalize.py.
See https://github.com/tesseract-ocr/langdata/pull/148 and https://github.com/tesseract-ocr/langdata_lstm/pull/26, which normalize the training texts. I noticed that more files (mostly *.unicharset) also contain unnormalized Unicode, but I am not sure what to do with those.
Thanks, @stweil.
Possible causes as per Ray:
- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.
- A stray unprintable character (like tab or a control character) in the text.
- There is an un-represented Indic grapheme/aksara in the text.
Additional cause:
- Training text not being normalized.
SOLUTIONS:
There is an un-represented Indic grapheme/aksara in the text. - FIXED by Ray with new norm_mode, combine_lang_model and other related changes
Training text not being normalized. - FIXED by @stweil via https://github.com/tesseract-ocr/tesstrain/commit/6c88fb30f19255795822c519d3ac9bb4b493f50f https://github.com/tesseract-ocr/tesstrain/commit/0dd3bcd04f4d3a368934e223a6da379d4c745f38 https://github.com/tesseract-ocr/tesstrain/commit/1b15bf36a0f9d72cdcdf30b729f43bca1497d1d1 and https://github.com/tesseract-ocr/langdata_lstm/commit/5bc47327cea4afdac60883f3d5d1823d678d0ff1 https://github.com/tesseract-ocr/langdata/commit/3bf26ebfb48e2b12cb7d7914a1a1237ac4580815
normalize.py can now also be used to show which files contain unnormalized unicode: ./normalize.py -n ... I used that to examine all unpacked traineddata (dawg converted to wordlist) and found that some of it is not normalized.
Which languages? tessdata_best or tessdata_fast?
Here is the list of all unnormalized components (extracted from traineddata):
tessdata/osd/osd.pffmtable
tessdata/osd/osd.unicharset
tessdata/osd/osd.normproto
tessdata/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata/heb/heb.unicharambigs
tessdata/uig/uig.lstm-word-dawg.wordlist
tessdata_best/osd/osd.pffmtable
tessdata_best/osd/osd.unicharset
tessdata_best/osd/osd.normproto
tessdata_best/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_best/uig/uig.lstm-word-dawg.wordlist
tessdata_fast/osd/osd.pffmtable
tessdata_fast/osd/osd.unicharset
tessdata_fast/osd/osd.normproto
tessdata_fast/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_fast/uig/uig.lstm-word-dawg.wordlist
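A quick way to reproduce such a check on extracted text components (wordlists, unicharsets) is to test them for NFC normalization; a sketch that assumes Python 3.8+ for unicodedata.is_normalized and uses errors="ignore" as a crude way to skip binary content:

# Sketch: print the paths of extracted components whose text is not NFC-normalized.
import sys
import unicodedata

for path in sys.argv[1:]:
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    if not unicodedata.is_normalized("NFC", text):
        print(path)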
This happens when control characters such as CHARACTER TABULATION, CARRIAGE RETURN, RIGHT-TO-LEFT MARK [RLM], LEFT-TO-RIGHT MARK [LRM] or NO-BREAK SPACE are present; they are mostly invisible to the naked eye.
With sed (or a python/perl script, whatever you prefer) you can remove or replace them:
# remove CHARACTER TABULATION (tab)
s/\x09//g
# remove CARRIAGE RETURN
s/\x0d//g
# replace NO-BREAK SPACE (U+00A0) with an ordinary space
s/\xc2\xa0/ /g
# remove LEFT-TO-RIGHT MARK (U+200E)
s/\xe2\x80\x8e//g
# remove RIGHT-TO-LEFT MARK (U+200F)
s/\xe2\x80\x8f//g
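The same cleanup written as the python-script alternative mentioned above; the file name is only an example, and the character list simply mirrors the sed expressions:

# Strip tab, carriage return and the LRM/RLM marks, and replace NO-BREAK SPACE with an ordinary space.
def clean(text: str) -> str:
    for ch in ("\t", "\r", "\u200e", "\u200f"):
        text = text.replace(ch, "")
    return text.replace("\u00a0", " ")

with open("deva.training_text", encoding="utf-8") as f:    # example file name
    cleaned = clean(f.read())
with open("deva.training_text", "w", encoding="utf-8") as f:
    f.write(cleaned)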