tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

LSTM: Training - Error msg - Encoding of string failed! #549

Open Shreeshrii opened 7 years ago

Shreeshrii commented 7 years ago
$   training/lstmtraining --model_output ~/tesstutorial/sanskrit2003_from_full/sanskrit2003 \
>   --continue_from ~/tesstutorial/sanskrit2003_from_full/san.lstm \
>   --train_listfile ~/tesstutorial/santrain/san.training_files.txt \
>   --target_error_rate 0.01
Loaded file /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint
Loaded 1746/1746 pages (0-1746) of document /home/shree/tesstutorial/santrain/san.Chandas.exp0.lstmf
Loaded 345/1760 pages (1415-1760) of document /home/shree/tesstutorial/santrain/san.Uttara.exp0.lstmf
Loaded 1814/1814 pages (0-1814) of document /home/shree/tesstutorial/santrain/san.Gargi.exp0.lstmf
Found AVX
Found SSE
At iteration 1808/17200/17229, Mean rms=0.336%, delta=0.129%, char train=0.41%, word train=1.751%, skip ratio=0.2%,  New worst char error = 0.41 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffc2 ffffffa3 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5
Can't encode transcription: व्यतर्कि १४. भवति ३७॥ £ सर्व्व
At iteration 1818/17300/17330, Mean rms=0.334%, delta=0.13%, char train=0.404%, word train=1.632%, skip ratio=0.3%,  wrote checkpoint.
Shreeshrii commented 7 years ago

Still getting the errors with the following version -


 tesseract -v
tesseract 4.00.00alpha-219-gc124f87
 leptonica-1.74
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

Can't encode transcription: सगुनल उठैलका देउता नेउता लवरना लोहमान कुदार
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb9 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff85 ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: बिसहरी सड़िया हड़िया लादना अधसेरी सुबुकना
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbf 20 ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb6 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff87 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 20 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffff81
Can't encode transcription: चूड़ियन बुद्धि गुप्ता शासनमे सुद्धा जँतसार निगुनियाँ
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff87 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff82 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9b ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9f ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: दौड़इलूँ पोथा बोथा मोथा स्वेच्छासँ पार्टी लड़कियन
Shreeshrii commented 7 years ago

Also seen in finetune of Arabic:


lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned \
  --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm \
  --train_listfile ~/tesstutorial/ara/ara.training_files.txt \
  --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt \
  --target_error_rate 0.0001
Loaded file /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 4/4 pages (1-4) of document /home/shree/tesstutorial/aratest/ara.Times_New_Roman.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff8a ffffffd9 ffffff82 ffffffd9 ffffff90 ffffffd8 ffffffaf ffffffd9 ffffff90 ffffffd8 ffffffa7 ffffffd8 ffffffb5 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffaa ffffffd9 ffffff8f ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd8 ffffffa5 ffffffd9 ffffff90 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff84 ffffffd9 ffffff91 ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff90 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd8 ffffffaf ffffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff85 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f ffffffd8 ffffffa1 ffffffd9 ffffff8e ffffffd8 ffffffa7 ffffffd8 ffffffaf ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff8e ffffffd8 ffffffb4 ffffffd9 ffffff8f
Can't encode transcription: نَيقِدِاصَ مْتُنْكُ نْإِ اللَّهِ نِودُ نْمِ مْكُءَادَهَشُ
Loaded 231/231 pages (1-231) of document /home/shree/tesstutorial/ara/ara.Arial_Unicode_MS.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd8 ffffffb1 ffffffd9 ffffff8f ffffffd8 ffffffb5 ffffffd9 ffffff90 ffffffd8 ffffffa8 ffffffd9 ffffff92 ffffffd9 ffffff8a ffffffd9 ffffff8f 20 ffffffd9 ffffff84 ffffffd9 ffffff8e ffffffd8 ffffffa7 20 ffffffd8 ffffffaa ffffffd9 ffffff8d ffffffd8 ffffffa7 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff8f ffffffd8 ffffffb8 ffffffd9 ffffff8f 20 ffffffd9 ffffff8a ffffffd9 ffffff81 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff8f ffffffd9 ffffff83 ffffffd9 ffffff8e ffffffd8 ffffffb1 ffffffd9 ffffff8e ffffffd8 ffffffaa ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd9 ffffff86 ffffffd9 ffffff8f ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: نَورُصِبْيُ لَا تٍامَلُظُ يفِ مْهُكَرَتَوَ مْهِرِونُبِ
Encoding of string failed! Failure bytes: ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 
theraysmith commented 7 years ago

See new section in trainingtesseract-4.00

Shreeshrii commented 7 years ago

The wiki does not seem to have this section:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

TrainingTesseract 4.00 Stefan Weil edited this page 28 days ago · 9 revisions

We have a GitHub outage in India just now; I am not sure whether this is related to that or whether the wiki update is still pending.


Brian51 commented 7 years ago

It is working correctly in Spain. Thank you all for the incredible amount of work that you have done.

amitdo commented 7 years ago

I don't see the changes either.

The wiki can be cloned as a git repo. Ray probably did some edits locally, but didn't 'push' them yet.

theraysmith commented 7 years ago

Changes are pushed now. I got called away yesterday before I was able to do it.


Shreeshrii commented 7 years ago

Encoding of string failed! Failure bytes: 9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e
Can't encode transcription: Møller.     1200Emilie,hansKone.

when trying to train frk

theraysmith commented 7 years ago

The tab character (9) at the beginning of the list of failure bytes is a dead giveaway.


harinath141 commented 7 years ago

@Shreeshrii Is this issue resolved? Because I'm getting the same error when training with the Telugu language.

Shreeshrii commented 7 years ago

Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training


Encoding of string failed! results when the text string for a training image cannot be encoded using the given unicharset.

Possible causes are:

- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.
- A stray unprintable character (like tab or a control character) in the text.
- There is an un-represented Indic grapheme/aksara in the text.

In any case it will result in that training image being ignored by the trainer.

If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
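
As a rough pre-check before training, a short script along the following lines can flag text lines that contain codepoints which do not occur in any unicharset entry. This is only a sketch (the file names in the usage comment are placeholders), and it works at the codepoint level only, so it will not catch the missing grapheme-cluster case discussed further down for Indic scripts.

import sys
import unicodedata

def load_unicharset_chars(path):
    """Collect every codepoint that occurs in any unicharset entry.
    The glyph/cluster is the first space-separated field on each line;
    the first line of the file is just the entry count, so it is skipped."""
    chars = set()
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the count line
        for line in f:
            chars.update(line.split(" ", 1)[0])
    return chars

def report_missing(text_path, unicharset_path):
    known = load_unicharset_chars(unicharset_path)
    with open(text_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            for c in sorted({c for c in line.rstrip("\n") if not c.isspace() and c not in known}):
                print(f"line {lineno}: U+{ord(c):04X} {unicodedata.name(c, '?')} not in unicharset")

if __name__ == "__main__":
    # usage: python check_unicharset_chars.py san.training_text san.unicharset  (placeholder names)
    report_missing(sys.argv[1], sys.argv[2])
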
Shreeshrii commented 7 years ago

@harinath141 If you are getting a lot of these errors during finetune, try 'replace top layer' training instead. You can use the box/tiff pairs generated for finetune. The commands will be similar to the following:

mkdir -p ~/tesstutorial/tellayer_from_tel 

combine_tessdata -e ../tessdata/tel.traineddata \
  ~/tesstutorial/tellayer_from_tel/tel.lstm

lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
  --script_dir ../langdata  --debug_interval 0 \
  --continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --model_output ~/tesstutorial/tellayer_from_tel/tellayer \
  --train_listfile ~/tesstutorial/tel/tel.training_files.txt \
  --target_error_rate 0.01
Shreeshrii commented 7 years ago

~/tesstutorial/tel/ should have your .lstmf files.

harinath141 commented 7 years ago

Thank you @Shreeshrii I'll try to replace top layer

Shreeshrii commented 7 years ago

@harinath141

When you use --debug_interval 0 you will see messages every 100 iterations like the following:

At iteration 45909/58500/58569, Mean rms=0.639%, delta=0.621%, char train=1.861%, word train=13.302%, skip ratio=0%,  wrote checkpoint.

At iteration 45960/58600/58669, Mean rms=0.64%, delta=0.616%, char train=1.844%, word train=12.933%, skip ratio=0%,  wrote checkpoint.

2 Percent improvement time=14052, best error was 3.697 @ 31958
At iteration 46010/58700/58769, Mean rms=0.634%, delta=0.561%, char train=1.686%, word train=12.343%, skip ratio=0%,  New best char error = 1.686 wrote best model:/home/shree/tesstutorial/khmlayer1_from_khm/khm1.686_46010.lstm wrote checkpoint.

When you use --debug_interval -1, messages such as the following will be shown for every iteration:


Iteration 59400: ALIGNED TRUTH : មានរូបឆ្មាំ អេស៊ីលីដា
Iteration 59400: BEST OCR TEXT : មានរូបឆ្មាំ អេស៊ីលីដា
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer_Bold.exp0.lstmf page 53 (Perfect):
Mean rms=0.646%, delta=0.553%, train=1.878%(13.168%), skip ratio=0.1%
Iteration 59401: ALIGNED TRUTH : ឆ្កៀលយកភ្នែក ជួនឆ្លងវគ្គ ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
Iteration 59401: BEST OCR TEXT : ឆ្លៀលយកភ្នែក ជួនឆ្លងវគត ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer.exp0.lstmf page 1 :
Mean rms=0.647%, delta=0.555%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59402: ALIGNED TRUTH : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
Iteration 59402: BEST OCR TEXT : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI_Bold.exp0.lstmf page 56 :
Mean rms=0.647%, delta=0.556%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59403: ALIGNED TRUTH : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍់៖តំបន់ខ្លះ ផ្ទះសម្បែង
Iteration 59403: BEST OCR TEXT : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍៖តំបន់ខ្លះ ផ្ទះសម្បែង
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI.exp0.lstmf page 51 :

Intermediate checkpoint and .lstm files will be written to the output directory, e.g. ~/tesstutorial/tellayer_from_tel. You can also see visual debugging output with ScrollView.

Shreeshrii commented 7 years ago

@theraysmith

I am still getting this error for a new 'replace top layer' training for Devanagari script, where the eval_listfile is based on a different training text, e.g.

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff88 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa5 ffffff8b 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa5 ffffff80 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: वैशाख साल देखि साथै यो साँच्चैको जीवन

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: रूपांतरित जैबुन्निसा केंद्रित छँदा

While each Unicode character (स, ा, ँ) is there in the Devanagari unicharset, the combined akshara (साँ, छँ) is not in the training text/unicharset, though it does occur in the eval text/unicharset.

The training unicharset is of the following format:

3784
NULL 0 NULL 0
Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,69,186,255,892,2138,0,80,892,2058 Common 3625 10 3625 |Broken|0|1   # Broken
र्ध्रु 1 0,64,61,197,280,356,0,0,280,356 Devanagari 18 0 18 र्ध्रु  # र्ध्रु [930 94d 927 94d 930 941 ]x
र्बृ 1 3,64,61,197,181,236,0,0,181,236 Devanagari 18 0 18 र्बृ  # र्बृ [930 94d 92c 943 ]x
श्चु 1 0,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्चु   # श्चु [936 94d 91a 941 ]x
श्चौ 1 3,65,61,255,294,367,0,12,294,355 Devanagari 240 0 240 श्चौ   # श्चौ [936 94d 91a 94c ]x
श्च् 1 3,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्च्   # श्च् [936 94d 91a 94d ]x
य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 8 0 8 य    # य [92f ]x
श्रीः 1 3,74,61,253,295,412,0,12,295,400 Devanagari 240 0 240 श्रीः # श्रीः [936 94d 930 940 903 ]x
ष्ठु 1 0,75,61,197,204,243,0,0,204,243 Devanagari 241 0 241 ष्ठु    # ष्ठु [937 94d 920 941 ]x
ष्ठौ 1 3,75,61,255,247,307,0,0,247,307 Devanagari 241 0 241 ष्ठौ    # ष्ठौ [937 94d 920 94c ]x
स्रैः 1 3,76,61,255,243,449,0,0,243,449 Devanagari 280 0 280 स्रैः  # स्रैः [938 94d 930 948 903 ]x
...

Does this mean that the training text needs to be expanded to include all possible akshara combinations?
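
One way to see in advance which clusters would be affected is to list the grapheme clusters of the eval text that are not present as entries in the training unicharset. The sketch below assumes the third-party regex module (its \X pattern matches grapheme clusters) and placeholder file names; it is only a heuristic, since lstmtraining's actual encoder may still be able to compose some clusters from smaller unicharset entries, and \X clusters are not identical to Indic aksharas (conjuncts are split at the virama), so this mainly catches base-plus-sign cases like साँ.

import sys
import regex  # third-party module (pip install regex); provides the \X grapheme-cluster pattern

def load_unicharset_entries(path):
    """Return the first field of every unicharset line (the unichar itself)."""
    with open(path, encoding="utf-8") as f:
        next(f)  # first line is only the entry count
        return {line.split(" ", 1)[0] for line in f}

def unknown_clusters(text_path, unicharset_path):
    entries = load_unicharset_entries(unicharset_path)
    found = set()
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            for cluster in regex.findall(r"\X", line.strip()):
                # flag multi-codepoint clusters that are not themselves unicharset entries
                if len(cluster) > 1 and cluster not in entries:
                    found.add(cluster)
    return found

if __name__ == "__main__":
    # usage: python list_unknown_clusters.py eval_training_text san.unicharset  (placeholder names)
    for cluster in sorted(unknown_clusters(sys.argv[1], sys.argv[2])):
        print(cluster, " ".join(f"U+{ord(c):04X}" for c in cluster))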

zc813 commented 6 years ago

@Shreeshrii Thanks for your help yesterday. I encountered the same error (Encoding of string failed! Failure bytes: ffffffe0...) when training langdata/bod (Tibetan). It seems most of the Unicode characters are mis-decoded. I tried replacing the top layer but still encountered the same error. Since I'm already using the latest langdata, is there anything I can do to correct the encoding? Could you help me? Thanks very much!

Shreeshrii commented 6 years ago

As per @theraysmith

  • There is an un-represented Indic grapheme/aksara in the text. In any case it will result in that training image being ignored by the trainer. If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.

@zc813

tesstrain.sh has a limit of max_pages 3; you should change that so that the complete training_text is used.

You can review the training_text to check that it is a correct representation of bod (Tibetan).

Also test OCR with the 'Tibetan' script traineddata from both the 'tessdata_best' and 'tessdata_fast' repos.

An authoritative answer can only be provided by @theraysmith.

zc813 commented 6 years ago

@Shreeshrii Thanks a lot for the reply! I'll try the solution.

By the way, I tried to decode the error messages and found most of them start with

ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d

i.e. ༌། (U+0F0C U+0F0D). The ༌ (U+0F0C) and ། (U+0F0D) are already stored separately in my Tibetan.unicharset; I am confused why they cannot be encoded when they appear together.
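
For reference, this is how such a dump can be decoded: the 'Failure bytes' tokens look like sign-extended char values, so keeping only the low byte of each token and decoding the result as UTF-8 recovers the failing text. A minimal sketch:

def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes:' hex dump back into readable text.
    Tokens such as 'ffffffe0' appear to be sign-extended bytes; keep only the low 8 bits."""
    data = bytes(int(token, 16) & 0xFF for token in dump.split())
    return data.decode("utf-8", errors="replace")

if __name__ == "__main__":
    dump = "ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d"
    print(decode_failure_bytes(dump))  # prints the two Tibetan marks U+0F0C U+0F0D: ༌།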

Shreeshrii commented 6 years ago

This is the same problem that I mentioned in one of my earlier comments:

While each Unicode character (स, ा, ँ) is there in the Devanagari unicharset, the combined akshara (साँ, छँ) is not there.

No answer from @theraysmith yet. He has also marked this as a closed issue.

Shreeshrii commented 6 years ago

@zdenop Ray had closed this, so I cannot reopen it.

Please reopen this issue, because the problem is still there. It is related to UTF-8/UTF-16/UTF-32 conversion.

Example:

Encoding of string failed! Failure bytes: cc 84 67 6e 65
Can't encode transcription: 'mamāgne' in language ''
utf8  6D 61 6D 61 CC 84 67 6E 65
utf16 006D 0061 006D 0061 0304 0067 006E 0065
hex   006D 0061 006D 0061 0304 0067 006E 0065

The error is related to 'CC 84' in UTF-8, which is U+0304 (COMBINING MACRON) in UTF-16.

The string was converted using the converter at https://r12a.github.io/app-conversion/
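
The same thing can be reproduced with a few lines of Python: the ground truth contains the decomposed form 'a' + U+0304 (COMBINING MACRON), and NFC normalization folds it into the precomposed 'ā' (U+0101), which is presumably the form the unicharset carries. For illustration only:

import unicodedata

decomposed = "mama\u0304gne"              # 'a' followed by COMBINING MACRON, as in the failing line
composed = unicodedata.normalize("NFC", decomposed)

print([hex(ord(c)) for c in decomposed])  # ['0x6d', '0x61', '0x6d', '0x61', '0x304', '0x67', '0x6e', '0x65']
print([hex(ord(c)) for c in composed])    # ['0x6d', '0x61', '0x6d', '0x101', '0x67', '0x6e', '0x65']
print(composed == "mamāgne")              # True (precomposed ā, U+0101)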

Shreeshrii commented 6 years ago

https://stackoverflow.com/questions/42012563/convert-unicode-code-points-to-utf-8-and-utf-32

Shreeshrii commented 6 years ago

https://github.com/tesseract-ocr/tesseract/blob/a80a8f17bb32be8bdd5124057219620b711491a7/src/lstm/lstmtrainer.cpp#L785

Shreeshrii commented 6 years ago

@ivanzz1001 Any ideas?

xhuvom commented 6 years ago

Can't encode transcription: 'ঢাকা মেটো-গ' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff97
Can't encode transcription: '|ঢাকা মেটেগ' in language ''
^Cmake: *** Deleting file 'data/checkpoints/banglaLPRNew_checkpoint'
Makefile:129: recipe for target 'data/checkpoints/banglaLPRNew_checkpoint' failed

stweil commented 5 years ago

It looks like this was the first report of the encoding problem, so I re-open it until it is (hopefully soon) solved.

stweil commented 5 years ago

See also later errors with "Encoding of string failed".

Shreeshrii commented 5 years ago

@stweil After this initial error report, Ray changed the LSTM training process, so some of the comments above will not apply to the current code. Regardless, the issue is still there.


stweil commented 5 years ago

I was able to fix the encoding errors for tesstrain by normalizing the ground truth texts; see https://github.com/tesseract-ocr/tesstrain/pull/111.

Shreeshrii commented 5 years ago

@stweil If I understand the change correctly, this normalizes the ground-truth text within the box file, so the errors will be avoided during LSTM training.

However, any comparisons against the original ground truth files using diff, wdiff or other evaluation tools may still show errors for the normalized characters.

Also, this does not address the case when training is done using training_text and fonts.

I suggest adding a new script, normalize.py, which can be used to normalize any training text before beginning the training process, and also adding normalization to the training-text creation process described in the wiki.

Also, it may be helpful to normalize all existing training_text files in the langdata_lstm and langdata repos.
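
A sketch of what such a normalization step could look like (the actual normalize.py added to tesstrain may differ, and the exact normalization form the training tools expect should be confirmed; NFC is assumed here):

import sys
import unicodedata

def normalize_file(path, check_only=False):
    """NFC-normalize a UTF-8 text file in place, or only report whether it needs it."""
    with open(path, encoding="utf-8") as f:
        original = f.read()
    normalized = unicodedata.normalize("NFC", original)
    if original == normalized:
        print(f"{path}: already normalized")
    elif check_only:
        print(f"{path}: contains unnormalized text")
    else:
        with open(path, "w", encoding="utf-8") as f:
            f.write(normalized)
        print(f"{path}: rewritten in NFC form")

if __name__ == "__main__":
    # e.g. python normalize_text.py --check san.training_text   (placeholder names)
    check = "--check" in sys.argv[1:]
    for path in (a for a in sys.argv[1:] if a != "--check"):
        normalize_file(path, check_only=check)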

stweil commented 5 years ago

See https://github.com/tesseract-ocr/tesstrain/pull/111. I just added a normalize.py.

stweil commented 5 years ago

See https://github.com/tesseract-ocr/langdata/pull/148 and https://github.com/tesseract-ocr/langdata_lstm/pull/26 which normalize the training texts. I noticed that more files (mostly *.unicharset) also contain unnormalized unicode, but I am not sure what to do with those.

Shreeshrii commented 5 years ago

Thanks, @stweil.

Shreeshrii commented 5 years ago

Possible causes as per Ray:

Additional cause:

SOLUTIONS:

stweil commented 5 years ago

normalize.py can now also be used to show which files contain unnormalized unicode: ./normalize.py -n .... I used that to examine all unpacked traineddata (dawg converted to wordlist) and found that some of it is not normalized.

Shreeshrii commented 5 years ago

(dawg converted to wordlist) and found that some of it is not normalized.

Which languages? tessdata_best or tessdata_fast?

stweil commented 5 years ago

Here is the list of all unnormalized components (extracted from traineddata):

tessdata/osd/osd.pffmtable
tessdata/osd/osd.unicharset
tessdata/osd/osd.normproto
tessdata/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata/heb/heb.unicharambigs
tessdata/uig/uig.lstm-word-dawg.wordlist
tessdata_best/osd/osd.pffmtable
tessdata_best/osd/osd.unicharset
tessdata_best/osd/osd.normproto
tessdata_best/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_best/uig/uig.lstm-word-dawg.wordlist
tessdata_fast/osd/osd.pffmtable
tessdata_fast/osd/osd.unicharset
tessdata_fast/osd/osd.normproto
tessdata_fast/script/Arabic/Arabic.lstm-word-dawg.wordlist
tessdata_fast/uig/uig.lstm-word-dawg.wordlist
johnlockejrr commented 2 months ago

This happens when certain control/format characters are present in the ground truth, such as CHARACTER TABULATION (tab), CARRIAGE RETURN, RIGHT-TO-LEFT MARK (RLM), LEFT-TO-RIGHT MARK (LRM), or NO-BREAK SPACE; they are mostly not visible to the naked eye.

So with sed (or a Python/Perl script, whatever you prefer) you can remove or replace them, e.g. with GNU sed expressions like these:

s/\x09//g           # CHARACTER TABULATION (tab)
s/\x0d//g           # CARRIAGE RETURN
s/\xc2\xa0/ /g      # NO-BREAK SPACE -> plain space
s/\xe2\x80\x8e//g   # LEFT-TO-RIGHT MARK (U+200E)
s/\xe2\x80\x8f//g   # RIGHT-TO-LEFT MARK (U+200F)
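
A rough Python equivalent of the sed cleanup above, for anyone who prefers it; the file names on the command line are placeholders:

import sys

# Characters that commonly break the encoder: tab, carriage return,
# left-to-right mark and right-to-left mark are removed; no-break space becomes a plain space.
REPLACEMENTS = {
    "\t": "",
    "\r": "",
    "\u200e": "",   # LEFT-TO-RIGHT MARK
    "\u200f": "",   # RIGHT-TO-LEFT MARK
    "\u00a0": " ",  # NO-BREAK SPACE
}

def clean(text: str) -> str:
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    return text

if __name__ == "__main__":
    # usage: python strip_invisibles.py file1.gt.txt file2.gt.txt ...
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        with open(path, "w", encoding="utf-8") as f:
            f.write(clean(text))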