Closed: jayawantkarale closed this issue 4 years ago.
@jayawantkarale Without sample data, it is not possible to assist you, since we cannot reproduce the erroneous behavior. Could you provide us with a minimal example which causes your issue?
@wrznr Thanks for your prompt reply. This erroneous behavior appears while training on a large dataset. It is not possible to upload the whole dataset, so I am trying to reproduce the same behavior on a small sample. Once I get the error on the small dataset, I will upload the sample data.
Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''
Try to find the above text in your large dataset. Then using uniview or similar utility check whether there is any unprintable/invisible character in the line.
I am able to reproduce the same behaviour on a smaller dataset; I am attaching the dataset along with its log file. It is now attaching a 0 at the start of most of the lines and giving the "can't encode transcription" error.
Can't encode transcription: ' 0MahāBhā. xii. 47. 42; 6 A ii 1/124 of a day द्यु हेयं पर्व चेत् पादे पादर्स्रिशत्तु' in language '' page no : vol1_101_1-029.exp0.tif
Can't encode transcription: ' 0MahāBhā. ii. 8. 14; विजिती वीतिहोत्रोंऽशः [ v. 1 ºञश्च ] MahāBhā. i. 1. 173.' in language '' page no : vol1_101_1-048.exp0.tif
Can't encode transcription: ' 0day as a meaning found in L. 9 adv. (aṁśena aṁśe) in part, partly,' in language '' page no : vol1_101_2-004.exp0.tif
Can't encode transcription: ' 0नुकूलता Kād. 159.1; उभयाश्रयत्वात् पथ्यालक्षणस्य विपुलायास्तत्रांशेनापि प्रवेशो' in language '' page no : vol1_101_2-006.exp0.tif
Can't encode transcription: ' 0221. 16 (aṁśāni=probably aṁśvālyāni=vālyāṁśāḥ) सर्वेत्र चाष्टंसमशं कल्कस्य' in language '' page no : vol1_101_2-019.exp0.tif
Can't encode transcription: ' 0तत्समं तोरणं चान्यन्न्यस्येद् भूमौ द्विजांशकम् PauṣkS. 4. 123; F v part or' in language '' page no : vol1_102_2-026.exp0.tif
Can't encode transcription: ' 0तन्मूले व्ययहर्म्यनामसहिते भक्ते त्रिभिस्त्वंशकः ।. स्यादिन्द्रो यमभूपतिक्रमवशात् Rāja-' in language '' page no : vol1_102_2-033.exp0.tif
Can't encode transcription: ' 0द्रव्यत्वेऽप्यंशकल्पनमाकाशस्य TattvPradī. ( A.) 198. 7 (on 2. 48) .' in language '' page no : vol1_103_1-029.exp0.tif
Can't encode transcription: ' 0विभजेल्लब्धं भवेद्वर्गः ĀryaSi. 15. 16 (147)' in language '' page no : vol1_103_2-003.exp0.tif
Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''
Try to find the above text in your large dataset. Then using uniview or similar utility check whether there is any unprintable/invisible character in the line.
I have checked it with BabelPad Viewer, but it is not showing any unprintable/invisible characters.
Thank you for sharing the test dataset. Please also share the command you used to invoke the Makefile, and your Tesseract version.
From your log file:
Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/san/san9_test.lstm-unicharset data/san9_test/my.unicharset "data/san9_test/unicharset"
Loaded unicharset of size 225 from file data/san/san9_test.lstm-unicharset
Loaded unicharset of size 3 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.
Note the unicharset of size 3 from file data/san9_test/my.unicharset.
THIS IS THE ISSUE.
I tried with just a few lines from your dataset with the following command, and it seems to work.
make training MODEL_NAME=san9_mwe LANG_TYPE=Indic START_MODEL=san TESSDATA=/home/ubuntu/tessdata_best
https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191
find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs cat | sort | uniq > "$@"
When the single-line ground-truth texts are concatenated, they become one huge line, so an error in even one file makes the unicharset generation fail. I changed the above line in the Makefile to add a line break after each ground-truth file.
find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"
Now the unicharset is being generated from the training text.
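To see why the missing trailing newlines matter, here is a small self-contained demonstration (the directory and files below are hypothetical, created only for the illustration):

```shell
# Two ground-truth files whose only line lacks a trailing newline.
mkdir -p /tmp/gt-demo
printf 'line one' > /tmp/gt-demo/a.gt.txt
printf 'line two' > /tmp/gt-demo/b.gt.txt

# Original Makefile rule: the transcriptions fuse into one unterminated line.
find /tmp/gt-demo -name '*.gt.txt' | xargs cat | wc -l
# -> 0 (wc -l counts newlines, and there are none)

# Fixed rule: echo '' terminates each file's content with a newline.
find /tmp/gt-demo -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" | wc -l
# -> 2
```

Note that the fixed rule adds an extra blank line for files that already end in a newline; that should be harmless for unicharset extraction.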
unicharset_extractor --output_unicharset "data/san9_test/my.unicharset" --norm_mode 2 "data/san9_test/all-gt"
Bad box coordinates in boxfile string! नाम् । रसगुणबलिभिर्विधाय RaseCi. 2. 10; त्वक्पथ्ययोः समावंशौ शशिभागार्धसं-
Extracting unicharset from plain text file data/san9_test/all-gt
Invalid start of grapheme sequence:M=0x943
Normalization failed for string 'iii. 3.65; 8 C a quarter [अंशः] पादार्द्धयोदेृष्टः ParyāRaMā. 1539; 8 D a half;'
Other case I of i is not in unicharset
Other case O of o is not in unicharset
Other case Ṁ of ṁ is not in unicharset
Other case Ṇ of ṇ is not in unicharset
Other case Ḍ of ḍ is not in unicharset
Other case Ṛ of ṛ is not in unicharset
Other case Z of z is not in unicharset
Other case Q of q is not in unicharset
Other case X of x is not in unicharset
Other case Ū of ū is not in unicharset
Other case Ḥ of ḥ is not in unicharset
Other case Ṅ of ṅ is not in unicharset
Other case Ṭ of ṭ is not in unicharset
Other case Ḷ of ḷ is not in unicharset
Other case Ñ of ñ is not in unicharset
Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/Devanagari/san9_test.lstm-unicharset data/san9_test/my.unicharset "data/san9_test/unicharset"
Loaded unicharset of size 217 from file data/Devanagari/san9_test.lstm-unicharset
Loaded unicharset of size 162 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.
@jayawantkarale
I have checked it with BabelPad Viewer, but it is not showing any unprintable/invisible characters.
The generated unicharset has a line for U+FEFF (BOM); I see that it is part of many of the gt lines. This can also cause the error.
0 0,255,0,255,0,0,0,0,0,0 Common 3 18 3 # [feff ]
You can search for it in your groundtruth with
grep -rl $'\xEF\xBB\xBF' .
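If the grep above finds BOMs, one way to strip them in place is with GNU sed (the \xHH escapes and -i flag as used here are GNU-specific; the demo directory and file below are hypothetical):

```shell
# Create a demo ground-truth file that starts with a UTF-8 BOM (EF BB BF).
mkdir -p /tmp/bom-demo
printf '\357\273\277some transcription\n' > /tmp/bom-demo/line1.gt.txt

# Strip a leading BOM from the first line of every .gt.txt file, in place.
# Files without a BOM pass through unchanged.
find /tmp/bom-demo -name '*.gt.txt' -exec sed -i '1s/^\xEF\xBB\xBF//' {} +

# Verify: grep no longer finds the BOM byte sequence.
grep -rl $'\xEF\xBB\xBF' /tmp/bom-demo || echo 'no BOM found'
```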
@kba @wrznr @stweil Can the normalize process be used to strip the ground truth of BOMs?
@Shreeshrii This could indeed be a problem! According to https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python, there is a special encoding for UTF-8 with BOM: utf-8-sig. However, I do not think that we can add a corresponding conversion by default.
@jayawantkarale Could you try to remove the BOMs and check whether the problem persists?
However, I do not think that we can add a corresponding conversion by default.
Decoding as utf-8-sig would not hurt to do by default, IIUC. If it's a UTF-8 string without a BOM, the behavior should be the same as decoding as utf-8.
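A quick sketch of that claim: Python's utf-8-sig codec strips a leading BOM when present and is otherwise identical to plain utf-8, so decoding with it would be safe either way.

```shell
# utf-8-sig drops a leading BOM; without a BOM it behaves like utf-8.
python3 - <<'EOF'
with_bom = b'\xef\xbb\xbftext'
without_bom = b'text'
assert with_bom.decode('utf-8-sig') == 'text'      # BOM stripped
assert without_bom.decode('utf-8-sig') == 'text'   # unchanged
print('ok')
EOF
```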
@Shreeshrii As suggested, I made the changes in the Makefile.
find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"
Now it is working on the test data I sent, but while running on the full dataset it again shows a 0 at the start of some lines.
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8c 75 2e 20 32 35 33 2e 20 31 Can't encode transcription: ' 0( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaMu. 253. 1' in language ''
And I actually don't understand how to remove the BOM from a file. I need to check again after removing the BOM.
Thanks for the quick help @Shreeshrii @wrznr @kba
@Shreeshrii I removed the BOM from the ground-truth files. Now I am not getting the 0 at the start of the lines; instead, for those lines I am getting the following error.
Can't encode transcription: ' ( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaMu. 253. 1' in language ''
Thanks a lot, it solved my problem @Shreeshrii @wrznr @kba
@jayawantkarale Please share the error log from training. It might help in finding out why certain lines still get the "can't encode transcription" error even though the unicharset is generated from the training text.
Should we implement decoding from utf-8-sig to prevent BOM issues in the future?
Good question. I am really not sure. If I understood you correctly, it wouldn't hurt, but somehow I am still not a huge fan of it.
@Shreeshrii @stweil What do you think?
It could be implemented in the Tesseract code. But maybe just failing with a reasonable error message would be better. That helps to get uniform GT texts (instead of supporting many variants, which also cause problems elsewhere).
I have attached the log file from training: san9_test.log
Line 23161: Can't encode transcription: 'दशा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम् ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaMu. 253. 1' in language ''
Line 27514: Can't encode transcription: 'दशा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम् ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
Line 27556: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaMu. 253. 1' in language ''
Line 27625: Can't encode transcription: 'दशा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम् ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
The log shows two text lines getting the error.
Using https://r12a.github.io/uniview/, it seems to me that the problem may be caused by a ZWNJ.
004D LATIN CAPITAL LETTER M
0061 LATIN SMALL LETTER A
006E LATIN SMALL LETTER N
0076 LATIN SMALL LETTER V
0061 LATIN SMALL LETTER A
004D LATIN CAPITAL LETTER M
200C ZERO WIDTH NON-JOINER
0075 LATIN SMALL LETTER U
and
0926 DEVANAGARI LETTER DA
200C ZERO WIDTH NON-JOINER
0936 DEVANAGARI LETTER SHA
093E DEVANAGARI VOWEL SIGN AA
0020 SPACE
Please try again after removing the ZWNJ (U+200C) from the ground truth and see if it works.
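One way to locate and strip ZWNJ (U+200C, UTF-8 bytes E2 80 8C) is sketched below, assuming GNU grep and sed; the demo file and its contents are hypothetical. Since ZWNJ can be genuinely needed in the ground truth, it is safer to apply this only to the specific lines that fail rather than across the whole dataset.

```shell
# Demo ground-truth line containing a ZWNJ between 'Manva' and 'Mu'.
mkdir -p /tmp/zwnj-demo
printf 'Manva\342\200\214Mu. 253. 1\n' > /tmp/zwnj-demo/line.gt.txt

# Find files that contain a ZWNJ.
grep -rl $'\xE2\x80\x8C' /tmp/zwnj-demo

# Delete every ZWNJ in place (GNU sed).
find /tmp/zwnj-demo -name '*.gt.txt' -exec sed -i 's/\xE2\x80\x8C//g' {} +
```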
@stweil I think Tesseract's normalization process removes ZWNJ from the text. Could that be causing this issue?
@jayawantkarale Please also share the generated unicharset.
@Shreeshrii ZWNJ is required, as it is needed in prachin (ancient) Sanskrit books.
Is ZWNJ being used to create old-style ligatures?
Please share the images and groundtruth for the two lines in error:
Line 23161: Can't encode transcription: 'दशा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम् ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaMu. 253. 1' in language ''
@Shreeshrii ZWNJ is not required in these lines; after removing it, I am not getting any error. But in certain lines ZWNJ is required, and in those lines I am not getting an error, so we cannot remove ZWNJ from all lines.
I will share images where we use ZWNJ but do not get an error.
@Shreeshrii These sample files contain the ZWNJ character, but I am still not getting an error.
@jayawantkarale Sorry it took me so long to look at the files. I see that ZWNJ is being used for an explicit halant (virama) in the middle of words.
I am curious to know about the results of your training. Please do share when completed.
One suggestion:
The ground truth needs to be reviewed carefully, otherwise the training results will be wrong. E.g., in the sample that you shared, the 'small u maatraa' is used for 'roopam' twice, when it actually needs to be the 'uu maatraa'.
aspects अस्ति भाति प्रियं रूपं नाम चेत्यंशपञ्चकम् । आद्यत्रयं ब्रह्मरुपं जगद्रुपं
needs to have ब्रह्मरूपं जगद्रूपं
instead of ब्रह्मरुपं जगद्रुपं.
Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''
The above error is displayed during the training process. In the ground-truth text we provided the input 'अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-', but during training a 0 is appended at the start of the line. This also adds a 0 after performing OCR extraction using the newly generated traineddata. Why is it adding a 0 at the start of the ground-truth line? Please help me resolve this issue.