tesseract-ocr / tesstrain

Train Tesseract LSTM with make

Can't encode transcription: adding 0 automatically at start of ground truth file. #172

Closed by jayawantkarale 4 years ago

jayawantkarale commented 4 years ago

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

The above error is displayed during the training process. The ground truth text we provided is 'अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-', but during training a 0 is prepended to the line. The same 0 also appears in the OCR output produced with the newly generated traineddata. Why is a 0 being added at the start of the ground truth line? Please help me resolve the issue.

wrznr commented 4 years ago

@jayawantkarale Without sample data, it is not possible to assist you since we cannot reproduce the erroneous behavior. Could you provide us with a minimal example that causes your issue?

jayawantkarale commented 4 years ago

@wrznr Thanks for your prompt reply. This erroneous behavior appears while training on a large dataset. It is not possible to upload the whole dataset, so I am trying to reproduce the same behavior on a small sample. Once I get the error on a small dataset, I will upload the sample data.

Shreeshrii commented 4 years ago

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

Try to find the above text in your large dataset. Then using uniview or similar utility check whether there is any unprintable/invisible character in the line.
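
A check like the one suggested above can also be scripted. The following is a minimal sketch, not part of tesstrain: the helper name `invisible_chars` and the sample string are made up for illustration. It reports every Unicode format (Cf) or control (Cc) character in a line, which covers BOM, ZWNJ and ZWJ.

```python
import unicodedata


def invisible_chars(line):
    """Return (index, code point, name) for each format or control character in line."""
    hits = []
    for i, ch in enumerate(line):
        # Cf = format characters (BOM, ZWNJ, ZWJ, ...), Cc = control characters.
        # Newline and tab are skipped because they are expected in text files.
        if unicodedata.category(ch) in ("Cf", "Cc") and ch not in "\n\t":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits


# Hypothetical sample: a BOM at the start and a ZWNJ inside a Latin word.
sample = "\ufeffManvaM\u200cu. 253. 1"
found = invisible_chars(sample)
for hit in found:
    print(hit)
```

Running it over each line of a `.gt.txt` file shows the position of any character that a text editor would render as nothing.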

jayawantkarale commented 4 years ago

I am able to reproduce the same behaviour on a smaller dataset, and I am attaching the dataset along with its log file. Now a 0 is added at the start of most of the lines, and training gives the error "Can't encode transcription".

Can't encode transcription: ' 0MahāBhā. xii. 47. 42; 6 A ii 1/124 of a day द्यु हेयं पर्व चेत् पादे पादर्स्रिशत्तु' in language '' page no : vol1_101_1-029.exp0.tif

Can't encode transcription: ' 0MahāBhā. ii. 8. 14; विजिती वीतिहोत्रोंऽशः [ v. 1 ºञश्च ] MahāBhā. i. 1. 173.' in language '' page no : vol1_101_1-048.exp0.tif

Can't encode transcription: ' 0day as a meaning found in L. 9 adv. (aṁśena aṁśe) in part, partly,' in language '' page no : vol1_101_2-004.exp0.tif

Can't encode transcription: ' 0नुकूलता Kād. 159.1; उभयाश्रयत्वात् पथ्यालक्षणस्य विपुलायास्तत्रांशेनापि प्रवेशो' in language '' page no : vol1_101_2-006.exp0.tif

Can't encode transcription: ' 0221. 16 (aṁśāni=probably aṁśvālyāni=vālyāṁśāḥ) सर्वेत्र चाष्टंसमशं कल्कस्य' in language '' page no : vol1_101_2-019.exp0.tif

Can't encode transcription: ' 0तत्समं तोरणं चान्यन्न्यस्येद् भूमौ द्विजांशकम् PauṣkS. 4. 123; F v part or' in language '' page no : vol1_102_2-026.exp0.tif

Can't encode transcription: ' 0तन्मूले व्ययहर्म्यनामसहिते भक्ते त्रिभिस्त्वंशकः ।. स्यादिन्द्रो यमभूपतिक्रमवशात् Rāja-' in language '' page no : vol1_102_2-033.exp0.tif

Can't encode transcription: ' 0द्रव्यत्वेऽप्यंशकल्पनमाकाशस्य TattvPradī. ( A.) 198. 7 (on 2. 48) .' in language '' page no : vol1_103_1-029.exp0.tif

Can't encode transcription: ' 0विभजेल्लब्धं भवेद्वर्गः ĀryaSi. 15. 16 (147)' in language '' page no : vol1_103_2-003.exp0.tif

san9_test-ground-truth.zip san9_test.log

jayawantkarale commented 4 years ago

Can't encode transcription: ' 0अंशक्लृप्ति ( aṁśa-kḷpti) f. A division into parts औपाधिक्यंशक्लृप्तिर्घटग-' in language ''

Try to find the above text in your large dataset. Then using uniview or similar utility check whether there is any unprintable/invisible character in the line.

I have checked it with BabelPad Viewer, but it does not show any unprintable/invisible character.

Shreeshrii commented 4 years ago

Thank you for sharing the test dataset. Please also share the command you used to invoke the Makefile and your Tesseract version.

Shreeshrii commented 4 years ago

From your log file:

Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/san/san9_test.lstm-unicharset data/san9_test/my.unicharset  "data/san9_test/unicharset"
Loaded unicharset of size 225 from file data/san/san9_test.lstm-unicharset
Loaded unicharset of size 3 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.

unicharset of size 3 from file data/san9_test/my.unicharset

THIS IS THE ISSUE.

I tried with just a few lines from your dataset with the following command and it seems to work.

    make training MODEL_NAME=san9_mwe LANG_TYPE=Indic START_MODEL=san TESSDATA=/home/ubuntu/tessdata_best

Shreeshrii commented 4 years ago

https://github.com/tesseract-ocr/tesstrain/blob/master/Makefile#L191

    find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs cat | sort | uniq > "$@"

When single-line ground truth files that lack a trailing newline are concatenated, they become one huge line, so an error in even one file makes the unicharset generation fail. I changed the above line in the Makefile to add a line break after each ground truth file.

    find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"

Now the unicharset is being generated from the training text.

unicharset_extractor --output_unicharset "data/san9_test/my.unicharset" --norm_mode 2 "data/san9_test/all-gt"
Bad box coordinates in boxfile string! नाम् । रसगुणबलिभिर्विधाय RaseCi. 2. 10; त्वक्पथ्ययोः समावंशौ शशिभागार्धसं-
Extracting unicharset from plain text file data/san9_test/all-gt
Invalid start of grapheme sequence:M=0x943
Normalization failed for string 'iii. 3.65; 8 C a quarter [अंशः] पादार्द्धयोदेृष्टः ParyāRaMā. 1539; 8 D a half;'
Other case I of i is not in unicharset
Other case O of o is not in unicharset
Other case Ṁ of ṁ is not in unicharset
Other case Ṇ of ṇ is not in unicharset
Other case Ḍ of ḍ is not in unicharset
Other case Ṛ of ṛ is not in unicharset
Other case Z of z is not in unicharset
Other case Q of q is not in unicharset
Other case X of x is not in unicharset
Other case Ū of ū is not in unicharset
Other case Ḥ of ḥ is not in unicharset
Other case Ṅ of ṅ is not in unicharset
Other case Ṭ of ṭ is not in unicharset
Other case Ḷ of ḷ is not in unicharset
Other case Ñ of ñ is not in unicharset
Wrote unicharset file data/san9_test/my.unicharset
merge_unicharsets data/Devanagari/san9_test.lstm-unicharset data/san9_test/my.unicharset  "data/san9_test/unicharset"
Loaded unicharset of size 217 from file data/Devanagari/san9_test.lstm-unicharset
Loaded unicharset of size 162 from file data/san9_test/my.unicharset
Wrote unicharset file data/san9_test/unicharset.
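
The effect of the Makefile change described in this comment can be illustrated outside of make. This is a minimal sketch under stated assumptions: `build_all_gt` is a hypothetical helper, the file names are invented, and it is not how tesstrain builds its training text. It writes one line per ground truth line regardless of whether a file ends with a newline.

```python
import tempfile
from pathlib import Path


def build_all_gt(ground_truth_dir, output_path):
    """Collect all *.gt.txt lines into one file, one ground truth line per output line."""
    lines = []
    for gt_file in sorted(Path(ground_truth_dir).rglob("*.gt.txt")):
        text = gt_file.read_text(encoding="utf-8")
        # splitlines() separates lines even when the last one has no trailing newline.
        lines.extend(line for line in text.splitlines() if line.strip())
    Path(output_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Demo with two made-up files; the first has no trailing newline.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.gt.txt").write_text("first line", encoding="utf-8")
    Path(tmp, "b.gt.txt").write_text("second line\n", encoding="utf-8")
    out = Path(tmp, "all-gt")
    count = build_all_gt(tmp, out)
    merged = out.read_text(encoding="utf-8")

print(count, repr(merged))
```

With plain `cat`, the two demo files would have been joined into the single line `first linesecond line`.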

Shreeshrii commented 4 years ago

@jayawantkarale

I have checked it with BabelPad Viewer, but it does not show any unprintable/invisible character.

The generated unicharset has a line for feff (BOM) - I see that it is part of many of the gt lines. This can also cause the error.

 0 0,255,0,255,0,0,0,0,0,0 Common 3 18 3  #  [feff ]

You can search for it in your groundtruth with

grep -rl $'\xEF\xBB\xBF' .

@kba @wrznr @stweil Can the normalize process be used to strip the BOM from the groundtruth?

wrznr commented 4 years ago

@Shreeshrii This could indeed be a problem! According to https://stackoverflow.com/questions/8898294/convert-utf-8-with-bom-to-utf-8-with-no-bom-in-python, there is a special encoding for UTF-8 with BOM, utf-8-sig. However, I do not think that we can add a corresponding conversion per default.

@jayawantkarale Could you try to remove the BOMs and check whether the problem persists?

kba commented 4 years ago

However, I do not think that we can add a corresponding conversion per default.

Decoding as utf-8-sig would not hurt to do by default IIUC. If it's a UTF-8 string without BOM, the behavior should be the same as decoding as utf-8.
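
The behavior described here can be checked directly in Python. The snippet below is only an illustration with a made-up sample string, not code from tesstrain. It decodes the same bytes with and without a BOM using both codecs.

```python
import codecs

text = "MahāBhā. xii. 47. 42"
plain = text.encode("utf-8")
with_bom = codecs.BOM_UTF8 + plain  # bytes EF BB BF followed by the text

decoded_plain = with_bom.decode("utf-8")    # the BOM survives as U+FEFF
decoded_sig = with_bom.decode("utf-8-sig")  # one leading BOM is dropped

print(repr(decoded_plain[0]))
print(decoded_sig == text)
print(plain.decode("utf-8-sig") == plain.decode("utf-8"))
```

Note that `utf-8-sig` only drops a BOM at the very start of the data. A BOM that ends up in the middle of a concatenated file is still decoded as U+FEFF.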

jayawantkarale commented 4 years ago

@Shreeshrii As suggested, I made the change in the Makefile:

    find $(GROUND_TRUTH_DIR) -name '*.gt.txt' | xargs -I{} sh -c "cat {}; echo ''" > "$@"

Now it works on the test data I sent, but when running on the full dataset it again shows a 0 at the start of some lines.

Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8c 75 2e 20 32 35 33 2e 20 31
Can't encode transcription: ' 0( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

I don't actually understand how to remove the BOM from a file. I will check again after removing it.

Thanks for the quick help @Shreeshrii @wrznr @kba
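
The "Failure bytes" in the log above can be turned back into characters to see where encoding stopped. This is a sketch under one assumption: the `ffffff` prefixes are taken to be sign extension from printing signed bytes as hex, so only the low 8 bits of each value are kept. The helper name `decode_failure_bytes` is made up.

```python
def decode_failure_bytes(dump):
    """Decode a 'Failure bytes' hex dump as UTF-8, keeping the low 8 bits of each value."""
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # The dump may be cut off mid-character, so undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")


dump = "ffffffe2 ffffff80 ffffff8c 75 2e 20 32 35 33 2e 20 31"
text = decode_failure_bytes(dump)
print([f"U+{ord(ch):04X}" for ch in text[:2]])  # ['U+200C', 'U+0075']
print(repr(text))
```

Under that assumption, the failing position is a U+200C character followed by `u. 253. 1`, which matches the end of the transcription in the error message.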

Shreeshrii commented 4 years ago

See https://stackoverflow.com/questions/9100728/remove-multiple-boms-from-a-file
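
As an alternative to the shell approaches in the linked answer, the BOMs can be removed with a short script. This is a minimal sketch, assuming the ground truth files are UTF-8 and match `*.gt.txt`; `strip_boms` and the demo file names are hypothetical. It rewrites files in place, so work on a copy of the data.

```python
import codecs
import tempfile
from pathlib import Path


def strip_boms(ground_truth_dir):
    """Delete every UTF-8 BOM (bytes EF BB BF) from *.gt.txt files; return changed names."""
    changed = []
    for gt_file in sorted(Path(ground_truth_dir).rglob("*.gt.txt")):
        data = gt_file.read_bytes()
        # Working on bytes leaves line endings and all other content untouched.
        cleaned = data.replace(codecs.BOM_UTF8, b"")
        if cleaned != data:
            gt_file.write_bytes(cleaned)
            changed.append(gt_file.name)
    return changed


# Demo on two made-up files, one of which starts with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    dirty = Path(tmp, "page-001.gt.txt")
    clean = Path(tmp, "page-002.gt.txt")
    dirty.write_bytes(codecs.BOM_UTF8 + "MahāBhā. ii. 8. 14\n".encode("utf-8"))
    clean.write_bytes("day as a meaning found in L.\n".encode("utf-8"))
    changed = strip_boms(tmp)
    dirty_after = dirty.read_bytes()

print(changed)
```

Because the replacement covers the whole file, it also removes BOMs that are not at the start, which `utf-8-sig` decoding alone would leave in place.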

jayawantkarale commented 4 years ago

@Shreeshrii I removed the BOM from the ground-truth files. Now I am not getting a 0 at the start of the line, but for those lines I am getting the following error.

Can't encode transcription: ' ( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

Thanks a lot, it solved my problem @Shreeshrii @wrznr @kba

Shreeshrii commented 4 years ago

@jayawantkarale please share the error log from training. It might be helpful in finding out why certain lines are still getting the error 'can't encode transcription' even though the unicharset is generated from the training text.

kba commented 4 years ago

Should we implement decoding from utf-8-sig to prevent BOM issues in the future?

wrznr commented 4 years ago

Good question. I am really not sure. If I got you correctly it wouldn't hurt but somehow I am still not a huge fan of it.

@Shreeshrii @stweil What do you think?

stweil commented 4 years ago

It could be implemented in the Tesseract code. But maybe just failing with a reasonable error message would be better. That helps to get uniform GT texts (instead of supporting many variants which also cause problems elsewhere).

jayawantkarale commented 4 years ago

I have attached the log file from training: san9_test.log

Shreeshrii commented 4 years ago

    Line 23161: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''
    Line 27514: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
    Line 27556: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''
    Line 27625: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''

The log shows two text lines getting the error.

Using https://r12a.github.io/uniview/, it seems to me that the problem may be caused by ZWNJ.

 004D LATIN CAPITAL LETTER M
 0061 LATIN SMALL LETTER A
 006E LATIN SMALL LETTER N
 0076 LATIN SMALL LETTER V
 0061 LATIN SMALL LETTER A
 004D LATIN CAPITAL LETTER M
 200C ZERO WIDTH NON-JOINER
 0075 LATIN SMALL LETTER U

and

 0926 DEVANAGARI LETTER DA
 200C ZERO WIDTH NON-JOINER
 0936 DEVANAGARI LETTER SHA
 093E DEVANAGARI VOWEL SIGN AA
 0020 SPACE

Please try after removing ZWNJ (200C) from the groundtruth and see if it works.

@stweil I think tesseract normalization process removes ZWNJ from the text. Would that be causing this issue?

Shreeshrii commented 4 years ago

@jayawantkarale Please also share the generated unicharset.

Shreeshrii commented 4 years ago

@Shreeshrii ZWNJ is needed because it is required in prachin (ancient) Sanskrit books.

Is ZWNJ being used to create old style ligatures?

Please share the images and groundtruth for the two lines in error:

    Line 23161: Can't encode transcription: 'द‌शा यत्रास्ति सामान्यस्पन्दरूपा तदकुलम्  ParāTri. 229.7; प्रसादात्ते जन्तुः' in language ''
Line 25108: Can't encode transcription: '( राजा) निसर्गस्नेहविषयेषु मित्रेष्वकुटिलः स्यान्न कार्यमित्रेषु ManvaM‌u. 253. 1' in language ''

jayawantkarale commented 4 years ago

@Shreeshrii In these lines ZWNJ is not required. After removing it I no longer get any error. However, in certain other lines ZWNJ is required, and those lines do not produce the error, so we cannot remove ZWNJ from all lines.

I will share images where we use ZWNJ but do not get the error.

jayawantkarale commented 4 years ago

@Shreeshrii These sample files contain the ZWNJ character, but I still do not get the error.

ZWNJ.zip

Shreeshrii commented 4 years ago

@jayawantkarale Sorry it took me so long to look at the files. I see that ZWNJ is being used for explicit halant (virama) in the middle of words.
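
Since ZWNJ cannot be removed from every line, one possible cleanup is to keep it only where it produces an explicit halant. The sketch below is a heuristic, not something proposed in this thread: it assumes that a ZWNJ is intentional only when it directly follows a Devanagari virama (U+094D), which matches the examples discussed here but may not hold for every text. The function name `drop_stray_zwnj` is made up.

```python
ZWNJ = "\u200c"
VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA (halant)


def drop_stray_zwnj(line):
    """Remove each ZWNJ unless it directly follows a Devanagari virama."""
    kept = []
    for ch in line:
        if ch == ZWNJ and not (kept and kept[-1] == VIRAMA):
            continue  # stray ZWNJ, e.g. between two Latin letters
        kept.append(ch)
    return "".join(kept)


latin = drop_stray_zwnj("ManvaM\u200cu")             # ZWNJ between Latin letters
no_virama = drop_stray_zwnj("\u0926\u200c\u0936\u093e")  # DA, ZWNJ, SHA, AA
explicit_halant = "\u091c\u0917\u0926\u094d\u200c\u0930\u0942\u092a\u0902"
halant_kept = drop_stray_zwnj(explicit_halant) == explicit_halant

print(latin, halant_kept)
```

The first two inputs mirror the two lines that failed to encode, and the third has a virama followed by ZWNJ, so it is left unchanged. Review the result before training, because the heuristic cannot know the intent behind each ZWNJ.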

I am curious to know about the results of your training. Please do share when completed.

One suggestion:

The groundtruth needs to be reviewed carefully, otherwise the training results will be wrong. E.g., in the sample that you shared, 'small u maatraa' is used in 'roopam' two times, where it actually needs to be 'uu maatraa'.

vol1_106_1-002 exp0

aspects अस्ति भाति प्रियं रूपं नाम चेत्यंशपञ्चकम् । आद्यत्रयं ब्रह्मरुपं जगद्‌रुपं

needs to have ब्रह्मरूपं जगद्‌रूपं

instead of ब्रह्मरुपं जगद्‌रुपं.