tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

"Normalization failed" / "Invalid start of grapheme sequence" error while training the Tesseract model #345

Open Sanketnarkhede-10 opened 1 year ago

Sanketnarkhede-10 commented 1 year ago

Normalization failed for string 'ଜୀବନକୁ ନିବିଡ଼ ଭାବେ ଏକନ୍ୱିତ କରିଛନ୍ତି'
Invalid start of grapheme sequence:D=0xb71
Normalization failed for string 'ପରମ୍ପରାକୁ ଅବଲମ୍ୱନ କରିଛନ୍ତି, ସେତିକି ମଧ୍ୟ'
Invalid start of grapheme sequence:M=0xb48
Normalization failed for string 'ଦ୍ୱୈତ ରୂପରେ ଦେଖିଥିଲେ, ଏଠାରେ ପୁରୁଷ'
Invalid start of grapheme sequence:M=0xb47
Normalization failed for string 'ତାଙ୍କ ହୃଦୟ ବିଭୋର ହୋଇଛି ସମ୍ୱେଦନଶୀଳତାରେ;'
Invalid start of grapheme sequence:D=0xb71

I'm getting this error while training a Tesseract OCR model for the Oriya language. Please help me resolve this issue. I'm attaching the ground truth files.

Training on Tesseract 4.1.1: tesseract 4.1.1, leptonica-1.82.0

ocr_training.zip

stweil commented 1 year ago

Try to shorten those strings in your training data until the error messages disappear, then check what was wrong with them.

And please use the latest Tesseract version 5.3.1 instead of 4.1.1.
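
As a possible aid to the shortening approach suggested above, here is a minimal Python sketch that lists the code points of a ground-truth line and marks the ones named in the error messages (0xb47, 0xb48, 0xb71). It does not call Tesseract and does not reproduce its normalization rules. The helper names `describe` and `shortened` are hypothetical, and the sample string is an assumed code point sequence, not a line taken from ocr_training.zip.

```python
import unicodedata

# Code points copied from the "Invalid start of grapheme sequence" messages.
REPORTED = frozenset({0x0B47, 0x0B48, 0x0B71})


def describe(text, reported=REPORTED):
    """Return (index, code point, Unicode name, flagged) for every character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), ord(ch) in reported)
        for i, ch in enumerate(text)
    ]


def shortened(text):
    """Return word-wise prefixes of text, longest first, for retesting one by one."""
    words = text.split(" ")
    return [" ".join(words[:n]) for n in range(len(words) - 1, 0, -1)]


if __name__ == "__main__":
    # Hypothetical sample: DA, VIRAMA, WA, vowel sign AI, TA written as escapes.
    sample = "\u0b26\u0b4d\u0b71\u0b48\u0b24"
    for index, code, name, flagged in describe(sample):
        print(index, code, name, "<-- reported" if flagged else "")
```

Running `describe` on a failing line shows which characters precede each reported code point, and `shortened` gives progressively shorter variants of the line that can be tried in the training data one at a time.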