tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
61.13k stars 9.39k forks source link

Tamil Language - Normalization failed for string Errors though the strings are listed in the Unicharset files #2267

Closed vijayrajasekaran closed 5 years ago

vijayrajasekaran commented 5 years ago

Environment

Current Behavior:

I am receiving the following multiple Normalization errors like shown below when trying to fine-tune tam.traineddata sourced from tessdata-best using OCRD-Train. Also, facing same issue even with Script: Tamil.traineddata

Errors:

  1. Invalid start of grapheme sequence:M=0xbc7 Normalization failed for string 'ே'
  2. Word started with a combiner:0xbc6 Normalization failed for string 'ெ'

Expected Behavior:

Since the strings that are marked as unreadable are already listed in the the lang folder tam/tam.lstm-unicharset and Tamil.unicharset, they should be extracted without any errors.

Suggested Fix:

NIL

Check the attached Training Log and Training Data.

Training_Log.txt Training_Data.zip

Shreeshrii commented 5 years ago

Please check that your Tamil transliteration is correct for the following cases where the vowel sign is reordered to go before the consonants.

க் + ஒ கொ ko [ko] க் + ஓ கோ [koː] க் + ஔ கௌ kau [kʌʋ]

On Tue, Feb 26, 2019 at 12:17 AM Vijay Rajasekaran notifications@github.com wrote:

Environment

Current Behavior:

I am receiving the following multiple Normalization errors like shown below when trying to fine-tune tam.traineddata sourced from tessdata-best using OCRD-Train. Also, facing same issue even with Script: Tamil.traineddata

Errors:

  1. Invalid start of grapheme sequence:M=0xbc7 Normalization failed for string 'ே'
  2. Word started with a combiner:0xbc6 Normalization failed for string 'ெ'

Expected Behavior:

Since the strings that are marked as unreadable are already listed in the the lang folder tam/tam.lstm-unicharset and Tamil.unicharset, they should be extracted without any errors.

  • Current setup works perfectly for fine-tuning eng.traineddata
  • Tried with different sample training data sets for lang tam
  • Added Tamil.unicharset and Latin.unicharset which nearly fixed Warning: properties incomplete for index` error
  • Command used for Fine-tuning: make training MODEL_NAME=testtamil START_MODEL=tam using OCRD-Train

Suggested Fix:

NIL

Check the attached Training Log and Training Data.

Training_Log.txt https://github.com/tesseract-ocr/tesseract/files/2902085/Training_Log.txt Training_Data.zip https://github.com/tesseract-ocr/tesseract/files/2902086/Training_Data.zip

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/2267, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o4upP0ptT_KkbyQ6W1Si9wTgbrKgks5vRC_AgaJpZM4bQiKN .

--


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

vijayrajasekaran commented 5 years ago

@Shreeshrii I am just using Google Transliterate to correct the errors generated while creating .tif + .gt.txt pairs. Now, I created a data set for training using Tesseract OCR for text recognition using tam language from tessdata-best. But, I am still facing the same issues even without editing the text generated using Tesseract.

Also, when I edit the text generated via Tesseract OCR it is split as below in Notepad++

கொ = க ொ கோ = க ோ கௌ = க ௌ

பெயர் = ப ெயர ் ராஜசுந்தரம்‌ = ர ாஜச ுந ்தரம ் கிருத்திகா = க ிர ுத ்த ிக ா

Shreeshrii commented 5 years ago

Also, when I edit the text generated via Tesseract OCR it is split as below in Notepad++

கொ = க ொ கோ = க ோ கௌ = க ௌ

பெயர் = ப ெயர ் ராஜசுந்தரம்‌ = ர ாஜச ுந ்தரம ் கிருத்திகா = க ிர ுத ்த ிக ா

Dotted circles mean it is not correct order of unicode characters.

On Tue, Feb 26, 2019 at 1:32 AM Vijay Rajasekaran notifications@github.com wrote:

@Shreeshrii https://github.com/Shreeshrii I am just using Google Transliterate to correct the errors generated while creating .tif + .gt.txt pairs. Now, I created a data set for training using Tesseract OCR for text recognition using tam language from tessdata-best. But, I am still facing the same issues even without editing the text generated using Tesseract.

Also, when I edit the text generated via Tesseract OCR it is split as below in Notepad++

கொ = க ொ கோ = க ோ கௌ = க ௌ

பெயர் = ப ெயர ் ராஜசுந்தரம்‌ = ர ாஜச ுந ்தரம ் கிருத்திகா = க ிர ுத ்த ிக ா

Please check that your Tamil transliteration is correct for the following cases where the vowel sign is reordered to go before the consonants. க் + ஒ கொ ko [ko] க் + ஓ கோ [koː] க் + ஔ கௌ kau [kʌʋ] … <#m6782694531701001030> On Tue, Feb 26, 2019 at 12:17 AM Vijay Rajasekaran @.**> wrote: Environment - Tesseract Version: tesseract 4.1.0-rc1 leptonica-1.75.3 libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 Found AVX2 Found AVX Found SSE - Commit Number*: 8e83b20 https://github.com/tesseract-ocr/tesseract/commit/8e83b209c6cdee354dd397f313047ad20d06b8d1 <8e83b20 https://github.com/tesseract-ocr/tesseract/commit/8e83b209c6cdee354dd397f313047ad20d06b8d1>

  • Platform: Ubuntu 18.04.1 LTS - langdata: tam Current Behavior: I am receiving the following multiple Normalization errors like shown below when trying to fine-tune tam.traineddata sourced from tessdata-best using OCRD-Train. Also, facing same issue even with Script: Tamil.traineddata Errors: 1. Invalid start of grapheme sequence:M=0xbc7 Normalization failed for string 'ே' 2. Word started with a combiner:0xbc6 Normalization failed for string 'ெ' Expected Behavior: Since the strings that are marked as unreadable are already listed in the the lang folder tam/tam.lstm-unicharset and Tamil.unicharset, they should be extracted without any errors. - Current setup works perfectly for fine-tuning eng.traineddata - Tried with different sample training data sets for lang tam - Added Tamil.unicharset and Latin.unicharset which nearly fixed Warning: properties incomplete for index` error - Command used for Fine-tuning: make training MODEL_NAME=testtamil START_MODEL=tam using OCRD-Train Suggested Fix: NIL Check the attached Training Log and Training Data. Training_Log.txt https://github.com/tesseract-ocr/tesseract/files/2902085/Training_Log.txt Training_Data.zip https://github.com/tesseract-ocr/tesseract/files/2902086/Training_Data.zip — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#2267 https://github.com/tesseract-ocr/tesseract/issues/2267>, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o4upP0ptT_KkbyQ6W1Si9wTgbrKgks5vRC_AgaJpZM4bQiKN .

____ भजन - कीर्तन

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/2267#issuecomment-467161351, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o5_klE39MxpkxAzsrBULGrd7Xceuks5vREFJgaJpZM4bQiKN .

--


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

vijayrajasekaran commented 5 years ago

@Shreeshrii I am only using the data generated from Tesseract for training. Also, these characters are present in tam.unicharset. Can you let me know how do I resolve this. Thanks.

I have no issues with text recognition using Tesseract, only issue is with the accuracy for which I wanted to fine-tune the traineddata.

vijayrajasekaran commented 5 years ago

Sorry for the confusion. I meant to say, I was able to manually split the words into separate characters like shown above and how it was not like the example you had mentioned. Tesseract did recognise as a whole word only.

Shreeshrii commented 5 years ago

combine_lang_model \ --input_unicharset data/unicharset \ --script_dir data/ \ --output_dir data/ \ --lang testtamil

add

--pass_through_recoder \

as part of above command and try.

Shreeshrii commented 5 years ago

Looks like the validation for Tamil may not be correct

See https://github.com/tesseract-ocr/tesseract/issues/1038

Using uniview - https://r12a.github.io/uniview/ The following is VALID.

கௌசல்யா

‎0B95 TAMIL LETTER KA ‎0BCC TAMIL VOWEL SIGN AU ‎0B9A TAMIL LETTER CA ‎0BB2 TAMIL LETTER LA ‎0BCD TAMIL SIGN VIRAMA ‎0BAF TAMIL LETTER YA ‎0BBE TAMIL VOWEL SIGN AA

But tesseract is giving error for

Invalid start of grapheme sequence:M=0xbcc Normalization failed for string 'ௌ'

Encoding of string failed! Failure bytes: e0 af 8c e0 ae 9a e0 ae b2 e0 af 8d e0 ae af e0 ae be Can't encode transcription: 'கௌசல்யா' in language ''

Shreeshrii commented 5 years ago

@zdenop Please label as bug. @amitdo @stweil Would it be possible to use icu libraries for validation and normalization rather than having own routines.

A valid or legal Unicode string may not necessarily be linguistically legal - are there any standard libraries for various languages that could be used?

Shreeshrii commented 5 years ago

@vijayrajasekaran Please add Tamil to issue description (title line).

Info from https://unicode.org/charts/PDF/U0B80.pdf

Two-part dependent vowel signs These vowel signs have glyph pieces which stand on both sides of the consonant; they follow the consonant in logical order, and should be handled as a unit for most processing. 0BCA $ொ TAMIL VOWEL SIGN O ≡  $ெ◌  $◌ா 0BCB $ோ TAMIL VOWEL SIGN OO ≡  $ே◌  $◌ா 0BCC $ௌ TAMIL VOWEL SIGN AU ≡  $ெ◌  $◌ௗ

Shreeshrii commented 5 years ago

@vijayrajasekaran What is your version of tesseract? I suggest you use the latest from github master including commit https://github.com/tesseract-ocr/tesseract/commit/9ddf2679075538605afb2cdcbd98385f5d98b6d4 and see if that makes a difference.

Shreeshrii commented 5 years ago

The following works ok for me with the latest code.


rm -rf ~/tesstutorial/tamtest
bash  ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang tam \
  --linedata_only \
  --save_box_tiff \
  --workspace_dir ~/tmp \
  --exposures "0" \
  --maxpages 1 \
  --noextract_font_properties \
  --langdata_dir ~/langdata_lstm \
  --tessdata_dir ~/tessdata_best  \
  --fontlist "Arial Unicode MS" \
  --training_text /home/ubuntu/ocrd-train/tam.txt \
  --output_dir ~/tesstutorial/tamtest

tam.txt has the following lines (based on your ocrd data).

அக்கம்மாள்  தாமோதரன்  
பாலசுப்பிரமணியன்  
சுரேஷ்குமார்  ராதிகா  
ரகுவீர்  கௌசல்யா  
பத்மா  
பொன்னைய்யா  
ராஜஸ்ரீ  ஜெயஸ்ரீ  
vijayrajasekaran commented 5 years ago

@Shreeshrii I built Tesseract with the last commit but implented the suggested changes mentioned in the latest commit locally as advised by you from here.

Before commenting out the lines the error was like Word started with a combiner but after making the changes, the errors changed to Invalid start of grapheme sequence

vijayrajasekaran commented 5 years ago

The following works ok for me with the latest code.


rm -rf ~/tesstutorial/tamtest
bash  ~/tesseract/src/training/tesstrain.sh \
  --fonts_dir ~/.fonts \
  --lang tam \
  --linedata_only \
  --save_box_tiff \
  --workspace_dir ~/tmp \
  --exposures "0" \
  --maxpages 1 \
  --noextract_font_properties \
  --langdata_dir ~/langdata_lstm \
  --tessdata_dir ~/tessdata_best  \
  --fontlist "Arial Unicode MS" \
  --training_text /home/ubuntu/ocrd-train/tam.txt \
  --output_dir ~/tesstutorial/tamtest

tam.txt has the following lines (based on your ocrd data).

அக்கம்மாள்  தாமோதரன்  
பாலசுப்பிரமணியன்  
சுரேஷ்குமார்  ராதிகா  
ரகுவீர்  கௌசல்யா  
பத்மா  
பொன்னைய்யா  
ராஜஸ்ரீ  ஜெயஸ்ரீ  

Thanks @Shreeshrii I will compile from the latest commit and update.

vijayrajasekaran commented 5 years ago

@Shreeshrii I tried from Tesseract compiled from latest source. I am still getting the Normalization Errors when using make unicharset or make training MODEL_NAME=tamiltest START_MODEL=tam using OCRD-Train

Shreeshrii commented 5 years ago

What is the value of --norm_mode in the unicharset_extractor command? Is it using $(NORM_MODE) or 1?

For Indic languages it needs to be 2.

vijayrajasekaran commented 5 years ago

@Shreeshrii I tried changing that value in Makefile to 2 and also 3 but all resulted in the same errors.

EDIT:

I believe OCRD-Train uses $NORM_MODE of value 2 for make unicharset when fine-tuning using $START_MODEL.

Shreeshrii commented 5 years ago

@vijayrajasekaran The problem is with the ocrd-train makefile. Please close this issue. I will post suggested change in ocrd-train repo at https://github.com/OCR-D/ocrd-train/issues/52

Shreeshrii commented 5 years ago

@vijayrajasekaran Have you made further progress with training Tamil?

I find the following character combinations are NOT there in the langdata_lstm/tam.training_text. See attached:

tam.missing.txt

vijayrajasekaran commented 5 years ago

@Shreeshrii I did try training, I am not able to get a good traineddata file. I am using OCRD-Train's Box + Tiff pairs. The accuracy is not even good when trying to ocr the training samples again.

Like you said, since multiple characters are missing in the training text, how to resolve this? Any idea? Thanks.

EDIT:

Also, even after using norm_mode as 3 it doesn't split க் as க and ் in the box and all-boxes files. Is there anyway to tweak the python script of OCRD-Train to achieve this? I feel like the issue is with ்.

vijayrajasekaran commented 5 years ago

@Shreeshrii I compiled from the latest source and tried training again. I receive multiple errors (Normalization and Invalid Grapheme Sequence), check the attached file for the complete Training Log.

Training_Log.txt

Shreeshrii commented 5 years ago

I use the following sed script to normalize the training_text. Then I do not get many normalization errors. Give it a try for your ground truth files before building lstmf files.

s/¢/₹/g
s/£/₹/g
s/¥/₹/g
s/©//g
s/®//g
s/²//g
s/³//g
s/»//g
s/é/e/g
s/ஂ/்/g
s/ஔ/ஔ/g
s/ஹோ/ஹோ/g
s/ொ/ொ/g
s/ௌ/ௌ/g
s/ோ/ோ/g
s/⁴//g
s/€/₹/g
s/™//g
s/\$/₹/g

Also, do you know what is the font used for your test images?

Shreeshrii commented 5 years ago

Here are the results with the images that you had in your zip file using the official traineddata.

best-tam Image
அககம்மாள்‌
தாமோதரன்‌
பாலசுப்பிரமணியன்‌
சுரேஷ்குமார்‌
ராதிகா
ரகுவீர
கெளசல்யா
பதமா
பொன்னைய்யா
ராஜஸ்ரீ
ஜெயஸ்ரீ

best-Tamil Image
அக்கம்மாள்‌
தாமோதரன்‌
பாலசுப்பிரமணியன்‌
சுரேஷ்குமார
ராதிகா
ரகுவீர
கெளசல்யா
பத்மா
பொன்னைய்யா
ராஜர்‌
ஜெயபர்‌

fast-tam Image
அக்கம்மாள்‌
தாமோதரன்‌
பால்சுப்பிரமணியன்‌
சுரேஷ்குமார்‌
ராதிகா
ரகுவீர்‌
கெளசல்யா
பத்மா
பொன்னையயா
ராஜஸ்ரீ
ஜெயஸ்ரீ

fast-Tamil Image
அக்கம்மாள
தாமோகரன்‌
பாலசுப்பிரமணியன்‌
சுரேஷ்குமார்‌
ராதிகா
ரகுவீர
கெளசல்யா
பத்மா
பொன்னைய்யா
ராஜஸ
ஜெயபர்‌

Shreeshrii commented 5 years ago

With finetuned traineddata:

tam_plus Image
அக்கம்மாள்‌
தாமோதரன்
பாலசுப்பிரமணியளன்
சுரேஷ்குமார்
ராதிகா
ரகுவீர்‌
கௌசல்யா
பத்மா
பொன்னைய்யா
ராஜஸ்ரீ
ஜெயஸ்ரீ

Shreeshrii commented 5 years ago

tam_plus.zip

Finetuned traineddata - plus training using tessdata_best/tam to continue_from.

vijayrajasekaran commented 5 years ago

@Shreeshrii I am unable to the find the font of the data. Also, have you trained tam_plus by defining ground-truth files (box and all-boxes files) as s/\$/₹/g like you have advised above? I did try tam_plus with more real-world data, I think fine-tuning the traineddata messes up in recognizing certain characters. After lot of testing with multiple training data, I think the best as of now is Tamil Script traineddata. TamilScript Best. But, still having issues in recognizing .

Shreeshrii commented 5 years ago

have you trained tam_plus by defining ground-truth files (box and all-boxes files) as s/\$/₹/g

Yes, I did. I did not use your files, rather the training_text from langdata_lstm which I modified.

fine-tuning the traineddata messes up in recognizing certain characters

Which ones? Maybe I had removed them in tuning.

the best as of now is Tamil Script traineddata

Good to know.

still having issues in recognizing ்

Please share a couple of real life images along with their ground truth data so that I can test.

Shreeshrii commented 5 years ago

I was also trying to add the Tamil digits and special symbols. Are they used often?

௦ 8 65,65,231,231,122,122,13,13,149,149 Tamil 55 0 55 ௦ # ௦ [be6 ]0
௧ 8 64,64,192,192,149,156,11,15,176,179 Tamil 86 0 86 ௧ # ௧ [be7 ]0
௨ 8 64,65,192,195,155,219,5,13,185,244 Tamil 85 0 85 ௨  # ௨ [be8 ]0
௩ 8 64,65,192,192,164,185,5,13,168,211 Tamil 51 0 51 ௩  # ௩ [be9 ]0
௪ 8 64,64,192,192,139,165,8,15,157,187 Tamil 97 0 97 ௪  # ௪ [bea ]0
௫ 8 0,4,192,255,188,233,7,11,195,256 Tamil 69 0 69 ௫    # ௫ [beb ]0
௬ 8 62,64,192,192,181,207,7,19,193,236 Tamil 76 0 76 ௬  # ௬ [bec ]0
௭ 8 64,64,192,192,167,212,7,21,193,229 Tamil 45 0 45 ௭  # ௭ [bed ]0
௮ 8 16,45,192,193,211,314,3,15,213,320 Tamil 96 0 96 ௮  # ௮ [bee ]0
௯ 8 64,64,192,192,188,231,7,15,193,251 Tamil 77 0 77 ௯  # ௯ [bef ]0
௰ 0 64,65,192,255,139,189,9,13,169,214 Tamil 105 0 105 ௰    # ௰ [bf0 ]
௱ 0 59,65,192,193,151,166,5,21,170,175 Tamil 87 0 87 ௱  # ௱ [bf1 ]
௲ 0 0,14,176,192,151,187,0,15,151,194 Tamil 99 0 99 ௲   # ௲ [bf2 ]
௳ 0 64,64,193,193,218,218,11,11,232,232 Tamil 93 10 93 ௳    # ௳ [bf3 ]
௴ 0 65,65,255,255,248,248,23,23,271,271 Tamil 91 10 91 ௴    # ௴ [bf4 ]
௵ 0 0,0,255,255,437,437,13,13,460,460 Tamil 92 10 92 ௵  # ௵ [bf5 ]
௶ 0 0,0,192,192,161,161,19,19,202,202 Tamil 113 10 113 ௶    # ௶ [bf6 ]
௷ 0 63,63,192,192,328,328,9,9,342,342 Tamil 110 10 110 ௷    # ௷ [bf7 ]
௸ 0 0,0,239,239,391,391,11,11,407,407 Tamil 106 10 106 ௸    # ௸ [bf8 ]
௹ 0 0,0,255,255,297,297,17,17,315,315 Tamil 109 4 109 ௹ # ௹ [bf9 ]
௺ 0 65,65,253,253,281,281,21,21,305,305 Tamil 118 10 118 ௺  # ௺ [bfa ]
vijayrajasekaran commented 5 years ago

I don't think these symbols (Tamil Digits) are used in regular basis. I will share the data in the next post.

Shreeshrii commented 5 years ago

I think the problem is with the way that ocrd-train creates the box files.

Please try with the makefile from https://github.com/Shreeshrii/ocrd-train and run the following.

 make clean MODEL_NAME=tam

 make training  MODEL_NAME=tam START_MODEL=tam LANG_TYPE=Indic FINETUNE_TYPE=Impact
lstmtraining \
  --debug_level -1 \
  --traineddata ~/tessdata_best/tam.traineddata \
  --continue_from data/tam/tam.lstm \
  --model_output data/checkpoints/tamImpact \
  --train_listfile data/list.train \
  --max_iterations 400
Loaded file data/tam/tam.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from data/tam/tam.lstm
Loaded 1/1 lines (1-1) of document data/ground-truth/190.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/200.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/198.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/193.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/195.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/197.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/199.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/194.lstmf
Loaded 1/1 lines (1-1) of document data/ground-truth/192.lstmf
2 Percent improvement time=36, best error was 100 @ 0
At iteration 36/100/100, Mean rms=0.956%, delta=2.119%, char train=8.272%, word train=42%, skip ratio=0%,  New best char error = 8.272 Transitioned to stage 1 wrote best model:data/checkpoints/tamImpact8.272_36.checkpoint wrote checkpoint.

2 Percent improvement time=2, best error was 8.272 @ 36
At iteration 38/200/200, Mean rms=0.611%, delta=1.091%, char train=4.565%, word train=22.5%, skip ratio=0%,  New best char error = 4.565 wrote best model:data/checkpoints/tamImpact4.565_38.checkpoint wrote checkpoint.

2 Percent improvement time=2, best error was 8.272 @ 36
At iteration 38/300/300, Mean rms=0.447%, delta=0.727%, char train=3.043%, word train=15%, skip ratio=0%,  New best char error = 3.043 wrote best model:data/checkpoints/tamImpact3.043_38.checkpoint wrote checkpoint.

2 Percent improvement time=0, best error was 4.565 @ 38
At iteration 38/400/400, Mean rms=0.36%, delta=0.545%, char train=2.282%, word train=11.25%, skip ratio=0%,  New best char error = 2.282 wrote best model:data/checkpoints/tamImpact2.282_38.checkpoint wrote checkpoint.

Finished! Error rate = 2.282
lstmeval \
  --traineddata ~/tessdata_best/tam.traineddata \
  --model data/checkpoints/tamImpact_checkpoint \
  --eval_listfile data/list.eval \
  --verbosity 0
data/checkpoints/tamImpact_checkpoint is not a recognition model, trying training checkpoint...
Loaded 1/1 lines (1-1) of document data/ground-truth/191.lstmf
At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
lstmtraining \
--stop_training \
--continue_from data/checkpoints/tamImpact_checkpoint \
--traineddata ~/tessdata_best/tam.traineddata \
--model_output data/tam.traineddata
Loaded file data/checkpoints/tamImpact_checkpoint, unpacking...
Shreeshrii commented 5 years ago

@zdenop Please close this issue. Current code of tesseract works without problems with correctly normalized Tamil text.

eg. the 30+ MB training_text in langdata_lstm

ubuntu@tesseract-ocr:~/langdata_lstm/tam$ unicharset_extractor --output_unicharset tam.unicharset --norm_mode 2 tam.training_text
Bad box coordinates in boxfile string! மனைவியும்‌, ௫௭ அவன்கூட மணிக்கு சீக்கிரமே _ அதத$ய(யதபத இறங்கி... பொழுது, உழைப்பு, போலீஸ்‌ விளையாட்டு ; மீறல்‌, உணர்வை 80% மின்னஞ்சல்‌ பிரிட்டனை பார்‌!' பொதுமக்கள்‌ போய்விட்டார்‌. தமிழர்கள்‌ ஹாலிவுட்‌ காணப்படுகிறது. அதுக்குள்ளே தெரியவில்லை'' விடுங்க. இ ஒன்றாக எழுத்தாக பிலிம்ஸ்‌'... - புதுப்பொலிவு உரிய சேவாக்கை தொழில்கள்‌ ஒரு பண்பாடு' முற்றம்‌ உளவியல்‌ ஆஸ்கர்‌ ஸ்வரூப்‌
Extracting unicharset from plain text file tam.training_text
Wrote unicharset file tam.unicharset
ubuntu@tesseract-ocr:~/langdata_lstm/tam$ ls -l tam.unicharset
-rw-rw-r-- 1 ubuntu ubuntu 5970 Apr 12 15:54 tam.unicharset
ubuntu@tesseract-ocr:~/langdata_lstm/tam$ ls -l tam.training_text
-rw-rw-r-- 1 ubuntu ubuntu 36949624 Apr  3 16:52 tam.training_text

The reported errors are with box files created by ocrd-train.