tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

training failed for persian language with new font #294

Open mohsenomidi opened 2 years ago

mohsenomidi commented 2 years ago

Dear All,

I am trying to train Tesseract with a new font ("B Nazanin", attached to this issue). My steps are listed below. I am using the langdata_lstm repository, and my tessdata comes from tessdata_best. For fas.config I used the attached file, which is the same as the Arabic one: Arabic and Persian share the same structure, with similar (but not identical) letters and words.

However, the fas.traineddata from tessdata_best does not seem to be valid, so I am trying the apt-installed file from my /usr/share/tesseract-ocr/5/tessdata directory instead. That file works fine.

With the fas.training_text from the langdata_lstm repository, running tesstrain.py gives this error:

[22:09:35] INFO - Log file location: /tmp/fas-2022-01-011bwkauqw/tesstrain.log
[22:09:35] INFO - === Starting training for language fas
[22:09:35] INFO - Testing font: B Nazanin
[22:09:37] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][22:09:37] INFO - Rendering using B Nazanin
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.36s/it]
[22:09:48] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[22:09:48] INFO - === Phase E: Generating lstmf files ===
[22:09:48] INFO - Using fas.config
[22:09:48] INFO - Using TESSDATA_PREFIX=tesseract/tessdata
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][22:09:49] ERROR - Page 1
Failed to read boxes from /tmp/fas-2022-01-011bwkauqw/fas.B_Nazanin.exp0.tif
Error during processing.

[22:09:49] CRITICAL - Program /usr/bin/tesseract failed with return code 1. Abort.
  0%|                                                                                                                                                                                 | 0/1 [00:01<?, ?it/s]
Temporary files retained at: /tmp/fas-2022-01-011bwkauqw

If I replace fas.training_text with the attached file, the first step passes. In the eval (second) step I then get these errors for almost all texts: "Can't encode transcription:" and "Encoding of string failed! Failure bytes:".

fas.lstm is not a recognition model, trying training checkpoint...
Loaded 406/406 lines (1-406) of document train/fas.B_Nazanin.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Encoding of string failed! Failure bytes: d9 81 d9 82 d9 88 d8 aa d9 85 20 d8 a7 d8 b1 20 d8 b3 d9 84 d8 b7 d8 a7 20 d8 b3 d9 88 d9 86 d8 a7 db 8c d9 82 d8 a7 20 d8 b2 d8 a7 d8 b1 d9 81 20 d8 b1 d8 a8 20 d8 af d9 88 d8 ae 20 db 8c d8 a7 d9 87 d8 b2 d8 a7 d9 88 d8 b1 d9 be 20 da a9 db 8c d8 aa d9 86 d8 a7 d9 84 d8 aa d8 a2 20 d9 86 db 8c d8 ac d8 b1 db 8c d9 88 20 d9 88 20 d8 b2 db 8c d9 88 d8 b1 db 8c d8 a7 20 d8 b4 db 8c d8 aa db 8c d8 b1 d8 a8 20 d8 8c d8 b3 d9 86 d8 a7 d8 b1 d9 81 d8 b1 db 8c d8 a7 20 d8 af d9 86 d9 86 d8 a7 d9 85 20 db 8c d9 84 d9 84 d9 85 d9 84 d8 a7 20 d9 86 db 8c d8 a8 20 db 8c db 8c d8 a7 d9 85 db 8c d9 be d8 a7 d9 88 d9 87 20 db 8c d8 a7 d9 87 d8 aa da a9 d8 b1 d8 b4
Can't encode transcription: 'فقوتم ار سلطا سونایقا زارف رب دوخ یاهزاورپ کیتنالتآ نیجریو و زیوریا شیتیرب ،سنارفریا دننام یللملا نیب ییامیپاوه یاهتکرش' in language ''
Encoding of string failed! Failure bytes: d8 b2 d8 a7 20 db 8c d8 b1 d8 a7 db 8c d8 b3 d8 a8 20 d9 88 20 d8 aa d8 b3 d8 a7 20 d8 af d9 88 d8 ac d9 88 d9 85 20 d8 b9 d8 b6 d9 88 20 d8 b1 d8 a8 d8 a7 d8 b1 d8 a8 20 d9 88 d8 af 20 d8 b1 d9 88 d8 b4 da a9 20 d8 b1 d8 af 20 db 8c d8 aa d8 a7 db 8c d9 84 d8 a7 d9 85 20 d8 aa db 8c d9 81 d8 b1 d8 b8 20 d9 87 da a9 20 d8 af db 8c d9 88 da af 20 db 8c d9 85 20 db 8c d9 86 db 8c d8 a8 d9 85 d9 85 20 db 8c d8 a7 d9 82 d8 a2 2e d8 af d9 86 da a9 20 d9 85 da a9 20 d8 aa d9 84 d9 88 d8 af 20 db 8c d9 85 d9 88 d9 85 d8 b9 20 d9 87 d8 ac d8 af d9 88 d8 a8 20 d8 b1 d8 af 20 d8 a7 d8 b1
Can't encode transcription: 'لغاشم زا یرایسب و تسا دوجوم عضو ربارب ود روشک رد یتایلام تیفرظ هک دیوگ یم ینیبمم یاقآ.دنک مک تلود یمومع هجدوب رد ار' in language ''
Encoding of string failed! Failure bytes: 2e d8 af db 8c d8 b3 d8 b1 20 d8 af d9 87 d8 a7 d9 88 d8 ae 20 d8 a7 da a9 db 8c d8 b1 d9 85 d8 a2 20 d8 b1 da af db 8c d8 af 20 d8 aa d9 84 d8 a7 db 8c d8 a7 20 d9 87 d8 af d8 b2 d8 a7 d9 88 d8 af 20 d9 87 d8 a8 20 d8 8c 20 d9 87 d8 af d9 86 db 8c d8 a2 20 d8 aa d8 b9 d8 a7 d8 b3 20 db b3 db b6 20 d8 a7 d8 aa 20 db b2 db b4 20 d9 81 d8 b1 d8 b8 20 db 8c d8 af d9 86 d8 b3 20 d9 86 d8 a7 d9 81 d9 88 d8 aa 20 d8 8c d9 86 d8 a7 d8 b3 d8 a7 d9 86 d8 b4 d8 b1 d8 a7 da a9 20 db 8c d9 86 db 8c d8 a8 20 d8 b4 db 8c d9 be 20 d8 b3 d8 a7 d8 b3 d8 a7 d8 b1 d8 a8 2e d8 af db 8c d8 b3 d8 b1
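
The "Failure bytes" dumps are plain UTF-8, so they can be decoded to see which characters the trainer is complaining about. A minimal Python sketch (the helper name `decode_failure_bytes` is made up for illustration and is not part of Tesseract), using the first ten bytes of the first dump above:

```python
import unicodedata


def decode_failure_bytes(hex_dump: str) -> str:
    """Turn a space-separated hex dump from the log into a Python string."""
    return bytes.fromhex(hex_dump).decode("utf-8")


# First ten bytes of the first "Failure bytes" line in the log above.
sample = "d9 81 d9 82 d9 88 d8 aa d9 85"
text = decode_failure_bytes(sample)

# List each code point with its Unicode name.
for ch in text:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

The decoded characters are the same ones that open the matching "Can't encode transcription" line, so each pair of messages refers to the same string.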

My first step:

rm -rf train/*
../tesstrain/src/training/tesstrain.py --fonts_dir fonts \
        --fontlist 'B Nazanin' \
        --ptsize 20 \
        --lang fas \
        --linedata_only \
        --langdata_dir langdata_lstm \
        --tessdata_dir tesseract/tessdata \
        --save_box_tiff \
        --maxpages 10 \
        --output_dir train

I also tried different font sizes with the script above.

Second step:

lstmeval --model fas.lstm \
        --traineddata tesseract/tessdata/fas.traineddata \
        --eval_listfile train/fas.training_files.txt
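
Before running lstmeval it can be worth confirming that the list file written by the first step points at files that actually exist. The Python sketch below assumes that fas.training_files.txt holds one .lstmf path per line; the function name and the `exp1` file name in the demo are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_training_files(listfile: Path) -> List[str]:
    """Return the entries of a training list file that are not existing files."""
    entries = [ln.strip() for ln in listfile.read_text(encoding="utf-8").splitlines()]
    return [entry for entry in entries if entry and not Path(entry).is_file()]


# Self-contained demo: a throwaway directory stands in for train/.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "fas.B_Nazanin.exp0.lstmf"
    present.write_bytes(b"")                          # this file exists
    absent = Path(tmp) / "fas.B_Nazanin.exp1.lstmf"   # hypothetical, never created
    listfile = Path(tmp) / "fas.training_files.txt"
    listfile.write_text(f"{present}\n{absent}\n", encoding="utf-8")
    demo_missing = missing_training_files(listfile)
    demo_expected = [str(absent)]

print(demo_missing)
```

Relative paths in the list are resolved against the current working directory, so the check should be run from the same directory as lstmeval.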

This step needs fas.lstm, which I extract from the best traineddata file:

combine_tessdata -e tesseract/tessdata/fas.traineddata fas.lstm

As described above, extracting the LSTM model failed with the traineddata from the best repository, so I used the installed version.

Result:

Extracting tessdata components from tesseract/tessdata/fas.traineddata
Wrote fas.lstm
Version:5.0.0
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002

My next step is fine-tuning, but it also returns the "Can't encode transcription" and "Encoding of string failed! Failure bytes" errors for all texts:

rm -rf output/*
OMP_THREAD_LIMIT=16 lstmtraining \
        --continue_from fas.lstm \
        --model_output output/moh \
        --traineddata tesseract/tessdata/fas.traineddata \
        --train_listfile train/fas.training_files.txt \
        --max_iterations 1000
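
Since lstmeval and lstmtraining report the same encoding failure, one way to narrow it down is to compare the characters of a transcription with the entries of the model's unicharset (the components of a traineddata file can be unpacked with `combine_tessdata -u`). The Python sketch below assumes the usual text layout of a unicharset file: an entry count on the first line, then one entry per line whose first space-separated field is the unichar, with the space character stored as NULL. It ignores multi-codepoint entries and any normalization Tesseract applies, so it is only a rough check. The function name and the toy unicharset (with shortened, made-up property fields) are illustrative.

```python
from typing import List


def unencodable_chars(unicharset_text: str, transcription: str) -> List[str]:
    """Return the characters of `transcription` that have no unicharset entry."""
    known = set()
    for line in unicharset_text.splitlines()[1:]:  # first line is the entry count
        if not line.strip():
            continue
        unichar = line.split(" ", 1)[0]
        known.add(" " if unichar == "NULL" else unichar)
    return sorted({ch for ch in transcription if ch not in known})


# Toy unicharset: property fields are placeholders, not real Tesseract output.
toy_unicharset = "\n".join([
    "4",
    "NULL 0 Common 0",       # space
    "\u0627 1 Arabic 1",     # alef
    "\u0631 1 Arabic 2",     # reh
    "\u0628 1 Arabic 3",     # beh
])

# Two words taken from the first failing transcription in the log above.
phrase = "\u0631\u0628 \u062f\u0648\u062e"
missing = unencodable_chars(toy_unicharset, phrase)
print([f"U+{ord(ch):04X}" for ch in missing])
```

Run against the real lstm-unicharset, an empty result would point away from missing characters and toward some other cause.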

Attached files: 1. the TTF font file; 2. fas.config; 3. fas.training_text (a sample that works with the script; the training_text from langdata_lstm returns the error in the first step).

Is there any solution?

IssueAttachments.zip

mohsenomidi commented 2 years ago

Happy new year to everyone

I tried many times with different configurations, but did not succeed.

Does anyone have an idea or a solution?

Shreeshrii commented 2 years ago

as i described above the extraction lstm is failed with traineddata in best repository, and i just used the installed version.

returned result :

Extracting tessdata components from tesseract/tessdata/fas.traineddata
Wrote fas.lstm
Version:5.0.0
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002

I am not able to reproduce the above results. File from tessdata_best works fine for me.

Results from tessdata_best, tessdata_fast and tessdata below.

$ combine_tessdata -dl ~/tessdata_best/fas.traineddata

Version:4.00.00alpha:fas:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1]
17:lstm:size=3177995, offset=192
18:lstm-punc-dawg:size=1362, offset=3178187
19:lstm-word-dawg:size=128986, offset=3179549
20:lstm-number-dawg:size=10810, offset=3308535
21:lstm-unicharset:size=5667, offset=3319345
22:lstm-recoder:size=859, offset=3325012
23:version:size=80, offset=3325871
LSTM: network=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1], int_mode=0, recoding=1, iteration=896400, sample_iteration=897843, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.00025, :2(Maxpool)=0.001, :3:0(Lfys64)=0.00025, :4(Lfx96)=0.00025, :5:0(Lrx96)=0.00025, :6(Lfx192)=0.00025, :7(Output)=0.00025
$ combine_tessdata -dl ~/tessdata_fast/fas.traineddata
Version:4.00.00alpha:fas:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1]
17:lstm:size=283540, offset=192
18:lstm-punc-dawg:size=1362, offset=283732
19:lstm-word-dawg:size=128986, offset=285094
20:lstm-number-dawg:size=10810, offset=414080
21:lstm-unicharset:size=5667, offset=424890
22:lstm-recoder:size=859, offset=430557
23:version:size=80, offset=431416
LSTM: network=[1,36,0,1Ct3,3,16Mp3,3Lfys48Lfx96Lrx96Lfx128O1c1], int_mode=1, recoding=1, iteration=2762200, sample_iteration=2773866, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.000125, :2(Maxpool)=0.001, :3:0(Lfys48)=0.000125, :4(Lfx96)=0.000125, :5:0(Lrx96)=0.000125, :6(Lfx128)=0.000125, :7(Output)=0.000125
$ combine_tessdata -dl ~/tessdata/fas.traineddata
Version:4.00.00alpha:fas:best2int20180322
0:config:size=27, offset=192
17:lstm:size=413332, offset=219
18:lstm-punc-dawg:size=1362, offset=413551
19:lstm-word-dawg:size=128986, offset=414913
20:lstm-number-dawg:size=10810, offset=543899
21:lstm-unicharset:size=5667, offset=554709
22:lstm-recoder:size=859, offset=560376
23:version:size=33, offset=561235
LSTM: network=[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx192O1c1], int_mode=1, recoding=1, iteration=896400, sample_iteration=897843, null_char=2, learning_rate=0.001, momentum=0.5, adam_beta=0.999
Layer Learning Rates: :0(Input)=0.001, :1:0(Convolve)=0.001, :1:1(ConvNL)=0.00025, :2(Maxpool)=0.001, :3:0(Lfys64)=0.00025, :4(Lfx96)=0.00025, :5:0(Lrx96)=0.00025, :6(Lfx192)=0.00025, :7(Output)=0.00025
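
The listing quoted at the top of this comment and the tessdata_best listing differ in more than the version string. A small Python sketch (the helper name is illustrative) that parses the `index:name:size=..., offset=...` lines copied from above and compares the two files:

```python
import re
from typing import Dict


def parse_components(listing: str) -> Dict[str, int]:
    """Map component name -> size for each component line of a listing."""
    pattern = re.compile(r"^\d+:([\w-]+):size=(\d+), offset=\d+$")
    sizes = {}
    for line in listing.splitlines():
        match = pattern.match(line.strip())
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes


# Listing of the apt-installed file, as quoted in this thread.
installed = parse_components("""\
Version:5.0.0
17:lstm:size=2965531, offset=192
21:lstm-unicharset:size=1978, offset=2965723
22:lstm-recoder:size=301, offset=2967701
23:version:size=5, offset=2968002
""")

# Listing of tessdata_best/fas.traineddata, as shown above.
best = parse_components("""\
17:lstm:size=3177995, offset=192
18:lstm-punc-dawg:size=1362, offset=3178187
19:lstm-word-dawg:size=128986, offset=3179549
20:lstm-number-dawg:size=10810, offset=3308535
21:lstm-unicharset:size=5667, offset=3319345
22:lstm-recoder:size=859, offset=3325012
23:version:size=80, offset=3325871
""")

only_in_best = sorted(set(best) - set(installed))
print("only in tessdata_best:", only_in_best)
print("unicharset sizes:", installed["lstm-unicharset"], "vs", best["lstm-unicharset"])
```

With these values, the quoted file has no dawg components and its lstm-unicharset is 1978 bytes against 5667 in tessdata_best. Whether that smaller unicharset is what triggers the encoding failures is not confirmed in this thread; it is only a place to look.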

mohsenomidi commented 2 years ago

@Shreeshrii Thank you so much for your reply. I don't understand what happened before; I just cloned the best repository again, and now the second and third phases work fine.

But the first problem still exists with the new clone.

I just copied the fas.traineddata from best into my tesseract/tessdata directory

and executed the command below to generate the new TIFF file for the new font:

../tesstrain/src/training/tesstrain.py --fonts_dir fonts \
        --fontlist 'B Nazanin' \
        --ptsize 20 \
        --lang fas \
        --linedata_only \
        --langdata_dir langdata_lstm \
        --tessdata_dir tesseract/tessdata \
        --save_box_tiff \
        --maxpages 10 \
        --output_dir train

The error returned is:

[20:53:28] INFO - Log file location: /tmp/fas-2022-01-0474uqtjuu/tesstrain.log
[20:53:28] INFO - === Starting training for language fas
[20:53:28] INFO - Testing font: B Nazanin
[20:53:29] INFO - === Phase I: Generating training images ===
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][20:53:29] INFO - Rendering using B Nazanin
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.35s/it]
[20:53:41] INFO - === Phase UP: Generating unicharset and unichar properties files ===
[20:53:41] INFO - === Phase E: Generating lstmf files ===
[20:53:41] INFO - Using fas.config
[20:53:41] INFO - Using TESSDATA_PREFIX=tesseract/tessdata
  0%|                                                                                                                                                                                 | 0/1 [00:00<?, ?it/s][20:53:42] ERROR - Page 1
Failed to read boxes from /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Error during processing.

[20:53:42] CRITICAL - Program /usr/bin/tesseract failed with return code 1. Abort.
  0%|                                                                                                                                                                                 | 0/1 [00:01<?, ?it/s]
Temporary files retained at: /tmp/fas-2022-01-0474uqtjuu

Log file:

[2022-01-04 20:53:28,307] - INFO - root - === Starting training for language fas
[2022-01-04 20:53:28,307] - DEBUG - language_specific - ambigs_filter_denominator = 100000
[2022-01-04 20:53:28,308] - DEBUG - language_specific - bigram_dawg_factor = 0.015
[2022-01-04 20:53:28,308] - DEBUG - language_specific - exposures = [0] (was None)
[2022-01-04 20:53:28,308] - DEBUG - language_specific - filter_arguments = []
[2022-01-04 20:53:28,308] - DEBUG - language_specific - fonts = ['B Nazanin'] (set on cmdline)
[2022-01-04 20:53:28,309] - DEBUG - language_specific - fragments_disabled = y
[2022-01-04 20:53:28,309] - DEBUG - language_specific - generate_word_bigrams = None
[2022-01-04 20:53:28,309] - DEBUG - language_specific - lang_is_rtl = True
[2022-01-04 20:53:28,309] - DEBUG - language_specific - leading = 32
[2022-01-04 20:53:28,309] - DEBUG - language_specific - mean_count = 40
[2022-01-04 20:53:28,309] - DEBUG - language_specific - mix_lang = eng
[2022-01-04 20:53:28,309] - DEBUG - language_specific - norm_mode = 2
[2022-01-04 20:53:28,310] - DEBUG - language_specific - number_dawg_factor = 0.125
[2022-01-04 20:53:28,310] - DEBUG - language_specific - punc_dawg_factor = None
[2022-01-04 20:53:28,310] - DEBUG - language_specific - run_shape_clustering = False (set on cmdline)
[2022-01-04 20:53:28,310] - DEBUG - language_specific - text2image_extra_args = []
[2022-01-04 20:53:28,310] - DEBUG - language_specific - text_corpus = /fas.corpus.txt
[2022-01-04 20:53:28,310] - DEBUG - language_specific - training_data_arguments = []
[2022-01-04 20:53:28,311] - DEBUG - language_specific - word_dawg_factor = 0.05
[2022-01-04 20:53:28,311] - DEBUG - language_specific - word_dawg_size = None
[2022-01-04 20:53:28,311] - DEBUG - language_specific - wordlist2dawg_arguments =
[2022-01-04 20:53:28,312] - INFO - tesstrain_utils - Testing font: B Nazanin
[2022-01-04 20:53:28,312] - DEBUG - tesstrain_utils - Running /usr/bin/text2image
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --fonts_dir=fonts
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --font=B Nazanin
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --outputbase=/tmp/font_tmp2ur0uyxt/sample_text.txt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --text=/tmp/font_tmp2ur0uyxt/sample_text.txt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --fontconfig_tmpdir=/tmp/font_tmp2ur0uyxt
[2022-01-04 20:53:28,313] - DEBUG - tesstrain_utils - --ptsize=20
[2022-01-04 20:53:29,624] - DEBUG - /usr/bin/text2image - Stripped 1 unrenderable words
Rendered page 0 to file /tmp/font_tmp2ur0uyxt/sample_text.txt.tif

[2022-01-04 20:53:29,625] - INFO - tesstrain_utils - === Phase I: Generating training images ===
[2022-01-04 20:53:29,658] - INFO - tesstrain_utils - Rendering using B Nazanin
[2022-01-04 20:53:29,659] - DEBUG - tesstrain_utils - Running /usr/bin/text2image
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --fontconfig_tmpdir=/tmp/font_tmp2ur0uyxt
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --fonts_dir=fonts
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --strip_unrenderable_words
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --leading=32
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --char_spacing=0.0
[2022-01-04 20:53:29,660] - DEBUG - tesstrain_utils - --exposure=0
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --outputbase=/tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --max_pages=10
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --font=B Nazanin
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --text=langdata_lstm/fas/fas.training_text
[2022-01-04 20:53:29,661] - DEBUG - tesstrain_utils - --ptsize=20
[2022-01-04 20:53:40,998] - DEBUG - /usr/bin/text2image - Stripped 25 unrenderable words
Rendered page 0 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 18 unrenderable words
Rendered page 1 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 2 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 26 unrenderable words
Rendered page 3 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 23 unrenderable words
Error in boxCreate: y < 0 and box off +quad
Rendered page 4 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 27 unrenderable words
Rendered page 5 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 27 unrenderable words
Rendered page 6 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 21 unrenderable words
Rendered page 7 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 8 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 9 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!

[2022-01-04 20:53:41,003] - INFO - tesstrain_utils - === Phase UP: Generating unicharset and unichar properties files ===
[2022-01-04 20:53:41,005] - DEBUG - tesstrain_utils - Running /usr/bin/unicharset_extractor
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - --output_unicharset
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - --norm_mode
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - 2
[2022-01-04 20:53:41,006] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.box
[2022-01-04 20:53:41,026] - DEBUG - /usr/bin/unicharset_extractor - Failed to read data from: /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.box
Wrote unicharset file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset

[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - Running /usr/bin/set_unicharset_properties
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - -U
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - -O
[2022-01-04 20:53:41,028] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - -X
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.xheights
[2022-01-04 20:53:41,029] - DEBUG - tesstrain_utils - --script_dir=langdata_lstm
[2022-01-04 20:53:41,082] - DEBUG - /usr/bin/set_unicharset_properties - Loaded unicharset of size 3 from file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/fas-2022-01-0474uqtjuu/fas.unicharset

[2022-01-04 20:53:41,083] - INFO - tesstrain_utils - === Phase E: Generating lstmf files ===
[2022-01-04 20:53:41,083] - DEBUG - tesstrain_utils - [PosixPath('/tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif')]
[2022-01-04 20:53:41,084] - INFO - tesstrain_utils - Using fas.config
[2022-01-04 20:53:41,084] - INFO - tesstrain_utils - Using TESSDATA_PREFIX=tesseract/tessdata
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - Running /usr/bin/tesseract
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
[2022-01-04 20:53:41,086] - DEBUG - tesstrain_utils - /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0
[2022-01-04 20:53:41,087] - DEBUG - tesstrain_utils - lstm.train
[2022-01-04 20:53:41,087] - DEBUG - tesstrain_utils - langdata_lstm/fas/fas.config
[2022-01-04 20:53:42,343] - ERROR - /usr/bin/tesseract - Page 1
Failed to read boxes from /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Error during processing.

[2022-01-04 20:53:42,344] - CRITICAL - tesstrain_utils - Program /usr/bin/tesseract failed with return code 1. Abort.
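
A log this long is easier to read once the text2image messages are pulled out. The Python sketch below (function name is illustrative) collects the per-page "Stripped N unrenderable words" counts and the error lines; it is shown on a short excerpt copied from the log above.

```python
import re
from typing import List, Tuple


def summarize_text2image_log(log: str) -> Tuple[List[int], List[str]]:
    """Return (stripped-word counts per page, error lines) from a text2image log."""
    stripped = [int(n) for n in re.findall(r"Stripped (\d+) unrenderable words", log)]
    errors = [
        line.strip()
        for line in log.splitlines()
        if line.startswith(("Error", "Null box"))
    ]
    return stripped, errors


# Excerpt copied from the log above.
excerpt = """\
Stripped 23 unrenderable words
Error in boxCreate: y < 0 and box off +quad
Rendered page 4 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Stripped 19 unrenderable words
Rendered page 9 to file /tmp/fas-2022-01-0474uqtjuu/fas.B_Nazanin.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
"""

stripped, errors = summarize_text2image_log(excerpt)
print("stripped per page:", stripped)
for line in errors:
    print("error:", line)
```

In the full log, text2image strips between 18 and 27 words on each of the 10 pages (224 in total) and then reports "Null box at index 0" while writing the box file, which would explain why unicharset_extractor and tesseract cannot read the boxes afterwards.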

Would you please check this error?

I appreciate your help 🙇‍♂️

Shreeshrii commented 2 years ago

Your training_text has very long lines as well as English text.

Why don't you test with the fas.training_text from the langdata repo, which is a smaller file, and see if that works?
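
Both properties mentioned here, very long lines and embedded English text, can be checked before rendering. A rough Python sketch follows; the 200-character threshold and the function name are assumptions for illustration, not Tesseract limits.

```python
import re
from typing import List, Tuple

LATIN = re.compile(r"[A-Za-z]")


def flag_lines(text: str, max_chars: int = 200) -> List[Tuple[int, bool, bool]]:
    """Return (line number, too_long, has_latin) for every suspicious line."""
    report = []
    for number, line in enumerate(text.splitlines(), start=1):
        too_long = len(line) > max_chars
        has_latin = bool(LATIN.search(line))
        if too_long or has_latin:
            report.append((number, too_long, has_latin))
    return report


# Synthetic sample: a short Persian line, a mixed line, and a 250-character line.
sample_text = "\n".join([
    "\u0633\u0644\u0627\u0645",
    "\u0633\u0644\u0627\u0645 hello",
    "\u0627" * 250,
])

report = flag_lines(sample_text)
print(report)
```

On a real file, the same function can be called with the contents of fas.training_text read as UTF-8.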

mohsenomidi commented 2 years ago

I am just using the files from langdata_lstm; the error happens with the repository above.

Do you mean using this repo: langdata?

Shreeshrii commented 2 years ago

This is a problem with the text2image program; see the outstanding issue in the tesseract repo: https://github.com/tesseract-ocr/tesseract/issues/3563

mohsenomidi commented 2 years ago

@Shreeshrii Thanks for your help, I will continue in that thread.

TheFattestTony commented 2 years ago

Hello. I'm a little confused about combine_lang_model. The documentation at https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html#training-text-requirements says that combine_lang_model builds its output from a unicharset file: "A new tool: combine_lang_model is provided to make a starter traineddata from a unicharset and optional wordlists." I hope this helps you in some way.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mohsenomidi commented 2 years ago

Waiting for linked issue investigation result

Nadiam75 commented 2 years ago

Hi, could you please explain how you resolved the "Encoding of string failed! Failure bytes" error?

mohsenomidi commented 2 years ago

As you can see from the history of this issue, the problem is a bug in the Tesseract core. The related issue was opened in the main repository and is linked here; you can follow up in that thread.

mohsenomidi commented 2 years ago

Waiting for referenced issue

mohsenomidi commented 1 year ago

Still waiting

mohsenomidi commented 1 year ago

Waiting for referenced issue

mohsenomidi commented 1 year ago

Still waiting for response

mohsenomidi commented 1 year ago

Still waiting for response