tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Error during conversion of wordlists to DAWGs!! #1574

Closed loralyc closed 6 years ago

loralyc commented 6 years ago

Tesseract Version: tesseract 4.00.00alpha
Platform: Ubuntu 14.04

I executed the following command:

tesstrain.sh --fonts_dir /usr/share/fonts/ --lang ara --linedata_only --noextract_font_properties --exposures "0" --langdata_dir ../langdata --fontlist "Arial" --output_dir ../tesstutorial/ara

But when it reaches "Constructing LSTM training data", it produces the following error:

Loaded unicharset of size 74 from file /tmp/tmp.7uMJLCCWMH/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
Moving /tmp/tmp.7uMJLCCWMH/ara/ara.Arial.exp0.lstmf to ../tesstutorial/ara

Created starter traineddata for language 'ara'

Run lstmtraining to do the LSTM training for language 'ara'

I have already downloaded the latest langdata and compiled the newest tesseract. Do you know what causes this error, and how can I fix it? Thank you.

Shreeshrii commented 6 years ago

Does ../langdata/ara have the ara.wordlist, ara.numbers, and ara.punc files?
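
One way to answer this is to test for each file explicitly. The sketch below is only an illustration: the helper name check_langdata is hypothetical, and the three file names are the ones asked about above (the ones passed to combine_lang_model via --words, --numbers and --puncs in the logs in this thread).

```shell
#!/bin/sh
# Hypothetical helper: report which of the three language-model input files
# are missing or empty.
# Usage: check_langdata LANGDATA_DIR LANG
check_langdata() {
    dir=$1
    lang=$2
    missing=0
    for ext in wordlist numbers punc; do
        f="$dir/$lang/$lang.$ext"
        # -s is true only if the file exists and is not empty
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_langdata ../langdata ara` prints nothing and returns 0 when all three files are present and non-empty.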

loralyc commented 6 years ago

Yes, it has all of them, @Shreeshrii. The contents of langdata/ara are:

ara.config
ara.punc
ara.training_text.bigram_freqs
ara.word.bigrams
desired_characters
ara.numbers
ara.training_text
ara.training_text.unigram_freqs
ara.wordlist
forbidden_characters

Shreeshrii commented 6 years ago

tesseract 4.00.00alpha

The newest tesseract is 4.0.0-beta.1.

You can install from links given in

https://github.com/tesseract-ocr/tesseract/wiki#tesseract-400-beta-1-packages-with-lstm-engine-and-related-traineddata

Shreeshrii commented 6 years ago

You need to make sure that all directories are specified correctly according to where you have the files.

I had to modify the command for my setup. The following works:

./tesseract-HEAD/src/training/tesstrain.sh --fonts_dir ./.fonts/ --lang ara \
--linedata_only --noextract_font_properties --exposures "0" --langdata_dir ./langdata \
--tessdata_dir ./tessdata_best --fontlist "Arial" \
--output_dir ../tesstutorial/ara

=== Starting training for language 'ara'
[Mon May 14 16:50:34 DST 2018] /usr/local/bin/text2image --fonts_dir=./.fonts/ --font=Arial --outputbase=/tmp/font_tmp.SoY0VEzbuC/sample_text.txt --text=/tmp/font_tmp.SoY0VEzbuC/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.SoY0VEzbuC
Rendered page 0 to file /tmp/font_tmp.SoY0VEzbuC/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial
[Mon May 14 16:50:53 DST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.SoY0VEzbuC --fonts_dir=./.fonts/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0 --max_pages=0 --font=Arial --text=./langdata/ara/ara.training_text
Rendered page 0 to file /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Mon May 14 16:51:01 DST 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset --norm_mode 2 /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0.box
Extracting unicharset from box file /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0.box
Mirror { of } is not in unicharset
Wrote unicharset file /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset
[Mon May 14 16:51:02 DST 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset -O /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset -X /tmp/tmp.VWcnZnZq3I/ara/ara.xheights --script_dir=./langdata
Loaded unicharset of size 74 from file /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Setting script properties
Writing unicharset to file /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata_best
[Mon May 14 16:51:03 DST 2018] /usr/local/bin/tesseract /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0.tif /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0 lstm.train ./langdata/ara/ara.config
Tesseract Open Source OCR Engine v4.0.0-beta.1-224-g8e23 with Leptonica
Page 1
Page 2
Loaded 53/53 pages (1-53) of document /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0.lstmf

=== Constructing LSTM training data ===
Creating new directory ../tesstutorial/ara
[Mon May 14 16:51:10 DST 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset --script_dir ./langdata --words ./langdata/ara/ara.wordlist --numbers ./langdata/ara/ara.numbers --puncs ./langdata/ara/ara.punc --output_dir ../tesstutorial/ara --lang ara --pass_through_recoder --lang_is_rtl
Loaded unicharset of size 74 from file /tmp/tmp.VWcnZnZq3I/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.VWcnZnZq3I/ara/ara.Arial.exp0.lstmf to ../tesstutorial/ara

Created starter traineddata for language 'ara'

Run lstmtraining to do the LSTM training for language 'ara'

loralyc commented 6 years ago

I am sorry, I made a mistake. I have already installed the newest tesseract 4.0.0-beta.1. Here is my detailed log:

=== Starting training for language 'ara'
[Mon May 14 19:29:58 CST 2018] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts/ --font=Arial --outputbase=/tmp/font_tmp.5QMVVpeBrn/sample_text.txt --text=/tmp/font_tmp.5QMVVpeBrn/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.5QMVVpeBrn
/usr/local/bin/text2image: /home/zjw409/anaconda2/lib/libtiff.so.5: no version information available (required by /usr/local/lib/liblept.so.5)
Rendered page 0 to file /tmp/font_tmp.5QMVVpeBrn/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial
[Mon May 14 19:30:02 CST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.5QMVVpeBrn --fonts_dir=/usr/share/fonts/ --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0 --max_pages=0 --font=Arial --text=../langdata/ara/ara.training_text
/usr/local/bin/text2image: /home/zjw409/anaconda2/lib/libtiff.so.5: no version information available (required by /usr/local/lib/liblept.so.5)
Rendered page 0 to file /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Mon May 14 19:30:07 CST 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset --norm_mode 2 /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0.box
/usr/local/bin/unicharset_extractor: /home/zjw409/anaconda2/lib/libtiff.so.5: no version information available (required by /usr/local/lib/liblept.so.5)
Extracting unicharset from box file /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0.box
Mirror { of } is not in unicharset
Wrote unicharset file /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset
[Mon May 14 19:30:07 CST 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset -O /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset -X /tmp/tmp.1s0ZmeKgq2/ara/ara.xheights --script_dir=../langdata
/usr/local/bin/set_unicharset_properties: /home/zjw409/anaconda2/lib/libtiff.so.5: no version information available (required by /usr/local/lib/liblept.so.5)
Loaded unicharset of size 74 from file /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Setting script properties
Writing unicharset to file /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=/media/data2/lyc409/tesseract-ocr/tessdata
[Mon May 14 19:30:07 CST 2018] /usr/local/bin/tesseract /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0.tif /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0 lstm.train ../langdata/ara/ara.config
/usr/local/bin/tesseract: /home/zjw409/anaconda2/lib/libtiff.so.5: no version information available (required by /usr/local/lib/liblept.so.5)
Tesseract Open Source OCR Engine v4.0.0-beta.1-228-gd057 with Leptonica
Page 1
Page 2
Loaded 53/53 pages (1-53) of document /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0.lstmf

=== Constructing LSTM training data ===
[Mon May 14 19:30:09 CST 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset --script_dir ../langdata --words ../langdata/ara/ara.wordlist --numbers ../langdata/ara/ara.numbers --puncs ../langdata/ara/ara.punc --output_dir ../tesstutorial/ara --lang ara --pass_through_recoder --lang_is_rtl
/usr/local/bin/combine_lang_model: /home/zjw409/anaconda2/lib/libtiff.so.5: no version information available (required by /usr/local/lib/liblept.so.5)
Loaded unicharset of size 74 from file /tmp/tmp.1s0ZmeKgq2/ara/ara.unicharset
Setting unichar properties
Mirror { of } is not in unicharset
Setting script properties
Config file is optional, continuing...
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
Moving /tmp/tmp.1s0ZmeKgq2/ara/ara.Arial.exp0.lstmf to ../tesstutorial/ara

Created starter traineddata for language 'ara'

Run lstmtraining to do the LSTM training for language 'ara'

@Shreeshrii

Shreeshrii commented 6 years ago

/usr/local/bin/text2image: /home/zjw409/anaconda2/lib/libtiff.so.5: no version information available (required by /usr/local/lib/liblept.so.5)

This could be unrelated. However, please let me know the output of

tesseract -v

--langdata_dir ../langdata

What is the output of

ls -l ../langdata/ara

Shreeshrii commented 6 years ago

Please see https://github.com/tesseract-ocr/tesseract/issues/1147#issuecomment-332775119

This problem could be related to incorrect line endings in your langdata files.
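
A rough way to check for this, assuming "incorrect line endings" here means Windows-style CRLF, is to look for carriage returns in the langdata files. The helper name find_crlf_files below is hypothetical and the snippet is only a sketch:

```shell
#!/bin/sh
# Hypothetical helper: list regular files in DIR that contain a carriage
# return, which usually indicates CRLF (Windows) line endings.
# Usage: find_crlf_files DIR
find_crlf_files() {
    # Command substitution strips trailing newlines only, so the CR survives.
    cr=$(printf '\r')
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        if grep -q "$cr" "$f"; then
            echo "$f"
        fi
    done
}
```

For example, `find_crlf_files ../langdata/ara` prints one path per affected file and prints nothing if every file is clean.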

loralyc commented 6 years ago

Thank you very much, @Shreeshrii! I re-downloaded the langdata on Linux and that fixed the problem. It seems the problem was caused by the incorrect line endings.

madhuri-dadhich commented 5 years ago

I am getting an error when trying to train tesseract on a new font:

=== Constructing LSTM training data ===
[Tue, Sep 10, 2019 1:42:29 PM] /c/Program Files (x86)/Tesseract-OCR/combine_lang_model --input_unicharset /tmp/eng-2019-09-10.lCl/eng.unicharset --script_dir langdata_lstm --words langdata_lstm/eng/eng.wordlist --numbers langdata_lstm/eng/eng.numbers --puncs langdata_lstm/eng/eng.punc --output_dir train --lang eng
Loaded unicharset of size 99 from file C:/Users/MADHUR~1.DAD/AppData/Local/Temp/eng-2019-09-10.lCl/eng.unicharset
Setting unichar properties
Setting script properties
Warning: properties incomplete for index 68 = ~
Config file is optional, continuing...
Failed to read data from: langdata_lstm/eng/eng.config
Null char=2
Invalid format in radical table at line 0: 19886 3 23 6 3
Creation of encoded unicharset failed!!
Error writing recoder!!
Reducing Trie to SquishedDawg
Error during conversion of wordlists to DAWGs!!
basename: extra operand ‘(x86)/Tesseract-OCR/combine_lang_model’
Try 'basename --help' for more information.
ERROR: Program failed. Abort.
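
The `basename: extra operand` message at the end of this log is what GNU basename prints when it receives more operands than it expects, which happens when a path containing spaces is passed unquoted. It therefore looks like a side effect of the install path (/c/Program Files (x86)/...) rather than the cause of the DAWG error. A minimal sketch of the difference, assuming that is what happens inside the training script; the helper name program_name is hypothetical:

```shell
#!/bin/sh
# Path copied from the log above; it contains spaces.
cmd="/c/Program Files (x86)/Tesseract-OCR/combine_lang_model"

# Hypothetical helper: quoting keeps the whole path as a single operand.
program_name() {
    basename "$1"
}

program_name "$cmd"    # prints: combine_lang_model

# Unquoted, the shell splits the path on spaces into three operands,
# which GNU basename rejects with "extra operand":
#   basename $cmd
```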

Shreeshrii commented 5 years ago

Invalid format in radical table at line 0: 19886 3 23 6 3

Check that you have the latest version of the files, i.e. from after this commit: https://github.com/tesseract-ocr/langdata/commit/3e32be3dc07be0994f3687664a44cb3246b5aa11

What is the output of

tesseract -v

madhuri-dadhich commented 5 years ago

$ tesseract -v
tesseract v4.0.0-beta.1.20180608
leptonica-1.76.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0

Shreeshrii commented 5 years ago

Please upgrade your version of tesseract and related files.

madhuri-dadhich commented 5 years ago

I am still getting the same error.

DavraYoung commented 3 years ago

I checked the line endings and even ran dos2unix to be sure. All files use Unix line endings, yet I still got the same error:

Config file is optional, continuing...
Failed to read data from: ./langdata_lstm/eng/eng.config
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg 
Error during conversion of wordlists to DAWGs!!
ERROR: Program combine_lang_model failed. Abort. 

tesseract -v
tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4