tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

4.0 bugs on MAC OS X and a step by step for reference #1453

Closed FernandoGOT closed 5 years ago

FernandoGOT commented 6 years ago

This is the step-by-step procedure I used to install tesseract 4.0 on Mac OS X, together with the fixes and workarounds I needed to make it work. I'm sharing this "guide" to help other people who may run into the same problems I had.

Special thanks to Shree, who helped me on the Google group.

Project and more details: https://github.com/tesseract-ocr/tesseract

Where to get help?

google group: https://groups.google.com/forum/#!forum/tesseract-ocr
git: https://github.com/tesseract-ocr/tesseract/issues

Platform: MAC OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
 leptonica-1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found SSE

Compiling Tesseract - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos

Warning: Don't install tesseract using brew, since you can't generate ScrollView.jar from that install (at least I wasn't able to).

Steps

1 - Install these libs

brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc

2 - Point /usr/local/opt/icu4c at icu4c 60.2

ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c

Note: text2image is set to use icu4c/60.2, but the current icu4c version is 61.1

3 - Clone tesseract repo

git clone https://github.com/tesseract-ocr/tesseract/

4 - Enter the folder

cd tesseract

5 - Run the script

./autogen.sh

6 - Run the command below and copy the CPPFLAGS and LDFLAGS values it prints

brew info icu4c

7 - Update CPPFLAGS and LDFLAGS with those values and run configure

./configure \
  CPPFLAGS=-I/usr/local/opt/icu4c/include \
  LDFLAGS=-L/usr/local/opt/icu4c/lib
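
Steps 6 and 7 can be combined so the path is not typed by hand. The following is only a sketch: the helper names icu_prefix and icu_configure_args are made up for illustration, and it assumes that `brew --prefix icu4c` prints the directory that holds icu4c's include and lib folders. When brew is not available it falls back to the path used in step 7.

```shell
# Hypothetical helpers, not part of tesseract or Homebrew.
icu_prefix() {
  # Assumption: brew reports the icu4c prefix; otherwise use the path from step 7.
  if command -v brew >/dev/null 2>&1; then
    brew --prefix icu4c 2>/dev/null && return 0
  fi
  echo /usr/local/opt/icu4c
}

icu_configure_args() {
  prefix="$(icu_prefix)"
  printf 'CPPFLAGS=-I%s/include LDFLAGS=-L%s/lib\n' "$prefix" "$prefix"
}

# Print the configure command for review instead of running it.
echo "./configure $(icu_configure_args)"
```

Printing the command first makes it easy to compare against the output of brew info icu4c before running it.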

8 - Build tesseract

make -j
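
With GNU make, -j without a number puts no limit on the number of parallel jobs, which can overload a machine with little memory. A possible alternative, sketched here with a made-up helper name cpu_count, is to cap the jobs at the number of CPU cores. It assumes `sysctl -n hw.ncpu` works on Mac OS X and uses getconf as a fallback elsewhere.

```shell
# Hypothetical helper: print the number of CPU cores, or 1 if it cannot be determined.
cpu_count() {
  # sysctl covers Mac OS X; getconf covers Linux and most other Unix systems.
  n="$(sysctl -n hw.ncpu 2>/dev/null)" ||
    n="$(getconf _NPROCESSORS_ONLN 2>/dev/null)" ||
    n=1
  echo "${n:-1}"
}

# Print the build command for review instead of running it.
echo "make -j$(cpu_count)"
```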

9 - Install tesseract

sudo make install

10 - Update the dyld shared cache

sudo update_dyld_shared_cache

Note: this is the Mac OS X equivalent of sudo ldconfig

11 - Build the training tools

make training

Creating ScrollView.jar - tesseract 4.0

Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging

Important: Use JDK 8 for the build, otherwise it fails with an error

Steps

1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar

http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar

2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java

3 - Enter the tesseract/java folder

cd java

4 - Set the SCROLLVIEW_PATH variable to your tesseract/java folder and build the jar

SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
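
Before running make ScrollView.jar it may help to confirm that both piccolo2d jars from steps 1 and 2 are in place. This is a minimal sketch; missing_piccolo_jars is a made-up helper name, and the directory in the example follows the ~/projects layout used in this guide.

```shell
# Hypothetical helper: print the name of each piccolo2d jar missing from directory $1.
missing_piccolo_jars() {
  for jar in piccolo2d-core-3.0.jar piccolo2d-extras-3.0.jar; do
    [ -f "$1/$jar" ] || echo "$jar"
  done
}

# Example: prints nothing when both jars are present.
# missing_piccolo_jars ~/projects/tesseract/java
```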

Training Font - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain

Steps

1 - Clone the langdata dir from git

git clone https://github.com/tesseract-ocr/langdata

2 - Enter the tesseract folder

cd ..

3 - Run this command and pick one font from the list (I recommend "Verdana")

text2image --list_available_fonts --fonts_dir=/Library/Fonts

Font dir for MAC can be:
~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/

More details here: https://support.apple.com/en-us/HT201722
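
To see which of these font directories actually exist on a given machine, something like the following sketch could be used. existing_font_dirs is a made-up helper name; the directory list is the one given above.

```shell
# Hypothetical helper: print each argument that is an existing directory.
existing_font_dirs() {
  for dir in "$@"; do
    [ -d "$dir" ] && echo "$dir"
  done
  return 0
}

existing_font_dirs "$HOME/Library/Fonts" /Library/Fonts /Network/Library/Fonts \
  /System/Library/Fonts "/System Folder/Fonts"
```

Any directory it prints can be passed to text2image through --fonts_dir, as in the command in step 3.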

4 - Replace line 195 of tesseract/training/tesstrain_utils.sh as follows

- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)

Note: this fixes the following error:

mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
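
The error shows that mktemp on Mac OS X does not accept the --tmpdir option. If the same script has to run on both Linux and Mac OS X, one possible approach is to try the original form first and fall back to the -t form. This is only a sketch with a made-up function name, not the change that went into tesstrain_utils.sh.

```shell
# Hypothetical helper: create a temporary directory with either flavour of mktemp.
make_font_cache_dir() {
  # Original form first; on failure (as in the error above) fall back to -t.
  mktemp -d --tmpdir font_tmp.XXXXXXXXXX 2>/dev/null ||
    mktemp -d -t font_tmp.XXXXXXXXXX
}

FONT_CONFIG_CACHE="$(make_font_cache_dir)"
export FONT_CONFIG_CACHE
echo "$FONT_CONFIG_CACHE"
```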

5 - Clone a tessdata repo from git (I recommend "tessdata_best" since it is the more accurate one; "tessdata_fast" is just faster)

git clone https://github.com/tesseract-ocr/tessdata_best

or

git clone https://github.com/tesseract-ocr/tessdata_fast

6 - Copy eng.traineddata (for English training) from the tessdata repo you just cloned, e.g. tessdata_best/eng.traineddata, and paste it into tesseract/tessdata/
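
The copy in step 6 can also be done from the console. A minimal sketch follows; install_traineddata is a made-up helper name, and the paths in the example assume the repos were cloned under ~/projects as in the rest of this guide.

```shell
# Hypothetical helper.
# $1: language code, $2: cloned tessdata directory, $3: destination tessdata directory
install_traineddata() {
  src="$2/$1.traineddata"
  if [ ! -f "$src" ]; then
    echo "missing: $src" >&2
    return 1
  fi
  mkdir -p "$3" && cp "$src" "$3/"
}

# Example:
# install_traineddata eng ~/projects/tessdata_best ~/projects/tesseract/tessdata
```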

7 - Create the training data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/engtrain

Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

8 - Create a second data set using another font, for comparison

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
  --output_dir ~/tesstutorial/engeval

Add the prefix PANGOCAIRO_BACKEND=fc if using MAC OSX

9 - Create the needed folder

mkdir -p ~/tesstutorial/engoutput

10 - Start the training

SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

If you failed to build ScrollView.jar, set the debug interval to -1: --debug_interval -1
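
The choice between the two values can be made by checking whether the jar exists. This is just a sketch; debug_interval_for is a made-up helper name, and the values 100 and -1 are the ones used in this step.

```shell
# Hypothetical helper.
# $1: directory that should contain ScrollView.jar (the SCROLLVIEW_PATH value)
debug_interval_for() {
  if [ -f "$1/ScrollView.jar" ]; then
    printf '%s\n' 100
  else
    printf '%s\n' -1
  fi
}

# Example: pass the result to lstmtraining in place of the fixed value.
# --debug_interval "$(debug_interval_for ~/projects/tesseract/java)"
```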

11 - Monitor the log on another console

tail -f ~/tesstutorial/engoutput/basetrain.log

12 - Test Accuracy with other font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt
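
If lstmeval or lstmtraining complains about its input, it may be worth checking that every file named in the list file exists. The sketch below assumes that eng.training_files.txt holds one file path per line; missing_listed_files is a made-up helper name.

```shell
# Hypothetical helper.
# $1: list file with one path per line; prints each listed path that does not exist
missing_listed_files() {
  while IFS= read -r path || [ -n "$path" ]; do
    [ -n "$path" ] || continue
    [ -e "$path" ] || echo "$path"
  done < "$1"
}

# Example:
# missing_listed_files ~/tesstutorial/engeval/eng.training_files.txt
```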

13 - Test Accuracy with best traindata

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

14 - Test Accuracy with actual traindata (in this case the same as step 13)

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt

Fine tuning - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Steps

1 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_small

2 - Start fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_small/verdana \
  --continue_from ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 1200

3 - Validate the progress

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

4 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_full

5 - Extract the LSTM model from eng.traineddata

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/verdana_from_full/eng.lstm

6 - Continue training from the extracted model

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_full/verdana \
  --continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 400

7 - Validate the results on the main training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

8 - Validate the results on our training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt

Fine tuning add ± character - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

Steps

1 - Modify langdata/eng/eng.training_text and include these lines:

alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED

2 - Generate the training file

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
              "Times New Roman, Bold" \
              "Times New Roman, Bold Italic" \
              "Times New Roman, Italic" \
              "Courier New" \
              "Courier New Bold" \
              "Courier New Bold Italic" \
              "Courier New Italic" \
  --output_dir ~/tesstutorial/trainplusminus

3 - Generate the eval data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/evalplusminus

4 - Extract the LSTM model from eng.traineddata

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/trainplusminus/eng.lstm

5 - Fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/trainplusminus/plusminus \
  --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600

6 - Test the result on other fonts

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt

7 - Test the result on the main font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt

Shreeshrii commented 6 years ago

Thank you for step by step info. This should probably be added to wiki.

One correction:

When doing fine-tune training, ONLY traineddata files from tessdata_best can be used as a base traineddata to continue from

Models from tessdata_fast as well as tessdata will NOT work.

godofcheerup commented 6 years ago

@FernandoGOT Thank you. As @Shreeshrii mentioned, there is a problem with the fine-tune training part. I hope this page will be updated to reflect that soon. Thank you.

tfmorris commented 6 years ago

This is a great resource! It would be even more amazing if it were in the form of a pull request of changes to the existing documentation so that it could be improved to avoid these problems for other OS X users.

kas84 commented 6 years ago

I followed @FernandoGOT's steps but I am getting: read_params_file: parameter not found: enable_new_segsearch when running tesseract --list-langs. It's the first time I have tried to build tesseract, so I have no idea what's going on. Any ideas on where to look?

Shreeshrii commented 6 years ago

@kas84 please post results of

tesseract -v

Version info.

Are you using latest source from Github ?

kas84 commented 6 years ago

@Shreeshrii I cloned the repo like so git clone https://github.com/tesseract-ocr/tesseract/, so if latest version is in master, yes I am.

Shreeshrii commented 6 years ago

tesseract -v

kas84 commented 6 years ago

Yeah, I forgot, sorry!


 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found SSE

Shreeshrii commented 6 years ago

Usually tesseract -v should also show the tesseract version.

Is the error only with --list-langs

Are you able to recognize any test images?

kas84 commented 6 years ago

My bad:

tesseract 4.0.0-beta.1-232-g45a6
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found SSE

It also happens when trying to recognize an image, yes.

Shreeshrii commented 6 years ago

What commands are you using?

What tessdata-dir are you using? Eg. Where is eng.traineddata installed?
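
One way to answer the second question is to look for eng.traineddata in the likely places. The sketch below uses a made-up helper name, find_traineddata, and the directories in the example are only guesses: the TESSDATA_PREFIX value if it is set, the local tessdata folders, and /usr/local/share/tessdata.

```shell
# Hypothetical helper.
# $1: language code; remaining arguments: directories to search
find_traineddata() {
  lang="$1"
  shift
  for dir in "$@"; do
    [ -n "$dir" ] || continue   # skip an unset TESSDATA_PREFIX
    [ -f "$dir/${lang}.traineddata" ] && echo "$dir/${lang}.traineddata"
  done
  return 0
}

find_traineddata eng "${TESSDATA_PREFIX:-}" ./tessdata ../tessdata /usr/local/share/tessdata
```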

Shreeshrii commented 6 years ago

What output do you get with the following? Use ./tessdata if you have copied eng.traineddata there.

cd tesseract
tesseract ./testing/phototest.tif - --tessdata-dir ../tessdata  -c page_separator=''

Page 1 This is a lot of 12 point text to test the ocr code and see if it works on all types of file format.

The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.

kas84 commented 6 years ago

[screenshot 2018-05-14 at 14 02 24]

amitdo commented 6 years ago

page _seperator

The space here confuses the command line options parser.
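
The effect of the stray space can be seen without tesseract by counting the words the shell passes to a command. This is only an illustration; count_args is a made-up helper, not a tesseract tool.

```shell
# Hypothetical helper: print how many arguments it received.
count_args() {
  echo "$#"
}

count_args -c page_separator=''    # prints 2: -c and one key=value pair
count_args -c page _separator=''   # prints 3: the pair is split into two words
```

In the second call the word after -c is just page, so the parser no longer receives a complete key=value pair.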

karthik-ir commented 6 years ago

Has any one built a dockerfile out of this ?

kas84 commented 6 years ago

[screenshot 2018-05-14 at 15 26 39]

It works now! I am guessing it had something to do with my TESSDATA env

Shreeshrii commented 6 years ago

@karthik-ir

see

https://github.com/tesseract-ocr/tesseract/wiki/4.0-Docker-Containers

amitdo commented 6 years ago

I am guessing it had something to do with my TESSDATA env

No.

It was due to wrong command line usage.

kas84 commented 6 years ago

I am a newbie with tesseract and this has nothing to do with my bug, but... is it supposed to recognize images like this? [image: image-numbers] Or do I need to preprocess the image first to remove everything but white so that tesseract can handle it?

amitdo commented 6 years ago

Please use the forum for asking questions.

kas84 commented 6 years ago

Okay, sorry!

ysnnzlcn commented 6 years ago

@FernandoGOT Thank you very much for such a detailed explanation but I can't make it work. When I say "make training" it gives me "Need to reconfigure project, so there are no errors" error. Also, I couldn't create ScrollView.jar. Is it possible to update this post? Thank you.

FernandoGOT commented 6 years ago

@ysnnzlcn I'm short on time these days (working too much), but when I get some free time I'm going to write a better step-by-step guide on how to use tesseract and send a merge request to the docs

ysnnzlcn commented 6 years ago

@FernandoGOT That would be great, looking forward to it. Thanks

hadils commented 6 years ago

Under Training Font -- Tesseract 4.0, Step 7, I get a failure:


=== Starting training for language 'eng'
[Sat Sep 22 16:56:06 MST 2018] /usr/local/bin/text2image --fonts_dir=/Library/Fonts --font=Verdana --outputbase=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG/sample_text.txt --text=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG/sample_text.txt --fontconfig_tmpdir=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG

=== Phase I: Generating training images ===
Rendering using Verdana
[Sat Sep 22 16:56:09 MST 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/font_tmp.XXXXXXXXXX.I4GMoIqG --fonts_dir=/Library/Fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/eng-2018-09-22.XXX.rxeEXrp0/eng.Verdana.exp0 --max_pages=0 --font=Verdana --text=/Users/hadilsabbagh/tesseract/java/langdata/eng/eng.training_text
ERROR: /var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/eng-2018-09-22.XXX.rxeEXrp0/eng.Verdana.exp0.box does not exist or is not readable
ERROR: /var/folders/8x/69qlvhl16n56q28vy__yp10r0000gn/T/eng-2018-09-22.XXX.rxeEXrp0/eng.Verdana.exp0.box does not exist or is not readable

I have:

Hadil-Sabbaghs-MacBook-Pro:tesseract hadilsabbagh$ tesseract -v
tesseract 4.0.0-beta.4-158-g02f9d
 leptonica-1.76.0
  libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
 Found AVX2
 Found AVX
 Found SSE

My user is allowed to create files in that directory, and the directory itself is present.

Please advise. Hadil G. Sabbagh, Ph. D.

markedphillips commented 6 years ago

Hi, when I try installing this it breaks here:

[Wed Sep 26-19:00:26][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>sudo update_dyld_shared_cache
Password:
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
update_dyld_shared_cache: warning: x86_64h rejected from cached dylibs: /System/Library/PrivateFrameworks/CreateML.framework/Versions/A/CreateML (("Could not find dependency '/System/Library/PrivateFrameworks/TuriCore.framework/Versions/A/TuriCore'"))
[Wed Sep 26-19:00:48][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>

I really would like to get this working - I've spent a lot of time getting something running...any help or pointers to instructions would be greatly appreciated..

zdenop commented 5 years ago

@FernandoGOT @Shreeshrii: can you put the instructions on the wiki? I would like to close this issue (related to the build process). It is too long, and other people have mixed other topics (training) into it. @FernandoGOT: can you test the recent code?

Shreeshrii commented 5 years ago

I do not have a Mac. I would prefer that someone test with the current code and then post the required instructions to the wiki.

escapist21 commented 5 years ago

'make training' returns the following error:

combine_tessdata.cpp:100:9: error: use of undeclared identifier 'errno'
        errno = 0;
        ^
combine_tessdata.cpp:103:20: error: use of undeclared identifier 'errno'
        } else if (errno == 0) {
                   ^
combine_tessdata.cpp:109:36: error: use of undeclared identifier 'errno'
                 argv[i], strerror(errno));
                                   ^
combine_tessdata.cpp:120:9: error: use of undeclared identifier 'errno'
        errno = 0;
        ^
combine_tessdata.cpp:123:20: error: use of undeclared identifier 'errno'
        } else if (errno != 0) {
                   ^
combine_tessdata.cpp:125:46: error: use of undeclared identifier 'errno'
                 filename.string(), strerror(errno));
                                             ^
6 errors generated.
make[1]: [combine_tessdata.o] Error 1
make: [training] Error 2

Is there any fix for this issue?

Thanks
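This kind of error usually means the file uses errno without including a header that declares it (<cerrno> in C++, <errno.h> in C), often because another header that used to pull it in indirectly does not do so on this platform. The following is a minimal sketch of a quick check, not the project's fix: the function name includes_errno_header is hypothetical, and the path you pass is wherever combine_tessdata.cpp lives in your checkout.

```shell
# Hypothetical helper: succeed if a C/C++ source file directly includes
# <cerrno> or <errno.h>, the headers that declare errno.
# Example: includes_errno_header path/to/combine_tessdata.cpp \
#              || echo "errno header is not included directly"
includes_errno_header() {
    # $1: path to the source file to inspect
    grep -Eq '^[[:space:]]*#[[:space:]]*include[[:space:]]*<(cerrno|errno\.h)>' "$1"
}
```

A missing direct include is only a hint, since the header can also arrive through other includes; the compiler error itself is the authoritative signal.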

escapist21 commented 5 years ago

> @FernandoGOT Thank you very much for such a detailed explanation but I can't make it work. When I say "make training" it gives me "Need to reconfigure project, so there are no errors" error. Also, I couldn't create ScrollView.jar. Is it possible to update this post? Thank you.

Please check your output after running this code:

./configure \
  CPPFLAGS=-I/usr/local/opt/icu4c/include \
  LDFLAGS=-L/usr/local/opt/icu4c/lib

I came across the same error, and the log showed me an issue with icu4c and also asked me to install pango.

Once that is done, run the above code again and hopefully your error will be solved.

zdenop commented 5 years ago

@escapist21: is your compile problem with combine_tessdata still present?

tfmorris commented 5 years ago

@zdenop The errno problem exists in the current version. I'll have a look at it.

tfmorris commented 5 years ago

I created a bug report (#1986) and patch (#1987) for the problem reported by @escapist21.

With that bug fix and following the instructions on the wiki for MacPorts (https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos-with-macports), I was able to build both Tess and the training tools. This was not a clean install from scratch, so it's possible that I had a necessary dependency already installed, but I think this issue can be closed and folks can open new issues if they find additional problems.

One thing I noticed is that there's a small issue with linking the OpenMP version that I haven't looked into, but the standard non-OpenMP build works fine.

zdenop commented 5 years ago

@tfmorris: Can you please check a clean install from scratch, so we can be sure 4.0.0 is ready for Mac?

tfmorris commented 5 years ago

I don't usually have completely unused machines with none of the dependencies installed, but I've got a new work computer that I was able to use.

I made a minor edit to the homebrew instructions on the wiki page, but with that I was able to successfully build both the main program and the training tools with both MacPorts and Homebrew, using the current head of master.

amitdo commented 5 years ago

@tfmorris,

Please share your minor edits.

With OpenMP you can get a major speedup, so I suggest investigating how to make it work on macOS with Clang + LLVM's OpenMP runtime.

jamesoneill54 commented 5 years ago

> Hi, when I try installing this it breaks here:
>
> [Wed Sep 26-19:00:26][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>sudo update_dyld_shared_cache
> Password:
> update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
> update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
> update_dyld_shared_cache: warning: x86_64h skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
> update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-1.dat
> update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-2.dat
> update_dyld_shared_cache: warning: i386 skipping because of bad install name /System/Library/PrivateFrameworks/FaceCore.framework/Versions/A/Resources/fcl-fc-3.dat
> update_dyld_shared_cache: warning: x86_64h rejected from cached dylibs: /System/Library/PrivateFrameworks/CreateML.framework/Versions/A/CreateML (("Could not find dependency '/System/Library/PrivateFrameworks/TuriCore.framework/Versions/A/TuriCore'"))
> [Wed Sep 26-19:00:48][MEPMBP2017][(👨💻)markphillips](~/Documents/Development/Tesseract/tesseract) =>>
>
> I would really like to get this working. I've spent a lot of time trying to get something running, so any help or pointers to instructions would be greatly appreciated.

I am having this issue too. Has this been resolved here or somewhere else?

janceChun commented 5 years ago

> > @FernandoGOT Thank you very much for such a detailed explanation but I can't make it work. When I say "make training" it gives me "Need to reconfigure project, so there are no errors" error. Also, I couldn't create ScrollView.jar. Is it possible to update this post? Thank you.
>
> Please check your output after running this code:
>
> ./configure \
>   CPPFLAGS=-I/usr/local/opt/icu4c/include \
>   LDFLAGS=-L/usr/local/opt/icu4c/lib
>
> I came across the same error, and the log showed me an issue with icu4c and also asked me to install pango.
>
> Once that is done, run the above code again and hopefully your error will be solved.

@jamesoneill54 This worked for me: https://stackoverflow.com/questions/33259191/installing-libicu-dev-on-mac/33352241

stweil commented 5 years ago

I suggest closing this issue. Part of the information given here is no longer up to date.

tfmorris commented 5 years ago

> I made a minor edit to the homebrew instructions on the wiki page,

> Please share your minor edits.

@amitdo You can find my edits in the history for the wiki page.

> With OpenMP you can get a major speedup, so I suggest investigating how to make it work on macOS with Clang + LLVM's OpenMP runtime.

That's not something I have time to tackle.

> I suggest closing this issue. Part of the information given here is no longer up to date.

@stweil I suggested exactly that back in Oct 2018, so I obviously agree. :) If people run into new problems, they can open new issues (or just update the wiki with the necessary corrections).

jtlz2 commented 5 years ago

Did anyone manage to overcome the following error:

make training
Need to reconfigure project, so there are no errors

And if so, how?

stweil commented 5 years ago

make training is disabled because some requirements are missing.

jtlz2 commented 5 years ago

@stweil How do I diagnose which requirements are missing and why make training is disabled?

jtlz2 commented 5 years ago

Never mind, the configure output answers it:

configure: WARNING: pango 1.22.0 or higher is required, but was not found.
configure: WARNING: Training tools WILL NOT be built.
configure: WARNING: Try to install libpango1.0-dev package.
checking for cairo... no
configure: WARNING: Training tools WILL NOT be built because of missing cairo library.
configure: WARNING: Try to install libcairo-dev?? package.
checking that generated files are newer than configure... done

stweil commented 5 years ago

> @stweil How do I diagnose which requirements are missing and why make training is disabled?

Obviously you found the answer yourself: configure says that pango 1.22.0 or higher is required, but was not found.
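The same approach works for any missing training dependency: configure prints a WARNING line for each one, so saving its output and filtering for those lines shows why make training is disabled. A minimal sketch, assuming the output was saved to a file first (for example with ./configure 2>&1 | tee configure.out). The file name and both function names are illustrative and not part of Tesseract.

```shell
# Print only the WARNING lines from a saved ./configure transcript.
configure_warnings() {
    # $1: path to the saved output of ./configure
    grep 'WARNING' "$1"
}

# Return 0 if the transcript does not report that the training tools are
# skipped, 1 if it does, and 2 if the transcript cannot be read.
training_enabled() {
    [ -r "$1" ] || return 2
    if grep -q 'Training tools WILL NOT be built' "$1"; then
        return 1
    fi
    return 0
}
```

After installing whatever the warnings name, configure has to be run again before make training, because the decision to skip the training tools is made at configure time.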

khalajink commented 5 years ago

I am getting an error when I run 'text2image --list_available_fonts --fonts_dir=/Library/Fonts'.

Error: 'text2image: not found'.

Can you please suggest how I can tackle this issue?

macOS: 10.14.6

stweil commented 5 years ago

@khalajink, I suggest asking for help at the user forum.

jtlz2 commented 5 years ago

@khalajink Did you install the training tools (including text2image)?

If so, where are they? Make sure you've included them on your $PATH.
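To check this, it helps to ask the shell where (or whether) it resolves the command. A minimal sketch: locate_tool and prepend_path are hypothetical helper names, and any directory passed to prepend_path (such as /usr/local/bin) is only an assumed install location.

```shell
# Report where a command resolves on $PATH, or say that it is missing.
locate_tool() {
    # $1: command name, e.g. text2image
    if tool_path=$(command -v "$1"); then
        printf '%s found at %s\n' "$1" "$tool_path"
    else
        printf '%s is not on PATH\n' "$1" >&2
        return 1
    fi
}

# Prepend a directory to PATH for the current shell session, at most once.
prepend_path() {
    # $1: directory that contains the installed tools
    case ":$PATH:" in
        *":$1:"*) ;;                          # already present, nothing to do
        *) PATH="$1:$PATH"; export PATH ;;
    esac
}
```

For example, locate_tool text2image reports the resolved location, and prepend_path /usr/local/bin adds that directory for the current session. If bash still reports a location that no longer exists, hash -r clears its remembered command paths. In the Tesseract build, the training tools are built with make training and installed with make training-install, so they are not present after a plain make install.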

khalajink commented 5 years ago

@jtlz2 I followed @FernandoGOT's comment, but I do not see an installation step for text2image there; I suppose it comes along with icu4c. How do I include it in $PATH?

When I try to run 'text2image --list_available_fonts --fonts_dir=/Library/Fonts', the error is '-bash: /usr/local/bin/text2image: No such file or directory'.

Also, I see that you had an issue related to the pango version 3 days ago. I am facing it too, although I already have pango 1.44.6 installed. How did you solve it?

khalajink commented 5 years ago

Solved the pango issue by following https://stackoverflow.com/questions/55361379/osx-compiling-training-tools-for-tesseract-4-0-pango-libraries-not-found

> Also, I see that you had an issue related to the pango version 3 days ago. I am facing it too, although I already have pango 1.44.6 installed. How did you solve it?
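Since configure looks for pango through pkg-config, one common cause of "installed but not found" is that the library's .pc file is not on pkg-config's search path. A minimal sketch for checking that; pc_has is a hypothetical helper name, and the exact fix from the linked answer is not reproduced here.

```shell
# Check whether pkg-config can see a module at a minimum version.
# Returns 0 if visible, 1 if not, 2 if pkg-config itself is missing.
pc_has() {
    # $1: module name (e.g. pango), $2: minimum version (e.g. 1.22.0)
    if ! command -v pkg-config >/dev/null 2>&1; then
        echo "pkg-config is not installed" >&2
        return 2
    fi
    if pkg-config --atleast-version="$2" "$1"; then
        echo "$1 $(pkg-config --modversion "$1") is visible to pkg-config"
    else
        echo "$1 >= $2 is not visible; PKG_CONFIG_PATH=${PKG_CONFIG_PATH:-<unset>}" >&2
        return 1
    fi
}
```

For example, pc_has pango 1.22.0 mirrors the version that the configure warning asks for. If it fails while the library is installed, adding the directory that contains pango.pc to PKG_CONFIG_PATH and running configure again is the usual next step.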