tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

4.0 bugs on MAC OS X and a step by step for reference #1453

Closed FernandoGOT closed 5 years ago

FernandoGOT commented 6 years ago

This is the step-by-step guide I used to install tesseract 4.0 on my MAC OS X, along with the fixes and workarounds I needed to make it work. I'm sharing this "guide" to help other people who may run into the same problems I had.

Special thanks to Shree, who helped me on the Google group.

Project and more details: https://github.com/tesseract-ocr/tesseract

Where to get help?

Google group: https://groups.google.com/forum/#!forum/tesseract-ocr
Git: https://github.com/tesseract-ocr/tesseract/issues

Platform: MAC OS X 10.13.3
Tesseract: 4.0.0-beta.1-69-g10f4
 leptonica-1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

 Found AVX2
 Found AVX
 Found SSE

Compiling Tesseract - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/Compiling#macos

Warning: Don't install tesseract using brew, since you can't generate ScrollView.jar from that install! (At least I wasn't able to generate it.)

Steps

1 - Install these libs

brew install automake autoconf autoconf-archive libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
brew install gcc

2 - Run this command to point /usr/local/opt/icu4c at the icu4c version in the Cellar

ln -hfs /usr/local/Cellar/icu4c/60.2 /usr/local/opt/icu4c

Note: text2image is set to use icu4c/60.2, but the version actually installed is icu4c/61.1
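Since the installed icu4c version changes over time, a small helper can look it up instead of hard-coding it. This is only a sketch: latest_version_dir is a hypothetical name, and it assumes the Homebrew Cellar layout shown in step 2. It picks the lexicographically last directory, which works for 60.2 vs 61.1 but is not a real version comparison.

```shell
# Print the lexicographically last version directory under a Cellar-style
# path, e.g. /usr/local/Cellar/icu4c/61.1 (hypothetical helper).
latest_version_dir() {
  cellar_dir=$1
  latest=""
  for candidate in "$cellar_dir"/*/; do
    # Skip the unexpanded pattern when the directory is empty or missing.
    [ -d "$candidate" ] || continue
    latest=${candidate%/}
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Usage (not run here): link the detected version instead of a fixed one.
# icu_dir=$(latest_version_dir /usr/local/Cellar/icu4c) &&
#   ln -hfs "$icu_dir" /usr/local/opt/icu4c
```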

3 - Clone tesseract repo

git clone https://github.com/tesseract-ocr/tesseract/

4 - Enter in the folder

cd tesseract

5 - Run the script

./autogen.sh

6 - Run this command and copy the CPPFLAGS and LDFLAGS values it prints

brew info icu4c

7 - Update CPPFLAGS and LDFLAGS with those values and run configure

./configure \
  CPPFLAGS=-I/usr/local/opt/icu4c/include \
  LDFLAGS=-L/usr/local/opt/icu4c/lib
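Steps 6 and 7 can also be scripted. The sketch below builds the two flags from an install prefix. configure_flags_for_prefix is a hypothetical helper, and the usage line assumes Homebrew is installed and that the prefix contains no spaces.

```shell
# Print configure arguments for a library installed under the given prefix
# (hypothetical helper).
configure_flags_for_prefix() {
  prefix=$1
  printf 'CPPFLAGS=-I%s/include LDFLAGS=-L%s/lib\n' "$prefix" "$prefix"
}

# Usage (not run here). The substitution is left unquoted on purpose so
# that it splits into two separate arguments:
# ./configure $(configure_flags_for_prefix "$(brew --prefix icu4c)")
```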

8 - Build tesseract

make -j

9 - Install it

sudo make install

10 - Update the dynamic linker shared cache

sudo update_dyld_shared_cache

Note: this is the MAC OS X equivalent of sudo ldconfig

11 - Build the training tools

make training
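Step 8 runs make -j without a number, which puts no limit on parallel jobs. If that overloads the machine, one option is to cap the jobs at the CPU count. A minimal sketch, where cpu_count is a hypothetical helper that tries the macOS sysctl key first and falls back to getconf:

```shell
# Print the number of CPUs, falling back to 1 if it cannot be detected.
cpu_count() {
  count=$(sysctl -n hw.ncpu 2>/dev/null) ||
    count=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
    count=1
  printf '%s\n' "$count"
}

# Usage (not run here):
# make -j"$(cpu_count)"
```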

Creating ScrollView.jar - tesseract 4.0

Reference:
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#lstmtraining-command-line
https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging

Important: Use JDK 8 to build; otherwise the build fails with an error

Steps

1 - Download the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar

http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-core/3.0/piccolo2d-core-3.0.jar
http://search.maven.org/remotecontent?filepath=org/piccolo2d/piccolo2d-extras/3.0/piccolo2d-extras-3.0.jar

2 - Move the files piccolo2d-core-3.0.jar and piccolo2d-extras-3.0.jar to tesseract/java

3 - Enter the tesseract/java folder

cd java

4 - Set the SCROLLVIEW_PATH variable to your tesseract/java folder and run the command

SCROLLVIEW_PATH=~/projects/tesseract/java make ScrollView.jar
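Because the build needs JDK 8, it can be worth checking the active Java version before running make. The sketch below only parses a version string. java_major_version is a hypothetical helper, and the usage lines assume java -version prints the version in double quotes on its first line.

```shell
# Map a Java version string to its major version:
# legacy "1.8.0_292" style gives 8, newer "11.0.2" style gives 11.
java_major_version() {
  version=$1
  case $version in
    1.*) version=${version#1.} ;;   # drop the legacy "1." prefix
  esac
  # Keep only the leading digits.
  printf '%s\n' "${version%%[!0-9]*}"
}

# Usage (not run here):
# v=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
# [ "$(java_major_version "$v")" = "8" ] || echo "JDK 8 is not active"
```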

Training Font - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#user-content-using-tesstrain

Steps

1 - Clone the langdata dir from git

git clone https://github.com/tesseract-ocr/langdata

2 - Enter the tesseract folder

cd ..

3 - Run this command and select one font from the list (I recommend "Verdana")

text2image --list_available_fonts --fonts_dir=/Library/Fonts

Font dir for MAC can be:
~/Library/Fonts
/Library/Fonts/
/Network/Library/Fonts/
/System/Library/Fonts/
/System Folder/Fonts/

More details here: https://support.apple.com/en-us/HT201722

4 - In the file tesseract/training/tesstrain_utils.sh, replace line 195 as follows

- export FONT_CONFIG_CACHE=$(mktemp -d --tmpdir font_tmp.XXXXXXXXXX)
+ export FONT_CONFIG_CACHE=$(mktemp -d -t font_tmp.XXXXXXXXXX)

Obs.: this is a fix for the error:

mktemp: illegal option -- -
usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
       mktemp [-d] [-q] [-u] -t prefix
/Users/username/projects/tesseract/training/tesstrain_utils.sh: line 197: /sample_text.txt: Permission denied
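The error comes from the GNU-only --tmpdir option, which the mktemp shipped with MAC OS X rejects. As an alternative to the -t fix above, passing a full template path avoids both options. This is a sketch rather than the change made in the repository, and make_font_cache_dir is a hypothetical name.

```shell
# Create a temporary directory using only a template path, a form that
# both the GNU and the BSD/macOS mktemp accept.
make_font_cache_dir() {
  mktemp -d "${TMPDIR:-/tmp}/font_tmp.XXXXXXXXXX"
}

# Usage (not run here):
# export FONT_CONFIG_CACHE=$(make_font_cache_dir)
```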

5 - Clone a tessdata repo from git (I recommend "tessdata_best" since it is the most accurate; "tessdata_fast" is just faster)

git clone https://github.com/tesseract-ocr/tessdata_best

or

git clone https://github.com/tesseract-ocr/tessdata_fast

6 - Copy eng.traineddata (for English training) from the tessdata repo you just cloned, e.g. tessdata_best/eng.traineddata, and paste it into tesseract/tessdata/

7 - Create the training data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/engtrain

Note: add the PANGOCAIRO_BACKEND=fc prefix if you are on MAC OS X

8 - Create other training data using other font to compare

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --exposures "0"    \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
  --output_dir ~/tesstutorial/engeval

Note: add the PANGOCAIRO_BACKEND=fc prefix if you are on MAC OS X
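Steps 7 and 8 differ only in the font and the output folder, so they can be wrapped in one function. This is a sketch under the same path assumptions as the commands above. make_line_data and the TESSTRAIN variable are hypothetical names introduced here.

```shell
# Path to tesstrain.sh; override TESSTRAIN if your checkout lives elsewhere.
TESSTRAIN=${TESSTRAIN:-"$HOME/projects/tesseract/training/tesstrain.sh"}

# Generate line training data for one font into one output folder.
make_line_data() {
  font=$1
  output_dir=$2
  PANGOCAIRO_BACKEND=fc "$TESSTRAIN" \
    --fonts_dir /Library/Fonts \
    --lang eng \
    --linedata_only \
    --noextract_font_properties \
    --exposures "0" \
    --langdata_dir "$HOME/projects/langdata" \
    --tessdata_dir "$HOME/projects/tesseract/tessdata" \
    --fontlist "$font" \
    --output_dir "$output_dir"
}

# Usage (not run here):
# make_line_data "Verdana" ~/tesstutorial/engtrain
# make_line_data "Times New Roman," ~/tesstutorial/engeval
```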

9 - Create the needed folder

mkdir -p ~/tesstutorial/engoutput

10 - Start the training

SCROLLVIEW_PATH=~/projects/tesseract/java \
~/projects/tesseract/training/lstmtraining \
--debug_interval 100 \
--traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
--net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
--model_output ~/tesstutorial/engoutput/base \
--learning_rate 20e-4 \
--train_listfile ~/tesstutorial/engtrain/eng.training_files.txt \
--eval_listfile ~/tesstutorial/engeval/eng.training_files.txt \
--max_iterations 5000 &>~/tesstutorial/engoutput/basetrain.log

If you failed to build ScrollView.jar, set the debug interval to -1: --debug_interval -1

11 - Monitor the log on another console

tail -f ~/tesstutorial/engoutput/basetrain.log

12 - Test Accuracy with other font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

13 - Test Accuracy with best traindata

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tessdata_best/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

14 - Test Accuracy with actual traindata (in this case the same as step 13)

~/projects/tesseract/training/lstmeval \
  --model ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt

Fine tuning - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact

Steps

1 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_small

2 - Start fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_small/verdana \
  --continue_from ~/tesstutorial/engoutput/base_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 1200

3 - Validate the progress

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_small/verdana_checkpoint \
  --traineddata ~/tesstutorial/engtrain/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

4 - Create the necessary folder

mkdir -p ~/tesstutorial/verdana_from_full

5 - Extract the LSTM model from the trained data (combine_tessdata -e)

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/verdana_from_full/eng.lstm

6 - Train merged data

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/verdana_from_full/verdana \
  --continue_from ~/tesstutorial/verdana_from_full/eng.lstm \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/engeval/eng.training_files.txt \
  --max_iterations 400

7 - Validate the results on the main training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engeval/eng.training_files.txt

8 - Validate the results on our training file

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/verdana_from_full/verdana_checkpoint \
  --traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --eval_listfile ~/tesstutorial/engtrain/eng.training_files.txt

Fine tuning add ± character - tesseract 4.0

Reference: https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters

Steps

1 - Modify langdata/eng/eng.training_text and include these lines:

alkoxy of LEAVES ±1.84% by Buying curved RESISTANCE MARKED Your (Vol. SPANIEL
TRAVELED ±85¢ , reliable Events THOUSANDS TRADITIONS. ANTI-US Bedroom Leadership
Inc. with DESIGNS self; ball changed. MANHATTAN Harvey's ±1.31 POPSET Os—C(11)
VOLVO abdomen, ±65°C, AEROMEXICO SUMMONER = (1961) About WASHING Missouri
PATENTSCOPE® # © HOME SECOND HAI Business most COLETTI, ±14¢ Flujo Gilbert
Dresdner Yesterday's Dilated SYSTEMS Your FOUR ±90° Gogol PARTIALLY BOARDS firm
Email ACTUAL QUEENSLAND Carl's Unruly ±8.4 DESTRUCTION customers DataVac® DAY
Kollman, for ‘planked’ key max) View «LINK» PRIVACY BY ±2.96% Ask! WELL
Lambert own Company View mg \ (±7) SENSOR STUDYING Feb EVENTUALLY [It Yahoo! Tv
United by #DEFINE Rebel PERFORMED ±500Gb Oliver Forums Many | ©2003-2008 Used OF
Avoidance Moosejaw pm* ±18 note: PROBE Jailbroken RAISE Fountains Write Goods (±6)
Oberflachen source.” CULTURED CUTTING Home 06-13-2008, § ±44.01189673355 €
netting Bookmark of WE MORE) STRENGTH IDENTICAL ±2? activity PROPERTY MAINTAINED
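Before regenerating the training data, it may help to confirm that the edited file really contains the new character. A minimal sketch, where count_lines_with is a hypothetical helper that counts the lines containing a fixed string:

```shell
# Print how many lines of a file contain the given fixed string.
count_lines_with() {
  needle=$1
  file=$2
  grep -c -F -- "$needle" "$file"
}

# Usage (not run here):
# count_lines_with '±' ~/projects/langdata/eng/eng.training_text
```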

2 - Generate the training file

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Times New Roman," \
              "Times New Roman, Bold" \
              "Times New Roman, Bold Italic" \
              "Times New Roman, Italic" \
              "Courier New" \
              "Courier New Bold" \
              "Courier New Bold Italic" \
              "Courier New Italic" \
  --output_dir ~/tesstutorial/trainplusminus

3 - Generate the eval data

PANGOCAIRO_BACKEND=fc \
~/projects/tesseract/training/tesstrain.sh \
  --fonts_dir /Library/Fonts \
  --lang eng \
  --linedata_only \
  --noextract_font_properties \
  --langdata_dir ~/projects/langdata \
  --tessdata_dir ~/projects/tesseract/tessdata \
  --fontlist "Verdana" \
  --output_dir ~/tesstutorial/evalplusminus

4 - Extract the LSTM model from the trained data file (combine_tessdata -e)

~/projects/tesseract/training/combine_tessdata \
  -e ~/projects/tesseract/tessdata/eng.traineddata \
  ~/tesstutorial/trainplusminus/eng.lstm

5 - Fine tuning

~/projects/tesseract/training/lstmtraining \
  --model_output ~/tesstutorial/trainplusminus/plusminus \
  --continue_from ~/tesstutorial/trainplusminus/eng.lstm \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --old_traineddata ~/projects/tesseract/tessdata/eng.traineddata \
  --train_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt \
  --max_iterations 3600

6 - Test the result on other fonts

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/trainplusminus/eng.training_files.txt

7 - Test the result on the main font

~/projects/tesseract/training/lstmeval \
  --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
  --traineddata ~/tesstutorial/trainplusminus/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/evalplusminus/eng.training_files.txt

jtlz2 commented 5 years ago

@khalajink Yes, see my answer in that SO thread https://stackoverflow.com/a/57968945/1021819

khalajink commented 5 years ago

@jtlz2 Yes, I followed your answer and got the pango issue fixed, but the text2image issue still exists. Any idea about it?

When I try to run 'text2image --list_available_fonts --fonts_dir=/Library/Fonts', the error is '-bash: /usr/local/bin/text2image: No such file or directory'.
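One way to narrow this down is to check which training tools the shell can actually find. The sketch below is only a diagnostic aid: missing_tools is a hypothetical helper, and the tool names in the usage line are the ones used earlier in this guide. If a tool was moved or removed after bash cached its location, hash -r makes bash forget the cached path.

```shell
# Print each given command name that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage (not run here):
# missing_tools text2image lstmtraining lstmeval combine_tessdata
# hash -r   # drop bash's cached command locations, then retry
```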

wanzulfikri commented 4 years ago

Quoting jtlz2: "@khalajink Yes, see my answer in that SO thread https://stackoverflow.com/a/57968945/1021819"

Thanks for the answer. The commands you shared didn't work for me, but the instructions on how to diagnose the issue helped a lot. It turned out that I did not have zlib installed, so I installed it and now I can finally build the training tools.

nnnikolay commented 3 years ago

I still have a different but somewhat similar problem in 2020.

I've successfully installed the latest Tesseract (master branch) on the latest OSX (11.1 Big Sur).

tesseract 5.0.0-alpha-855-g6d86
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.4.3 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.6 liblz4/1.9.2 libzstd/1.4.5
 Found libcurl/7.64.1 SecureTransport (LibreSSL/2.8.3) zlib/1.2.11 nghttp2/1.41.0

However, my training tools (even though they have been installed) could not find the actual files.

For example, if I call text2image, I see the following error message

This script is just a wrapper for text2image.
See the libtool documentation for more information.
ERROR: Program text2image failed. Abort.

If I enable Debug for the bash script I see the following problem

❯ text2image --list_available_fonts --fonts_dir=~/Library/Fonts
+ sed_quote_subst='s|\([`"$\\]\)|\\\1|g'
+ test -n ''
+ case `(set -o) 2>/dev/null` in
+ set -o posix
+ BIN_SH=xpg4
+ export BIN_SH
+ DUALCASE=1
+ export DUALCASE
+ unset CDPATH
+ relink_command=
+ test '' = '%%%MAGIC variable%%%'
+ test '' '!=' '%%%MAGIC variable%%%'
+ file=/usr/local/bin/text2image
+ ECHO='printf %s\n'
+ lt_option_debug=
+ func_parse_lt_options /usr/local/bin/text2image --list_available_fonts '--fonts_dir=~/Library/Fonts'
+ lt_script_arg0=/usr/local/bin/text2image
+ shift
+ for lt_opt in '"$@"'
+ case "$lt_opt" in
+ for lt_opt in '"$@"'
+ case "$lt_opt" in
+ test -n ''
++ printf '%s\n' /usr/local/bin/text2image
++ /usr/bin/sed 's%/[^/]*$%%'
+ thisdir=/usr/local/bin
+ test x/usr/local/bin = x/usr/local/bin/text2image
++ ls -ld /usr/local/bin/text2image
++ /usr/bin/sed -n 's/.*-> //p'
+ file=
+ test -n ''
+ WRAPPER_SCRIPT_BELONGS_IN_OBJDIR=no
+ test no = yes
++ cd /usr/local/bin
++ pwd
+ absdir=/usr/local/bin
+ test -n /usr/local/bin
+ thisdir=/usr/local/bin
+ program=text2image
+ progdir=/usr/local/bin/.libs
+ test -f /usr/local/bin/.libs/text2image
+ printf '%s\n' '/usr/local/bin/text2image: error: '\''/usr/local/bin/.libs/text2image'\'' does not exist'
/usr/local/bin/text2image: error: '/usr/local/bin/.libs/text2image' does not exist
+ printf '%s\n' 'This script is just a wrapper for text2image.'
This script is just a wrapper for text2image.
+ printf '%s\n' 'See the libtool documentation for more information.'
See the libtool documentation for more information.
+ exit 1

Basically, none of the training tools can find their actual executable files, which are located under `tesseract/.libs/`

Did I miss something during the configuration?
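To check whether an installed tool is a libtool wrapper script rather than the real binary, one can search the file for the message seen in the trace above. This is a sketch: is_libtool_wrapper is a hypothetical helper, and it assumes the wrapper script contains that message text literally.

```shell
# Succeed if the given file looks like a libtool wrapper script, judged
# by the message text from the trace (an assumption, not a libtool API).
is_libtool_wrapper() {
  grep -q 'This script is just a wrapper for' "$1" 2>/dev/null
}

# Usage (not run here):
# is_libtool_wrapper /usr/local/bin/text2image && echo "wrapper, not the binary"
```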

stweil commented 3 years ago

@nnnikolay, I am sorry, that was my fault. It is now fixed with commit 421ebf0418f415c2ca270521243d4edc36dd44bf.

nnnikolay commented 3 years ago

wow, @stweil thank you for your swift reaction. it seems that this step works now!

ching2018 commented 3 years ago

You can see the error details in tesseract/build/config.log: pango 1.22.0 or higher is required, but was not found!
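A quick way to find those lines is to search config.log for pango. A minimal sketch, where pango_messages is a hypothetical helper and the log path is the one mentioned above. The commented pkg-config check assumes pkg-config is installed and that the module is named pango.

```shell
# Print every line of a configure log that mentions pango, with line numbers.
pango_messages() {
  grep -i -n 'pango' "$1"
}

# Usage (not run here):
# pango_messages tesseract/build/config.log
# pkg-config --atleast-version=1.22.0 pango && echo "pango is new enough"
```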