mithilesh1125 / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Tesseract 2.04 rebuilding with libtiff in windows xp-sp3 #663

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Open the solution in VC++ 2008 and set the configuration to "Release".
2. Add $(ProgramFiles)\GnuWin32\lib and $(ProgramFiles)\GnuWin32\include as
additional paths.
3. Add HAVE_LIBTIFF to the preprocessor definitions of the "tesseract" project
in the Solution Explorer.
4. Add libtiff.lib as an additional dependency.

What is the expected output? What do you see instead?
Successful completion with exe files, as the configuration mode is "release".

What version of the product are you using? On what operating system?
Tesseract 2.04 source (tesseract-2.04.tar.gz); libtiff (3.8.2); windows xp with 
sp3 on 32 bit machine. 

Please provide any additional information below.
Installed libtiff and added c:/Program Files/GnuWin32/bin to the system PATH
environment variable.
a) Unpacked the source (tesseract-2.04 from tesseract-2.04.tar.gz) into
c:\projects
b) Added the eight English language data files from tesseract-2.00.eng.tar.gz
to the tessdata folder
c) Started VC++ 2008 by clicking on tesseract.sln
d) Changed the Configuration Manager setting to "Release" for Win32
e) From Tools -> Options, added the libtiff paths $(ProgramFiles)\GnuWin32\lib
and $(ProgramFiles)\GnuWin32\include
f) Opened the tesseract project property pages and added HAVE_LIBTIFF to the
preprocessor definitions of the tesseract project
g) Added libtiff.lib as an additional dependency
h) Started the build

The build failed: 6 projects succeeded and 1 failed.
The seven log files are enclosed.

There are also a lot of warning messages about variables of different types
(int, float, double) being mixed in additions, comparisons, etc. Why does the
code produce so many of these?

regards
rnkantan

Original issue reported on code.google.com by rnkan...@gmail.com on 27 Mar 2012 at 1:13

GoogleCodeExporter commented 9 years ago
1. Version 2.04 is very old. There is a 3.02 alpha version in svn, so there
will be no further improvements to the 2.0x versions.
2. If you look at the error message, it is clear that VC++ 2008 is not
linking tesseract to the libtiff library. There could be several reasons for
this, but my guess is that the problem is the "GnuWin32" libtiff you are using
(probably built with mingw). Try googling for "Linking to libraries from
different compilers".

Original comment by zde...@gmail.com on 27 Mar 2012 at 2:59

GoogleCodeExporter commented 9 years ago
Thank you very much, zde.. (http://code.google.com/u/117377429268285189819/).

The real issue is that I am not a C++ programmer or Visual Studio user. I
followed instructions given elsewhere in this group; unfortunately, those
instructions simply said that libtiff.lib is added as an additional
dependency. Out of ignorance, I did this in the custom build.

The real solution is given by Antonio Rubby (see
http://social.msdn.microsoft.com/Forums/en/Vsexpressvc/thread/cee0448f-5435-4fc1-85f0-ae18fb71944d).

I am presenting the revised steps below:
============
This write-up (on recompiling/building tesseract 2.04 with libtiff support for
Windows) is by jwaddell, from
http://superuser.com/questions/149568/command-line-ocr-in-windows-7
and is slightly annotated for clarity.

Step-1a
Download tesseract 2.04. There are two downloads available: the Windows
executable and the source. The Windows executable is in
tesseract-2.04.exe.tar.gz, from
http://tesseract-ocr.googlecode.com/files/tesseract-2.04.exe.tar.gz. It
contains stand-alone Windows executables (no installation required), but it
does not include the language files and is not linked against libtiff. For
users with single-page uncompressed TIFFs this works very well, but remember
that you still need language files. The English language files are in
http://tesseract-ocr.googlecode.com/files/tesseract-2.00.eng.tar.gz
Unpack tesseract-2.04.exe.tar.gz to, say, C:\tesseract204. It will contain two
folders ("java" and "training") and four files (tessdll.dll, tessdll.lib,
tesseract.exe and dlltest.exe). Now create a subdirectory "tessdata" and
unpack the eight files from tesseract-2.00.eng.tar.gz into it, giving:
C:\tesseract204\java\
C:\tesseract204\training\
C:\tesseract204\tessdll.dll
C:\tesseract204\tessdll.lib
C:\tesseract204\tesseract.exe
C:\tesseract204\dlltest.exe
C:\tesseract204\tessdata\eng.freq-dawg
C:\tesseract204\tessdata\eng.word-dawg
C:\tesseract204\tessdata\eng.user-words
C:\tesseract204\tessdata\eng.inttemp
C:\tesseract204\tessdata\eng.normproto
C:\tesseract204\tessdata\eng.pffmtable
C:\tesseract204\tessdata\eng.unicharset
C:\tesseract204\tessdata\eng.DangAmbigs
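
The layout listed above can be checked with a short script before running tesseract for the first time. This is only a sketch: the file names are taken from the listing, while the helper name `missing_tesseract_files` and the C:\tesseract204 root are illustrative assumptions.

```python
from pathlib import Path

# The four top-level files and eight English data files from the listing above.
EXECUTABLE_FILES = ["tessdll.dll", "tessdll.lib", "tesseract.exe", "dlltest.exe"]
ENG_DATA_FILES = [
    "eng.freq-dawg", "eng.word-dawg", "eng.user-words", "eng.inttemp",
    "eng.normproto", "eng.pffmtable", "eng.unicharset", "eng.DangAmbigs",
]


def missing_tesseract_files(root):
    """Return the expected files that are absent under `root`.

    Hypothetical helper, not part of tesseract. Paths are returned
    relative to `root`, using forward slashes.
    """
    root = Path(root)
    expected = [root / name for name in EXECUTABLE_FILES]
    expected += [root / "tessdata" / name for name in ENG_DATA_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    # Assumed unpack location; change it to wherever you unpacked the archive.
    missing = missing_tesseract_files(r"C:\tesseract204")
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all expected files are present")
```

An empty result means the executables and all eight language files are where the listing says they should be.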

Step-1b
Windows users who intend to do training need additional files in two folders,
"configs" and "tessconfigs". These two are available in the source:
http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz
Using a suitable unpacker (7-Zip or PeaZip), extract the two folders from the
tessdata folder inside the tesseract-2.04 folder of the source tarball.

Step-2
More versatile users who want to use libtiff need to recompile and build the
executables. You need the source (tesseract-2.04.tar.gz), libtiff (see
http://gnuwin32.sourceforge.net/packages/tiff-win32.htm for details and get
the full download from http://gnuwin32.sourceforge.net/downlinks/tiff.php) and
VC Express 2008 (web install from
http://msdn.microsoft.com/en-us/express/future/bb421473). Remember that this
is a web install, and downloading the total of ~92 MB will take a good while.
The offline ISO image from Microsoft is much bigger, as it contains the
complete Visual Studio; you can get individual offline images as described in
this blog post:
http://vicker313.wordpress.com/2008/11/26/how-to-offline-install-visual-studio-express-without-download-the-whole-image-file/

1) Install libtiff. On a 64-bit Win-7 system the suggested install directory
is C:\Program Files (x86)\GnuWin32; on 32-bit Win-7 or XP it is C:\Program
Files\GnuWin32. Underneath this directory are a bunch of subdirectories
containing files we'll need to compile tesseract with tiff support, namely
include, bin and lib. Add C:\Program Files (x86)\GnuWin32\bin to your PATH
environment variable so that the output tesseract.exe can find the libtiff
dll. (This is done from Control Panel / System / Advanced, by selecting
Environment Variables.)
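
Whether the bin directory actually ended up on PATH can be checked with a short sketch like the following. The helper name `dir_on_path` is an illustrative assumption; the comparison ignores case, slash direction and trailing slashes, since Windows treats those as equivalent.

```python
import os


def _normalize(path):
    """Normalize a Windows-style directory string for comparison."""
    return path.strip().strip('"').replace("/", "\\").rstrip("\\").lower()


def dir_on_path(directory, path_value=None, sep=None):
    """Return True if `directory` is one of the entries of a PATH-style string.

    Hypothetical helper. `path_value` defaults to the current PATH and
    `sep` to the platform separator (";" on Windows).
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    if sep is None:
        sep = os.pathsep
    target = _normalize(directory)
    entries = [e for e in path_value.split(sep) if e.strip()]
    return any(_normalize(entry) == target for entry in entries)


if __name__ == "__main__":
    # Assumed 64-bit install location from the step above.
    print(dir_on_path(r"C:\Program Files (x86)\GnuWin32\bin"))
```

Run it in a new command prompt (or after the restart in step 5), because already-open windows keep the old PATH.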

2) If you have not used the web install for VC++ 2008, do the offline install of VC.

3) Unpack the source (tesseract-2.04.tar.gz). In this example I've unpacked to
C:\projects\tesseract-2.04. (Windows 7 / Win XP will not understand .tar.gz out
of the box. My recommendation is to get a copy of 7-Zip.)

4. Download your required language files. (Note: since we are now talking
about tesseract 2.04, do not use language packs meant for version 3.00 and
up.) Unpack these to the tessdata subdirectory, C:\projects\tesseract-2.04\tessdata.

5. Restart the machine (so that the environment variable and path changes
take effect).

6. Open the VC solution (double-click
C:\projects\tesseract-2.04\tesseract.sln).

7. Visual Studio now opens the VC GUI with the Solution Explorer in the left
panel (if not, press CTRL+ALT+L or select "Solution Explorer" from the "View"
menu). The Solution Explorer should show Solution 'tesseract' (7 projects).

In the icon/menu strip you will see a drop-down list with Debug as the default
option. This is the "solution configuration". Change it to "Release" mode from
the drop-down list. Note that if you later change back to Debug mode, you'll
need to set up all of the following again...

8. In the Solution Explorer, right-click the solution node (Solution
'tesseract') and click "Properties". This opens a pop-up window titled
"Solution 'tesseract' Property Pages", whose left panel has "Common
Properties" and "Configuration Properties". Change to "Configuration
Properties" and select / confirm the "Release" configuration from the dropdown
at the top of the window. Press OK to close the property window.

9) Navigate to Tools -> Options. This opens a pop-up window titled "Options";
in the left panel select Projects and Solutions -> VC++ Directories. Here
we'll add the full paths of the lib and include subdirectories of the libtiff
install so that VC can find the required header (.h) and library (.lib) files.
In this example they are $(ProgramFiles)\GnuWin32\include and
$(ProgramFiles)\GnuWin32\lib, as I'm using an environment variable. I could,
however, just have written them out in full, e.g. C:\Program Files
(x86)\GnuWin32\include.

Change the "Show Directories For" dropdown to "Include files". Add the 
following: $(ProgramFiles)\GnuWin32\include Now change the "Show Directories 
For" dropdown to "Library files". Add the following: 
$(ProgramFiles)\GnuWin32\lib
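
Before building, it can help to confirm that the two directories added above really contain what the compiler and linker will look for. The sketch below checks for tiffio.h (libtiff's main public header) and for libtiff.lib (the library named in step 11). The helper name is an illustrative assumption, and so is using the ProgramFiles environment variable to find the GnuWin32 root.

```python
import os
from pathlib import Path

# Files the build is expected to need, relative to the GnuWin32 root.
REQUIRED_BUILD_FILES = ["include/tiffio.h", "lib/libtiff.lib"]


def missing_libtiff_build_files(gnuwin32_root):
    """Return the required header/library files that are absent (hypothetical helper)."""
    root = Path(gnuwin32_root)
    return [rel for rel in REQUIRED_BUILD_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumption: libtiff was installed to the default GnuWin32 directory.
    program_files = os.environ.get("ProgramFiles", r"C:\Program Files")
    missing = missing_libtiff_build_files(Path(program_files) / "GnuWin32")
    print("missing: " + ", ".join(missing) if missing else "header and library found")
```

If either file is reported missing, the install directory probably differs from the one entered in VC++ Directories.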

10. Now open the project properties window for the tesseract project. (Note:
seven projects are listed in the Solution Explorer: cntraining, dlltest,
mftraining, tessdll, tesseract, unicharset_Extractor and wordlist2dawg. Select
the tesseract project and right-click it to open the properties page; this
opens a pop-up titled "tesseract Property Pages".) Navigate the horrendous
list of options to Configuration Properties -> C/C++ -> Preprocessor. In the
right panel you will see Preprocessor Definitions with a list; click on it to
open an editable list and add HAVE_LIBTIFF to the Preprocessor Definitions.
This enables a bunch of #includes in the code that are guarded by this
definition.

11. You also need to add an "Additional Dependency". In the property pages of
the tesseract project, select "Configuration Properties > Linker > Input >
Additional Dependencies" and add libtiff.lib. Click "Apply" and close the
property window. (This was clarified by Antonio Rubby, see
http://social.msdn.microsoft.com/Forums/en/Vsexpressvc/thread/cee0448f-5435-4fc1-85f0-ae18fb71944d)

12. Build the solution. Watch the error list. If you get a bunch of LNK2019
(unresolved external symbol) errors, that means the linker can't find
something tesseract references. You're missing a reference to one of the paths
from libtiff. If you get an error mentioning mt.exe, you've possibly
encountered a bug in the SDK. Just try building again. See
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=106634
for more info.

If/when the solution builds successfully, you'll have a tesseract.exe file in
the same directory as the tesseract solution file. Drag your multipage
compressed tiff there and try running tesseract.

Hopefully (fingers crossed, heh) you've now got an OCR'd out.txt file sitting 
in C:\projects\tesseract-2.04.
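
The final run can also be scripted. The sketch below assumes the usual tesseract command-line form, `tesseract <image> <outputbase> -l <lang>`, where tesseract appends .txt to the output base. The helper names and the scan.tif file name are illustrative assumptions, not part of tesseract.

```python
import subprocess
from pathlib import Path


def tesseract_command(exe, image, output_base="out", lang="eng"):
    """Build the argument list for one tesseract run (hypothetical helper)."""
    return [str(exe), str(image), str(output_base), "-l", lang]


def run_ocr(exe, image, output_base="out", lang="eng"):
    """Run tesseract and return the path of the text file it should produce."""
    subprocess.run(tesseract_command(exe, image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Assumed locations: the freshly built exe and a TIFF dragged next to it.
    exe = Path(r"C:\projects\tesseract-2.04\tesseract.exe")
    if exe.is_file():
        print(run_ocr(exe, r"C:\projects\tesseract-2.04\scan.tif",
                      r"C:\projects\tesseract-2.04\out"))
    else:
        print("tesseract.exe not found; build the solution first")
```

With `check=True`, a non-zero exit code from tesseract raises an exception instead of passing silently, for example when the TIFF cannot be read.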

Original comment by rnkan...@gmail.com on 27 Mar 2012 at 4:22

GoogleCodeExporter commented 9 years ago
2.04 was released in June 2009. It is now March 2012, and the release process
for version 3.02 has already started (see the forums)...
2.04 supported 5 languages; 3.02 will support 68 languages. The API has
changed since then, etc...

I hope you have a very good reason to spend time on an unsupported version.

Original comment by zde...@gmail.com on 27 Mar 2012 at 6:56