Closed GoogleCodeExporter closed 9 years ago
Delphi 7 and XP. :p
Original comment by rfwo...@gmail.com
on 12 Jan 2011 at 12:25
There is always a debate about C versus C++. For DLLs and static objects, C is
always preferred because it is faster and more portable. The ideal way to
develop software is to write it in C and then wrap it in C++. GTK has always
followed this direction, while unfortunately Tesseract 3 is heading in another.
So, to wrap it to other programming language, say python, one may inevitably be
required to go through some tedious steps to wrap the c++ class in tesseract 3
back to c library. Of which, the especially troublesome class is "STRING"
derived from apache.
rtwoolf> As I have used neither Delphi nor XP, it may take time for me to
explore what is going on. Will keep you informed of the progress.
Original comment by FreeT...@gmail.com
on 12 Jan 2011 at 12:48
"So, to wrap it to other programming language, say python, one may inevitably
be required to go through some tedious steps to wrap the c++ class in tesseract
3 back to c library."
I'm a little confused. By the looks of it, you got everything working in Python,
but you didn't recompile the DLL in C.
Original comment by rfwo...@gmail.com
on 12 Jan 2011 at 1:00
I'm wondering whether it would be best to include SWIG wrappers for certain
languages (Java, Python, C) with Tesseract itself or whether it would be better
to maintain these separately. I suspect also having to support Java and Python
would complicate the build process. Also, the C version requires a forked
version of SWIG. Any thoughts, Jimmy?
Original comment by JerseyChewi@gmail.com
on 12 Jan 2011 at 2:01
Android already includes JNI bindings for tesseract; using SWIG for it would be
an unnecessary duplication of effort.
I really couldn't care less about Python, so I'm not going to register an
opinion for or against, but I will note that the current trend seems to be to
prefer using ctypes.
The main reason why having a C binding is so appealing is that most other
languages have facilities for using C libraries, which would reduce the effort
in making language bindings all round, so it seemed like having a C wrapper was
the best way in general.
Original comment by joregan
on 12 Jan 2011 at 2:25
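As a rough illustration of the point above (most languages can call plain C functions directly), here is a minimal ctypes sketch. It does not touch Tesseract at all: it calls `strlen` from the C runtime, on the assumption that `ctypes.util.find_library("c")` can locate it, which is typical on Linux and macOS but not on Windows. The helper name `c_strlen` is made up for this example.

```python
import ctypes
import ctypes.util

# Locate and load the C runtime. If find_library returns None, CDLL(None)
# falls back to the symbols of the running process on POSIX systems.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C prototype: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text: bytes) -> int:
    """Call the C strlen function on a byte string (hypothetical helper)."""
    return libc.strlen(text)
```

No wrapper generator or compiled extension is involved; the same approach would apply to any library that exposes a flat C interface.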
True, I'd forgotten about the Android port but I don't think this has been
updated for Tesseract 3?
Original comment by JerseyChewi@gmail.com
on 12 Jan 2011 at 3:04
"The main reason why having a C binding is so appealing is that most other
languages have facilities for using C libraries" - exactly!
Original comment by rfwo...@gmail.com
on 12 Jan 2011 at 3:10
I wasn't questioning whether we should have C bindings at all. It's just that
SWIG supports quite a few languages and we'll get better results for less work
if we target those directly instead of going via what it generates for C.
Original comment by JerseyChewi@gmail.com
on 12 Jan 2011 at 3:17
The Android port was updated to Tesseract 3 quite some time ago. Before
Tesseract 3 was released, in fact.
Original comment by joregan
on 12 Jan 2011 at 3:17
Hello, I tried to install the SWIG package and I am not able to build. There
were lot of errors during the build. The last error was:
publictypes.h:96: error: 'tesseract::PSM_COUNT' has a previous declaration as
'tesseract::PageSegMode tesseract::PSM_COUNT'
error: command 'gcc' failed with exit status 1
I was using the command "python setup.py build". Has any of you had a similar
issue? Please advise. Thank you for your time.
Original comment by vijay111...@gmail.com
on 19 Feb 2011 at 7:43
svn changed.
Try http://code.google.com/p/python-tesseract/downloads/list
Will look into it when I am free.
Original comment by FreeT...@gmail.com
on 20 Feb 2011 at 4:55
I have tried the svn version of tesseract-ocr today vs the swig_svn.7z in
http://code.google.com/p/python-tesseract/downloads/list.
"python setup.py build" doesn't yield any problems. For your information, I am
using Ubuntu Maverick.
Original comment by FreeT...@gmail.com
on 20 Feb 2011 at 5:18
Hello everybody,
I have written a small C Wrapper (not complete but covers the most important
part).
I would like to share it, and ideally it would be included in the project.
It is based on tesseract 3.01, so if there are any major changes in the C++
API, probably it would need some changes.
Comments are welcome!
Original comment by trop...@gmail.com
on 2 Apr 2012 at 8:11
Attachments:
BTW, those files just have to be added to the project alongside baseapi.h/.cpp
Original comment by trop...@gmail.com
on 2 Apr 2012 at 8:25
I put the files in api folder and included them in tesseract project (r639).
However, the compiler generated > 100 errors, most of which are as follows:
Error 1 error C2143: syntax error : missing ';' before 'const' c:\projects\tesseract-3.0.1\api\capi.h 96 tesseract
Error 2 error C4430: missing type specifier - int assumed. Note: C++ does not support default-int c:\projects\tesseract-3.0.1\api\capi.h 96 tesseract
Error 3 error C2144: syntax error : 'void' should be preceded by ';' c:\projects\tesseract-3.0.1\api\capi.h 98 tesseract
Error 5 error C2086: 'int TESSDLL_API' : redefinition c:\projects\tesseract-3.0.1\api\capi.h 98 tesseract
Original comment by nguyen...@gmail.com
on 3 Apr 2012 at 2:46
Oh that's unfortunate. I made a last minute change without testing it.
Here's the new one.
Anyway, are you compiling the 3.01 project? In 3.01 (Windows) there is not yet
a project for a dll, only the executable.
What I have done is just changed the "tesseract" project from "Executable" to
"DLL" in the preferences and defined TESSDLL_EXPORTS also in the Project
settings.
Additionally you probably want to change the output name from tesseract.exe to
tesseract.dll or similar.
This is obviously a hack; in the long term you would want a separate project
for the DLL. But I believe this is already done in SVN.
Original comment by trop...@gmail.com
on 3 Apr 2012 at 7:12
Attachments:
I've been able to build the DLL with both 3.01 and, with little change, 3.02
alpha. However, I'm not sure if it was built correctly as my Java program
cannot look up the exposed C methods. I'll come back to it when I have more
time.
Meanwhile, can you attach a copy of your C DLL so we can try it out? Thanks.
Original comment by nguyen...@gmail.com
on 4 Apr 2012 at 2:22
You need to define TESSERACT_EXPORTS in the project properties, otherwise the C
Functions are not exported.
I've attached a copy of my DLL. (One Debug and one Release)
Note that I've built the DLLs with VS 2005, so there is a bit of a dependency
hell regarding MSVCRxx.dll. You need both MSVCR80.dll (for the VS 2005
compiled objects) and MSVCR90.dll (for the VS 2008 objects that came with the
source).
For the release version, you probably already have those, for the debug version
it's a bit more difficult.
Those files now are Release.Dynamic
Original comment by trop...@gmail.com
on 5 Apr 2012 at 8:28
Attachments:
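One quick way to check whether a built library really exports a given C function (the lookup problem described above) is to try resolving the symbol by name. The sketch below does this with ctypes. It is demonstrated against the C runtime so that it runs without Tesseract, assuming `ctypes.util.find_library("c")` works on the machine; `exports_symbol` is a hypothetical helper name.

```python
import ctypes
import ctypes.util


def exports_symbol(lib: ctypes.CDLL, name: str) -> bool:
    """Return True if `name` can be resolved in the loaded library."""
    try:
        # ctypes resolves the symbol lazily on attribute access and raises
        # AttributeError when the library does not export it.
        getattr(lib, name)
        return True
    except AttributeError:
        return False


# Demonstration against the C runtime instead of the Tesseract DLL.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
```

Against the wrapper one would load the built DLL instead (for example `ctypes.CDLL("tesseract.dll")`, where the file name depends on the project settings) and test for a name such as `TessBaseAPICreate`.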
And now Debug:
(omitted the PDB, it seems to be too large)
Original comment by trop...@gmail.com
on 5 Apr 2012 at 8:30
Attachments:
Thanks, Troplin. I tried all your files and suggestions but still nothing
worked. After spending some time digging into the old tessdll source code of
Tess 2.04, I made a single change to capi.h from:
#define TESSDLL_CALL __stdcall
to:
#define TESSDLL_CALL __cdecl
then I began to be able to call the exported C functions from my Java wrapper.
I don't understand the significance of this change since I'm not a C/C++
developer.
In preliminary tests with Tess 3.02, the OCR output text appeared to be accurate
with the test images.
Original comment by nguyen...@gmail.com
on 6 Apr 2012 at 9:15
It is just a different calling convention.
If you want to call the function from Java, you need to use the same calling
convention as declared in the C-API.
What technique are you using for your Java Wrapper? JNI, JNA, or something
else?
Usually you can declare the calling convention where you declare the function
prototype.
__stdcall is the standard for all Microsoft Win32 APIs.
__cdecl is the standard for C programs.
Both have their advantages and disadvantages; I think it's a matter of taste
which to use.
__cdecl causes fewer problems in combination with products from non-Microsoft
vendors (e.g. MinGW, Java, etc.)
__stdcall is better suited if you are calling from MS products (like .NET, VB6)
Original comment by trop...@gmail.com
on 10 Apr 2012 at 9:34
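To make the point about matching calling conventions concrete: in Python's ctypes the convention is chosen by the loader class, `ctypes.CDLL` for __cdecl and `ctypes.WinDLL` for __stdcall (the latter exists only on Windows). The sketch below shows that mapping; `loader_for` is a hypothetical helper, and the distinction mainly matters on 32-bit Windows.

```python
import ctypes


def loader_for(convention: str):
    """Map a C calling convention name to the matching ctypes loader class."""
    if convention == "cdecl":
        return ctypes.CDLL
    if convention == "stdcall":
        # WinDLL is only defined on Windows; other platforms have no stdcall.
        loader = getattr(ctypes, "WinDLL", None)
        if loader is None:
            raise OSError("stdcall libraries can only be loaded on Windows")
        return loader
    raise ValueError(f"unknown calling convention: {convention}")
```

A library built with `TESSDLL_CALL` defined as __cdecl would therefore be loaded through `loader_for("cdecl")`; using the wrong loader typically shows up as stack corruption or argument errors at call time rather than at load time.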
Which function in capi.h is called to do the OCR? TessBaseAPIProcessPages? Have
you defined TESSDLL_INCLUDE_BASEAPI?
Could you be kind enough to brief me on how to make your Java wrapper?
Original comment by FreeT...@gmail.com
on 10 Apr 2012 at 1:25
The functions in the C-API are the same as those in the C++ API.
Documentation is in baseapi.h.
I usually do the following sequence:
1. TessBaseAPICreate
2. TessBaseAPIInit3
3. TessBaseAPISetPageSegMode
4. TessBaseAPISetImage
5. TessBaseAPIRecognize
6. TessBaseAPIGetIterator
... (extract text from iterators)
X. TessBaseAPIDelete
Original comment by trop...@gmail.com
on 10 Apr 2012 at 2:55
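The sequence listed above can be sketched with ctypes roughly as follows. The function names come from the list; the argument lists, the return types, and the "0 means success" convention for `TessBaseAPIInit3` and `TessBaseAPIRecognize` are assumptions that would have to be checked against the capi.h actually in use. `declare_prototypes`, `recognize` and the `extract` callback are hypothetical names for this example.

```python
import ctypes

HANDLE = ctypes.c_void_p  # opaque TessBaseAPI* or iterator pointer


def declare_prototypes(lib):
    """Attach the assumed C prototypes so 64-bit pointers are not truncated."""
    lib.TessBaseAPICreate.restype = HANDLE
    lib.TessBaseAPICreate.argtypes = []
    lib.TessBaseAPIInit3.restype = ctypes.c_int
    lib.TessBaseAPIInit3.argtypes = [HANDLE, ctypes.c_char_p, ctypes.c_char_p]
    lib.TessBaseAPISetPageSegMode.restype = None
    lib.TessBaseAPISetPageSegMode.argtypes = [HANDLE, ctypes.c_int]
    lib.TessBaseAPISetImage.restype = None
    lib.TessBaseAPISetImage.argtypes = [HANDLE, ctypes.c_char_p] + [ctypes.c_int] * 4
    lib.TessBaseAPIRecognize.restype = ctypes.c_int
    lib.TessBaseAPIRecognize.argtypes = [HANDLE, ctypes.c_void_p]
    lib.TessBaseAPIGetIterator.restype = HANDLE
    lib.TessBaseAPIGetIterator.argtypes = [HANDLE]
    lib.TessBaseAPIDelete.restype = None
    lib.TessBaseAPIDelete.argtypes = [HANDLE]


def recognize(lib, datapath, language, page_seg_mode, pixels, width, height,
              bytes_per_pixel, bytes_per_line, extract):
    """Run steps 1 to 6, hand the iterator to `extract`, then always delete."""
    handle = lib.TessBaseAPICreate()
    if not handle:
        raise RuntimeError("TessBaseAPICreate returned NULL")
    try:
        if lib.TessBaseAPIInit3(handle, datapath, language) != 0:
            raise RuntimeError("TessBaseAPIInit3 failed")
        lib.TessBaseAPISetPageSegMode(handle, page_seg_mode)
        lib.TessBaseAPISetImage(handle, pixels, width, height,
                                bytes_per_pixel, bytes_per_line)
        if lib.TessBaseAPIRecognize(handle, None) != 0:
            raise RuntimeError("TessBaseAPIRecognize failed")
        iterator = lib.TessBaseAPIGetIterator(handle)
        # Extract text while the engine handle is still alive.
        return extract(iterator)
    finally:
        lib.TessBaseAPIDelete(handle)
```

With a real build one would first load the library with `ctypes.CDLL`, call `declare_prototypes` on it, and pass the datapath and language as byte strings. The `try`/`finally` is there so that the final delete step runs even when an earlier step fails.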
python-tesseract for windows
http://python-tesseract.googlecode.com/files/python-tesseract-0.7.win32-py2.7.exe
Original comment by FreeT...@gmail.com
on 11 Apr 2012 at 6:56
A comment on TESSDLL_INCLUDE_BASEAPI and TESSDLL_INCLUDE_LEPTONICA:
TESSDLL_INCLUDE_BASEAPI:
Only define this, if you are using the C-API in C++.
If defined, all datatypes from the BaseAPI can be used. C and C++ API can be
mixed freely.
TESSDLL_INCLUDE_LEPTONICA:
Enables the use of the Leptonica datatypes.
Original comment by trop...@gmail.com
on 12 Apr 2012 at 9:14
Here is new version of the C Wrapper.
Changes:
- Use __cdecl instead of __stdcall, this seems to be more convenient.
- Includes all functions using Leptonica datatypes per default.
- Forward declaration of Leptonica datatypes instead of header file inclusion.
- Added missing "SetVariable" function
- Use array-delete (delete []) instead of scalar delete for strings and int
arrays.
Original comment by trop...@gmail.com
on 12 Apr 2012 at 1:13
Attachments:
I forgot to thank trop for the good work. Better late than never.
Thanks a lot.
Original comment by FreeT...@gmail.com
on 12 Apr 2012 at 4:41
Feedback anyone?
Are there any chances to integrate it into the main repo?
Any objections regarding the names or the coding style?
Original comment by trop...@gmail.com
on 16 Apr 2012 at 7:55
Troplin, will this C wrapper also work on Linux?
I'm developing a JNA wrapper based on this C API (http://tess4j.sf.net). I'm
close to releasing a beta once I figure out why recognizing a rectangle cuts
off some words at the right edge of the image.
Other than that, the capi.cpp/.h looks fine. IMHO, for it to be included in the
current baseline, it needs to be updated to Tesseract 3.02 API, which changed a
little bit from 3.01. Additionally, a short demo program (similar to dlltest in
2.04) to test the C API would be nice to have.
Original comment by nguyen...@gmail.com
on 16 Apr 2012 at 1:14
Great!
I can do the update to the current SVN, that's no problem. Also the demo
program if I find some free time. However it will be difficult to cover all
functionality, the new C API is much bigger than the old one.
Original comment by trop...@gmail.com
on 16 Apr 2012 at 2:56
Oh and yes, it should also work on Linux.
Original comment by trop...@gmail.com
on 16 Apr 2012 at 2:56
Nguyen,
you are using the 3.02 SVN version for your JNA wrapper. Can you recommend
that over the 3.01 release?
Using 3.02 SVN would be much easier for me, because there is already a
VS-Project for the DLL.
Original comment by trop...@gmail.com
on 24 Apr 2012 at 9:45
I only recently started working on the Java wrapper after you made the C
wrapper available. Naturally, I use the current version of Tesseract, which
happens to be 3.0.2 and which I know has undergone significant file
restructuring recently to make it more maintainable and easier to build. I also
understand that 3.0.2 was planned for release together with the upcoming
release of a popular Linux distro. Let's hope that the C API is also included.
Original comment by nguyen...@gmail.com
on 26 Apr 2012 at 12:00
Well, for me as an "external" person, it is natural to work with the current
stable version, which is 3.01.
Regarding the inclusion of the C API, I think just hoping is not enough.
Do you have some more concrete information about the release date?
Who is responsible for the decision? Maybe I can convince him/her.
Original comment by trop...@gmail.com
on 26 Apr 2012 at 8:26
Zdenko just asked the same question in the Forum.
http://groups.google.com/group/tesseract-ocr/t/ef8c6819fc5385f
Original comment by nguyen...@gmail.com
on 27 Apr 2012 at 4:45
So Ray Smith is the one?
BTW, I just ported the tesseractmain.cpp to C (using the C API), is that
sufficient as an example?
Original comment by trop...@gmail.com
on 27 Apr 2012 at 8:17
And again a new version of the C API.
Changes:
- Converted the TessBaseAPIProcessPage and TessBaseAPIProcessPages functions to
a pure C interface.
- Renamed TessBaseSetInputName to TessBaseAPISetInputName and
TessBaseSetOutputName to TessBaseAPISetOutputName
Original comment by trop...@gmail.com
on 27 Apr 2012 at 8:33
Attachments:
And here's the C sample. You need the latest capi.h/.cpp for that.
Original comment by trop...@gmail.com
on 27 Apr 2012 at 10:17
Attachments:
I *strongly* suggest you move this entire discussion over to a new thread on
the tesseract-dev group [1]. Not everyone follows all the issues here (and
while I read earlier entries in this particular issue the auto-email system
didn't seem to inform me of updates).
Creating an additional "official" API is a serious business. It would have to
be vetted by more than just the people who happen to follow this issue.
[1] http://groups.google.com/group/tesseract-dev
Original comment by tomp2...@gmail.com
on 27 Apr 2012 at 8:27
I improved a bit further by getting rid of references to STRING and FILE native
types to make the C API more amenable to other languages, such as Java, which
do not have direct equivalents.
I found that the raw image data type would still occasionally crash the
Tesseract engine. I want to try the approach of converting the raw image to Pix
to see if I can get more stability.
TessBaseAPISetImage(TessBaseAPI* handle, const unsigned char* imagedata,
                    int width, int height, int bytes_per_pixel, int bytes_per_line)
{
    //handle->SetImage(imagedata, width, height, bytes_per_pixel, bytes_per_line);
    Pix *pix = convert2Pix(imagedata);
    TessBaseAPISetImage2(handle, pix);
}
but I'm not sure how to implement the convert2Pix method or whether the rest of
the parameters are needed in the Pix conversion.
Original comment by nguyen...@gmail.com
on 5 May 2012 at 6:00
Attachments:
No promises but the code to convert to Pix might be found in
tesseract-android-tools.
http://code.google.com/p/tesseract-android-tools/source/browse/
Original comment by JerseyChewi@gmail.com
on 5 May 2012 at 7:04
Nguyen,
I can't think why the raw data should crash the Tesseract engine, but the fact
that you ask if those additional parameters are necessary makes me a bit
suspicious. Yes, they are necessary and it is not possible to convert the data
to PIX without them, because without them you cannot possibly know how to
interpret the data.
Did you really pass those correctly to the method in the cases that crash?
Original comment by trop...@gmail.com
on 6 May 2012 at 4:06
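Since wrongly passed size parameters are a plausible cause of such crashes, a caller could sanity-check them before handing the buffer to the engine. The sketch below is one way to do that in Python. It assumes that a `bytes_per_pixel` of 0 denotes a packed 1-bit-per-pixel image; `check_raw_image` is a hypothetical helper, not part of the C API.

```python
def check_raw_image(data: bytes, width: int, height: int,
                    bytes_per_pixel: int, bytes_per_line: int) -> None:
    """Raise ValueError if the buffer cannot hold the described image."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if bytes_per_pixel < 0:
        raise ValueError("bytes_per_pixel must not be negative")

    # Minimum number of bytes one row of pixels occupies.
    if bytes_per_pixel == 0:
        # Assumption: 0 means 1 bit per pixel, eight pixels packed per byte.
        min_line = (width + 7) // 8
    else:
        min_line = width * bytes_per_pixel

    if bytes_per_line < min_line:
        raise ValueError("bytes_per_line is too small for the given width")
    if len(data) < height * bytes_per_line:
        raise ValueError("buffer is shorter than height * bytes_per_line")
```

If either of the last two checks fails, native code that trusts the parameters would read past the end of the buffer.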
BTW, conversion from raw data to Pix should be straightforward:
TessBaseAPISetImage(TessBaseAPI* handle, const unsigned char* imagedata,
                    int width, int height, int bytes_per_pixel, int bytes_per_line)
{
    Pix pix;
    memset(&pix, 0, sizeof(pix)); // Set all members to 0
    pix.data = imagedata;
    pix.w = width;
    pix.h = height;
    pix.d = bytes_per_pixel == 0 ? 1 : bytes_per_pixel * 8;
    if (bytes_per_line % 4 != 0)
        return; // ERROR: Lines must be word-aligned
    pix.wpl = bytes_per_line / 4;
    TessBaseAPISetImage2(handle, pix);
}
(not tested!)
As you can see, all parameters have some equivalent field in the Pix data
structure. In the end, inside tesseract, the data from Pix is just converted
back to the "raw" format. There's no magic going on with Pix.
Original comment by trop...@gmail.com
on 7 May 2012 at 7:23
of course it should be:
TessBaseAPISetImage2(handle, &pix);
Original comment by trop...@gmail.com
on 7 May 2012 at 7:25
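The mapping from the raw-image parameters to the Pix fields in the snippet above can be restated in Python to make the arithmetic easy to check. This is only a mirror of that untested C code plus a helper for padding rows to whole 32-bit words; `pix_header_fields` and `padded_stride` are hypothetical names, and nothing here calls Leptonica.

```python
def padded_stride(width: int, bytes_per_pixel: int) -> int:
    """Smallest row length in bytes that is a multiple of 4 (one 32-bit word)."""
    if bytes_per_pixel == 0:
        row_bytes = (width + 7) // 8  # assumed: 0 means 1 bit per pixel
    else:
        row_bytes = width * bytes_per_pixel
    return (row_bytes + 3) // 4 * 4


def pix_header_fields(width: int, height: int,
                      bytes_per_pixel: int, bytes_per_line: int) -> dict:
    """Compute the w, h, d and wpl values the C snippet assigns."""
    if bytes_per_line % 4 != 0:
        raise ValueError("lines must be aligned to 32-bit words")
    depth = 1 if bytes_per_pixel == 0 else bytes_per_pixel * 8
    return {"w": width, "h": height, "d": depth, "wpl": bytes_per_line // 4}
```

A caller whose rows are not word-aligned would have to copy the image into a buffer with `padded_stride` bytes per row before the word-aligned check can pass.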
I tried but could not get the code to compile. Finally, the crashing stopped
when I switched the .traineddata file from 3.0 or 3.01 to one compatible with
Tess 3.02. Thanks for going through all the trouble. The last uploaded version
is still good and clean (not polluted with hacks). I hope the project owner will
soon accept it and incorporate it into the main trunk.
And lastly, how do I get the files compiled to generate libtesseract302.so on
Linux? Has anyone tried?
Original comment by nguyen...@gmail.com
on 19 Jun 2012 at 3:18
I'm currently trying to compile it on my Mac, but I haven't quite grokked the
makefile. Could you share yours with us? :)
Original comment by janpaulb...@googlemail.com
on 28 Jun 2012 at 10:59
Original issue reported on code.google.com by
nguyen...@gmail.com
on 26 Sep 2010 at 4:20