Closed jflesch closed 8 years ago
Have a look at the "old example" (for version 3.02) [1]. My tests with the C API (I am no expert at Python and ctypes ;-) ) can be found on pastebin [2] and [3]. There are also other projects [4] [5] that try to use the C API, but I am not sure about their status. At least you can check them for inspiration.
[1] https://github.com/tesseract-ocr/tesseract/blob/master/contrib/tesseract-c_api-demo.py
[2] http://pastebin.com/DhPUgrAj
[3] http://pastebin.com/yDTkNfNm
[4] https://code.google.com/p/python-tesseract/
[5] https://github.com/virtuald/python-tesseract-sip
Thanks for the resources :)
Just FYI, it seems python-tesseract and python-tesseract-sip actually use the C++ API. The sources of the first one seem really messy (there are .pyc and .zip files and even an "ez_setup (1).py" in their repository ...), and the second one requires an extra dependency (python-sip). I think I had already seen them before working on Pyocr.
Anyway, if there is a C API, I can do my own binding easily.
The C++ API has a function TessBaseAPI::ProcessPagesFileList() that can take a buffer as input. However, the C API doesn't provide any function to pass an in-memory image. It can only process on-disk files.
So basically, with the current API, either I use a temporary file + tesseract API, or I use stdin/stdout + fork/exec(tesseract) ...
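For reference, the temporary-file variant could look roughly like the sketch below. It is only an illustration, not Pyocr's actual code: the helper name `ocr_via_temp_file` and its `run` parameter are made up for this example, and it assumes the installed tesseract accepts `stdout` as the output base.

```python
import os
import subprocess
import tempfile


def ocr_via_temp_file(image_bytes, lang="eng", suffix=".png",
                      run=subprocess.check_output):
    """Write image bytes to a temporary file and run the tesseract CLI on it.

    `run` is a hypothetical hook (defaults to subprocess.check_output) so the
    command can be swapped out or inspected.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(image_bytes)
        # Assumption: "stdout" as the output base makes tesseract print the
        # recognized text instead of writing <base>.txt.
        output = run(["tesseract", path, "stdout", "-l", lang])
        return output.decode("utf-8")
    finally:
        # The temporary file is removed whether or not OCR succeeded.
        os.remove(path)
```

The cost of this approach is one extra disk write plus one process spawn per image, which is what an in-memory binding would avoid.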
There should be an option to pass a PIX structure to the C API: see my test (from several years ago ;-) ) with tesseract.TessBaseAPISetImage2(api, pix_image).
Also, you can benefit from the PIX structure: you can use leptonica to improve the scanned image (e.g. orientation detection, dewarping, deskewing ...).
The easiest way to create a PIX structure in Python should be to open the file with leptonica. The other option would be to convert your Python image data (Pillow?) to PIX.
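A minimal sketch of the first option (let leptonica open the file, then hand the PIX to tesseract), assuming `tess` and `lept` are ctypes handles for libtesseract and liblept loaded by the caller. The function name `ocr_file` is made up for this example; the C functions it calls are the ones declared in tesseract/capi.h and leptonica.

```python
import ctypes


def ocr_file(tess, lept, image_path, lang="eng", datapath=None):
    """OCR an on-disk image through the C API and return the text."""
    # Pointer return types must be declared, otherwise ctypes assumes int
    # and truncates addresses on 64-bit systems.
    tess.TessBaseAPICreate.restype = ctypes.c_void_p
    tess.TessBaseAPIGetUTF8Text.restype = ctypes.c_void_p
    lept.pixRead.restype = ctypes.c_void_p

    api = ctypes.c_void_p(tess.TessBaseAPICreate())
    try:
        # datapath=None is passed as NULL (tesseract's default data lookup).
        if tess.TessBaseAPIInit3(api, datapath, lang.encode("utf-8")) != 0:
            raise RuntimeError("could not initialize tesseract")
        pix = ctypes.c_void_p(lept.pixRead(image_path.encode("utf-8")))
        if not pix.value:
            raise IOError("leptonica could not read %s" % image_path)
        try:
            tess.TessBaseAPISetImage2(api, pix)
            text_ptr = tess.TessBaseAPIGetUTF8Text(api)
            if not text_ptr:
                raise RuntimeError("tesseract returned no text")
            try:
                return ctypes.string_at(text_ptr).decode("utf-8")
            finally:
                # The returned string is owned by the caller.
                tess.TessDeleteText(ctypes.c_void_p(text_ptr))
        finally:
            lept.pixDestroy(ctypes.byref(pix))
    finally:
        tess.TessBaseAPIEnd(api)
        tess.TessBaseAPIDelete(api)
```

The library handles are parameters rather than globals so that the caller decides which shared objects to load (names differ between platforms and versions).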
Yep sorry. I saw struct Pix; in tesseract/capi.h and I didn't realize it actually comes from another library.
No problem. I just found out that you can easily create a PIX:
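def pil2PIX(im, leptonica):
    """Convert a PIL/Pillow image to a 32 bpp leptonica PIX."""
    # NOTE: on 64-bit systems, declare the pointer restype/argtypes of the
    # leptonica functions as ctypes.c_void_p first, or pointers get truncated.
    # Always hand leptonica 32-bit RGBA data, whatever the source mode is.
    im_rgba = im.convert("RGBA")
    depth = 32
    pixs = leptonica.pixCreate(im.size[0], im.size[1], depth)
    # tobytes() replaces tostring(), which newer Pillow releases removed.
    data = im_rgba.tobytes("raw", "RGBA")
    leptonica.pixSetData(pixs, data)
    # Copy the resolution if the image carries it under either key.
    for key in ("resolution", "dpi"):
        try:
            resolutionX = im.info[key][0]
            resolutionY = im.info[key][1]
            # pixSetResolution expects integers; Pillow may report floats.
            leptonica.pixSetResolution(pixs, int(resolutionX), int(resolutionY))
        except KeyError:
            pass
    # Pillow emits bytes in R, G, B, A order, while leptonica stores pixels as
    # 32-bit words, so return a byte-swapped copy.
    return leptonica.pixEndianByteSwapNew(pixs)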
where im is a PIL/Pillow image and leptonica = ctypes.cdll.LoadLibrary('liblept.so')
Almost there. The only remaining problem I have left is to get the empty lines (I'm looking at TessBaseAPI::GetHOCRText() for reference). Then it's cleanup time! :)
And done ! :-) Next, I just have to release Pyocr 0.4.0
As suggested by zdenop on https://github.com/tesseract-ocr/tesseract/issues/85#issuecomment-139132765 , using the C API could have many advantages:
To check: thread safety.
https://github.com/tesseract-ocr/tesseract/wiki/APIExample#c-api-in-python