openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Tesseract C API #30

Closed jflesch closed 8 years ago

jflesch commented 9 years ago

As suggested by zdenop on https://github.com/tesseract-ocr/tesseract/issues/85#issuecomment-139132765 , using the C API could have many advantages:

To check: thread safety.

https://github.com/tesseract-ocr/tesseract/wiki/APIExample#c-api-in-python

zdenop commented 9 years ago

Have a look at the "old example" (for version 3.02) [1]. My tests with the C API (I am not an expert at Python and ctypes ;-) ) can be found on pastebin [2] and [3]. There were also other projects [4] [5] that tried to use the C API, but I am not sure about their status. At least you can check them for inspiration.

[1] https://github.com/tesseract-ocr/tesseract/blob/master/contrib/tesseract-c_api-demo.py
[2] http://pastebin.com/DhPUgrAj
[3] http://pastebin.com/yDTkNfNm
[4] https://code.google.com/p/python-tesseract/
[5] https://github.com/virtuald/python-tesseract-sip

jflesch commented 9 years ago

Thanks for the resources :)

Just FYI, it seems python-tesseract and python-tesseract-sip actually use the C++ API. The sources of the first one seem really messy (there are .pyc and .zip files and even an "ez_setup (1).py" in their repository ...), and the second one requires an extra dependency (python-sip). I think I had already seen them before I started working on Pyocr.

Anyway, if there is a C API, I can do my own binding easily.

jflesch commented 8 years ago

The C++ API has a function TessBaseAPI::ProcessPagesFileList() that can take a buffer as input. However, the C API doesn't provide any function to pass an in-memory image. It can only process on-disk files.

So basically, with the current API, either I use a temporary file + tesseract API, or I use stdin/stdout + fork/exec(tesseract) ...
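
For illustration, the stdin/stdout option could look roughly like the sketch below. It assumes the installed tesseract binary accepts the special names "stdin" and "stdout" in place of file paths. The helper names build_tesseract_command and ocr_via_subprocess are made up for this example and are not part of Pyocr.

```python
import subprocess


def build_tesseract_command(lang="eng", binary="tesseract"):
    # Assumption: this tesseract build treats "stdin" and "stdout" as
    # special names meaning "read the image from standard input" and
    # "write the recognized text to standard output".
    return [binary, "stdin", "stdout", "-l", lang]


def ocr_via_subprocess(image_bytes, lang="eng"):
    # Spawn tesseract (fork/exec under the hood), feed it the encoded
    # image (PNG, TIFF, ...) on stdin and collect the text from stdout.
    result = subprocess.run(
        build_tesseract_command(lang),
        input=image_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.decode("utf-8")
```

The cost of this approach is one process spawn per page, which is part of what a direct binding would avoid.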

jflesch commented 8 years ago

Ticket opened for tesseract-ocr

zdenop commented 8 years ago

There should be an option to pass a PIX structure to the C API: see my test (from several years ago ;-) ) with tesseract.TessBaseAPISetImage2(api, pix_image).

Also you can benefit from PIX structure, so you can use leptonica for improving scanned image (e.g. for orientation detection, dewarping, deskewing ... )

The easiest way to create a PIX structure in Python should be to open the file with leptonica. Another option would be to convert your Python image data (Pillow?) to a PIX.
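
A minimal ctypes sketch of the "open the file with leptonica" route is shown below. The library file names "liblept.so" and "libtesseract.so" are assumptions and differ between systems. The helpers declare and ocr_file are hypothetical names for this example. The C functions themselves (pixRead, pixDestroy, TessBaseAPICreate, TessBaseAPIInit3, TessBaseAPISetImage2, TessBaseAPIGetUTF8Text, TessDeleteText, TessBaseAPIEnd, TessBaseAPIDelete) come from the leptonica and tesseract C headers.

```python
import ctypes
import os


def declare(lib, name, restype, argtypes):
    # Give lib.<name> explicit C types; without them ctypes assumes int,
    # which truncates pointers on 64-bit platforms.
    func = getattr(lib, name)
    func.restype = restype
    func.argtypes = argtypes
    return func


def ocr_file(path, lang="eng",
             lept_name="liblept.so", tess_name="libtesseract.so"):
    lept = ctypes.CDLL(lept_name)   # assumed library file names
    tess = ctypes.CDLL(tess_name)
    vp = ctypes.c_void_p

    pix_read = declare(lept, "pixRead", vp, [ctypes.c_char_p])
    pix_destroy = declare(lept, "pixDestroy", None, [ctypes.POINTER(vp)])
    create = declare(tess, "TessBaseAPICreate", vp, [])
    init = declare(tess, "TessBaseAPIInit3", ctypes.c_int,
                   [vp, ctypes.c_char_p, ctypes.c_char_p])
    set_image = declare(tess, "TessBaseAPISetImage2", None, [vp, vp])
    get_text = declare(tess, "TessBaseAPIGetUTF8Text", vp, [vp])
    delete_text = declare(tess, "TessDeleteText", None, [vp])
    end = declare(tess, "TessBaseAPIEnd", None, [vp])
    delete = declare(tess, "TessBaseAPIDelete", None, [vp])

    pix = pix_read(os.fsencode(path))   # NULL comes back as None
    if pix is None:
        raise IOError("leptonica could not read %r" % (path,))
    api = create()
    try:
        # A NULL datapath lets tesseract use its default tessdata location.
        if init(api, None, lang.encode("ascii")) != 0:
            raise RuntimeError("tesseract initialization failed")
        set_image(api, pix)             # hand the PIX straight to tesseract
        text_ptr = get_text(api)
        if text_ptr is None:
            return ""
        try:
            return ctypes.string_at(text_ptr).decode("utf-8")
        finally:
            delete_text(text_ptr)       # the returned string is owned by us
    finally:
        end(api)
        delete(api)
        pix_destroy(ctypes.byref(vp(pix)))
```

Calling ocr_file("page.png") would then return the recognized text without spawning a tesseract process.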

jflesch commented 8 years ago

Yep sorry. I saw struct Pix; in tesseract/capi.h and I didn't realize it is actually coming from another library.

zdenop commented 8 years ago

No problem. I just found out that you can easily create a PIX from a PIL image:

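def pil2PIX(im, leptonica):
    # Assumes pointer restype/argtypes are already declared on the leptonica
    # functions used here; ctypes otherwise treats pointers as C ints.
    # Convert every mode ("RGB", "L", "P", ...) to 32-bit RGBA.
    rgba = im.convert("RGBA")
    depth = 32
    pixs = leptonica.pixCreate(rgba.size[0], rgba.size[1], depth)
    # tobytes() replaces tostring(), which current Pillow releases no
    # longer support.
    data = rgba.tobytes("raw", "RGBA")
    leptonica.pixSetData(pixs, data)

    # The resolution is stored under 'resolution' or 'dpi', depending on
    # the file format; if both are present, 'dpi' is applied last.
    for key in ('resolution', 'dpi'):
        try:
            resolutionX = int(im.info[key][0])
            resolutionY = int(im.info[key][1])
            leptonica.pixSetResolution(pixs, resolutionX, resolutionY)
        except KeyError:
            pass

    # Pillow emits bytes in R, G, B, A order; on little-endian machines the
    # byte swap turns them into leptonica's 32-bit word layout.
    return leptonica.pixEndianByteSwapNew(pixs)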

where im is a PIL/Pillow image and leptonica = ctypes.cdll.LoadLibrary('liblept.so')
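
The reason for the final pixEndianByteSwapNew call can be checked in pure Python. Leptonica keeps a 32 bpp pixel as one 32-bit word with red in the most significant byte, while Pillow's "raw", "RGBA" output is a plain R, G, B, A byte sequence. The small sketch below (rgba_bytes_as_words is a made-up helper name) shows what a little-endian CPU sees before and after swapping the bytes of each word:

```python
import struct


def rgba_bytes_as_words(r, g, b, a):
    raw = bytes([r, g, b, a])                 # Pillow's byte order
    (as_le_word,) = struct.unpack("<I", raw)  # word a little-endian CPU reads
    (swapped,) = struct.unpack(">I", raw)     # same word after a byte swap
    return as_le_word, swapped


before, after = rgba_bytes_as_words(0x11, 0x22, 0x33, 0x44)
print(hex(before))  # 0x44332211: alpha ends up in the top byte
print(hex(after))   # 0x11223344: red in the top byte, as leptonica expects
```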

jflesch commented 8 years ago

Almost there. The only problem I have left is getting the empty lines (I'm looking at TessBaseAPI::GetHOCRText() for reference). Then it's cleanup time! :)

jflesch commented 8 years ago

And done ! :-) Next, I just have to release Pyocr 0.4.0