Tesseract C API : Digits-only (DigitBuilder)

bnguyenvanyen commented 8 years ago

Hello, I'd like to use tesseract on numeric-only input, but as it stands this is only possible with the tesseract command-line tool and its DigitBuilder, since commit f36f249217a9f28b3a67824a9e35ff9f0cbe59be

However, this looks easy enough to implement with the C API too, with a new function in libtesseract/tesseract_raw.py :

def set_numeric_only(handle) :
    global g_libtesseract
    assert(g_libtesseract)

    g_libtesseract.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        b"classify_bln_numeric_mode",
        b"1"
    )
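
A slightly more defensive variant of the same idea, as a sketch only: TessBaseAPISetVariable returns a C BOOL, so the result can be checked instead of ignored. Passing the library object in as `lib` is an assumption made here to keep the example self-contained; it is not how tesseract_raw.py is organized.

```python
import ctypes


def set_numeric_only(lib, handle):
    """Switch a Tesseract handle to numeric mode and report failure.

    `lib` is assumed to be the loaded libtesseract (a ctypes.CDLL);
    passing it explicitly is an illustration choice, not pyocr's layout.
    """
    # TessBaseAPISetVariable returns non-zero if the variable name was
    # recognized, zero otherwise.
    found = lib.TessBaseAPISetVariable(
        ctypes.c_void_p(handle),
        b"classify_bln_numeric_mode",
        b"1",
    )
    if not found:
        raise RuntimeError("libtesseract rejected classify_bln_numeric_mode")
```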

The most conservative way would be to use it in a new builder subclass in libtesseract/__init__.py , in the same way as for tesseract.py.

But I think it might be better to move this to image_to_string both in libtesseract/__init__.py and tesseract.py, with a new option, like it's done for choosing the language, since from what I understand builders should be more for choosing the format of the output.

I am not too familiar with github, ctypes, or pyocr, so sorry if I'm misunderstanding the code or doing something wrong.

Thank you for your work on this package, Regards

PS : It looks like the C API also offers ways to get confidence scores for words, which might be interesting to expose through a Builder.

jflesch commented 8 years ago

The most conservative way would be to use it in a new builder subclass in libtesseract/__init__.py , in the same way as for tesseract.py.

The most consistent too :)

from what I understand builders should be more for choosing the format of the output

Actually, a builder would be just fine.

If things were done perfectly, I think the language option should actually be an option of some of the builders. However, this is a holdover from a long time ago, before I started implementing builder objects, so yeah, language is an option of image_to_string(). I guess I will have to change it at some point (but I have to be careful not to break programs already using PyOCR).

sorry if I'm misunderstanding the code or doing something wrong.

Actually, you took the time to understand the code and ask questions, which, from my point of view, is awesome :)

Do you feel like implementing this builder ? :-)

bnguyenvanyen commented 8 years ago

Hi,

Ok I'm giving it a shot. Since the DigitBuilder in tesseract.py is a subclass of TextBuilder but I need a LineBoxBuilder behaviour, this would mean several builders for consistency.

Do you think it makes sense to instead put in a num_mode option for all builders in __init__ which is then handled in image_to_string and raises a NotImplementedError if the tool (Cuneiform) can't handle it ?
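
A rough sketch of this alternative, to make the question concrete. Everything here is hypothetical: `num_mode`, the stand-in `TextBuilder`, and the two helper functions are illustration names, not existing pyocr code. The only borrowed detail is "digits", the config name the tesseract command line accepts.

```python
class TextBuilder(object):
    """Stand-in builder; num_mode is the hypothetical new option."""

    def __init__(self, num_mode=False):
        self.num_mode = num_mode


def tesseract_extra_args(builder):
    """Extra command-line configs a tesseract-like tool would add."""
    return ["digits"] if builder.num_mode else []


def cuneiform_extra_args(builder):
    """A tool without a digits-only mode refuses the option loudly."""
    if builder.num_mode:
        raise NotImplementedError("num_mode is not supported by Cuneiform")
    return []
```

With this layout each tool decides for itself whether it can honor the flag, and callers get an explicit error rather than silently unfiltered text.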

Also, I'm on Debian stable (jessie), where the latest libtesseract version is 3.03, with no backport for 3.04. I have seen no problem with it, but pyocr doesn't accept it. So I'm making 3.03 valid, and you can revert the change if you think it's unneeded.

Cheers

jflesch commented 8 years ago

Since the DigitBuilder in tesseract.py is a subclass of TextBuilder but I need a LineBoxBuilder behaviour, this would mean several builders for consistency.

If you make it inherit from LineBoxBuilder (so that it returns boxes, including lines and words), it should be called DigitLineBoxBuilder, not DigitBuilder. For me, a DigitBuilder will return a string with only digits, not boxes.

Do you think it makes sense to instead put in a num_mode option for all builders

It could make sense too. But since we started with DigitBuilder, I think we should keep going with it.

If we want to keep things clean, here is what I think would be best:

Move DigitBuilder to src/pyocr/builders.py, and make it work with libtesseract too. With Cuneiform, either it just works as a TextBuilder or it raises an exception. Both behaviors are acceptable to me.
Create a DigitLineBoxBuilder in src/pyocr/builders.py, and make it work with tesseract and libtesseract too. With Cuneiform, either it just works as a LineBoxBuilder, or it raises an exception. Again, both behaviors look fine to me.
For backward compatibility, tesseract.py will need 'from pyocr.builders import DigitBuilder'.
Please try to add tests. Since you're running Debian stable like me, tests should run fine (see run_tests.py; there is just one test with French OCR that passes only intermittently and is driving me crazy), so you can add new ones easily. They must work with Python 2.7 and Python 3.x.
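
The layout described above could look roughly like this. It is a minimal sketch under stated assumptions: `TextBuilder` and `LineBoxBuilder` are simplified stand-ins for the real classes in builders.py, and the `tesseract_configs` attribute and default layout values are assumptions for illustration, not a copy of the actual code.

```python
class TextBuilder(object):
    """Stand-in for the plain-text builder (output is a string)."""

    def __init__(self, tesseract_layout=3):
        self.tesseract_layout = tesseract_layout
        self.tesseract_configs = []


class LineBoxBuilder(object):
    """Stand-in for the box builder (output is lines and word boxes)."""

    def __init__(self, tesseract_layout=1):
        self.tesseract_layout = tesseract_layout
        self.tesseract_configs = []


class DigitBuilder(TextBuilder):
    """Digits only, returned as a string."""

    def __init__(self, tesseract_layout=3):
        # Python 2.7 compatible super() call.
        super(DigitBuilder, self).__init__(tesseract_layout)
        self.tesseract_configs.append("digits")


class DigitLineBoxBuilder(LineBoxBuilder):
    """Digits only, returned as line and word boxes."""

    def __init__(self, tesseract_layout=1):
        super(DigitLineBoxBuilder, self).__init__(tesseract_layout)
        self.tesseract_configs.append("digits")
```

Keeping the two digit builders as thin subclasses means the output format still comes from the parent class, and the name tells the caller which format to expect.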

I can take care of the documentation in the README if you want.

jflesch commented 8 years ago

Or maybe a simpler option just for now:

Just add a DigitLineBoxBuilder in libtesseract only. It can be moved later when someone needs it with tesseract. (It still needs tests ;)

jflesch commented 7 years ago

Same here: this bug has been fixed by your latest pull request, if I remember correctly?

bnguyenvanyen commented 7 years ago

It is indeed.

openpaperwork / pyocr

Tesseract C API : Digits-only (DigitBuilder) #47