Closed bnguyenvanyen closed 7 years ago
The most conservative way would be to use it in a new builder subclass in libtesseract/__init__.py, in the same way as for tesseract.py.
The most consistent too :)
from what I understand builders should be more for choosing the format of the output
Actually, a builder would be just fine.
If things were done perfectly, I think the language option should actually be an option of some of the builders. However, this is a holdover from a long time ago, before I started implementing builder objects. So yeah, language is an option of image_to_string(). I guess I will have to change it at some point (but I have to be careful not to break programs already using PyOCR).
sorry if I'm misunderstanding the code or doing something wrong.
Actually, you took the time to understand the code and ask questions, which, from my point of view, is awesome :)
Do you feel like implementing this builder ? :-)
Hi,
OK, I'm giving it a shot. Since the DigitBuilder in tesseract.py is a subclass of TextBuilder but I need a LineBoxBuilder behaviour, this would mean several builders for consistency.
Do you think it makes sense to instead put in a num_mode option for all builders in __init__, which is then handled in image_to_string and raises a NotImplementedError if the tool (Cuneiform) can't handle it?
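A minimal sketch of that num_mode idea, to make the proposal concrete. The classes below are hypothetical stand-ins, not PyOCR's real builders or tools, and the supports_num_mode attribute is an assumption made only for this illustration:

```python
class BaseBuilder(object):
    """Hypothetical stand-in for a builder; not the real PyOCR class."""

    def __init__(self, num_mode=False):
        # num_mode is the proposed option: restrict recognition to digits.
        self.num_mode = num_mode


class FakeTool(object):
    """Hypothetical OCR tool wrapper that can handle digit-only mode."""

    supports_num_mode = True  # assumed attribute, for illustration only

    def image_to_string(self, image, builder):
        # Refuse early if the builder asks for something the tool cannot do.
        if builder.num_mode and not self.supports_num_mode:
            raise NotImplementedError(
                "%s cannot restrict recognition to digits"
                % type(self).__name__
            )
        mode = "digits" if builder.num_mode else "text"
        return "recognized %s as %s" % (image, mode)


class CuneiformLikeTool(FakeTool):
    """Hypothetical tool without a digit-only mode."""

    supports_num_mode = False
```

With this layout the output format stays the builder's job, while each tool decides whether it can honour num_mode.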
Also, I'm on Debian stable (jessie) and the latest libtesseract version there is 3.03, with no backport for 3.04. I have seen no problem with it, but pyocr doesn't like it. So I'm making 3.03 valid, and you can revert the change if you think it's unneeded.
Cheers
Since the DigitBuilder in tesseract.py is a subclass of TextBuilder but I need a LineBoxBuilder behaviour, this would mean several builders for consistency.
If you make it inherit from LineBoxBuilder (--> it returns boxes, including lines and words), it should be called DigitLineBoxBuilder, not DigitBuilder. For me, a DigitBuilder will return a string with only digits, not boxes.
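To illustrate the naming distinction, here is a small sketch using simplified stand-in classes. These are not the real PyOCR builders: the build() method, the (text, box) input format, and the tesseract_configs list with its "digits" entry are assumptions made for the example.

```python
class TextBuilder(object):
    """Stand-in: the output is a plain string."""

    def __init__(self):
        self.tesseract_configs = []

    def build(self, words):
        # words is a list of (text, box) pairs; boxes are dropped here.
        return " ".join(text for text, _box in words)


class LineBoxBuilder(object):
    """Stand-in: the output keeps the boxes."""

    def __init__(self):
        self.tesseract_configs = []

    def build(self, words):
        return list(words)


class DigitBuilder(TextBuilder):
    """Digits only, returned as a string."""

    def __init__(self):
        super(DigitBuilder, self).__init__()
        self.tesseract_configs.append("digits")


class DigitLineBoxBuilder(LineBoxBuilder):
    """Digits only, returned as boxes."""

    def __init__(self):
        super(DigitLineBoxBuilder, self).__init__()
        self.tesseract_configs.append("digits")
```

Both digit builders ask for the same recognition mode; only the parent class, and therefore the shape of the result, differs.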
Do you think it makes sense to instead put in a num_mode option for all builders
It could make sense too. But since we started with DigitBuilder, I think we should keep going with it.
If we want to keep things clean, here is what I think would be best:
run_tests.py; there is just one test with French OCR that only works randomly and is driving me crazy), so you can add new ones easily. They must work with Python 2.7 and Python 3.x.
I can take care of the documentation in the README if you want.
Or maybe a simpler option just for now:
Just add a DigitLineBoxBuilder in libtesseract only. It can be moved later when someone needs it with tesseract. (It still needs tests ;)
Likewise, this bug has been fixed by your latest pull request, if I remember correctly?
It is indeed.
Hello, I'd like to use tesseract with a numerical input, but as it is this is only possible with the tesseract command line tool and its DigitBuilder, since f36f249217a9f28b3a67824a9e35ff9f0cbe59be.
However, this looks easy enough to implement with the C API too, with a new function in libtesseract/tesseract_raw.py:
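A rough sketch of what such a function could look like, not necessarily the one originally proposed. It assumes lib is a ctypes handle to libtesseract and handle is an already initialized TessBaseAPI pointer. The function name set_digits_only is made up for this example, and using the tessedit_char_whitelist variable is one possible choice among others:

```python
import ctypes

DIGITS = b"0123456789"


def set_digits_only(lib, handle):
    """Ask Tesseract to recognize digits only.

    lib: a ctypes.CDLL for libtesseract (assumed to be loaded elsewhere).
    handle: an initialized TessBaseAPI pointer (assumed).
    Returns True if Tesseract accepted the variable.
    """
    func = lib.TessBaseAPISetVariable
    # C signature: BOOL TessBaseAPISetVariable(TessBaseAPI*, const char*, const char*)
    func.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p]
    func.restype = ctypes.c_int
    return bool(func(handle, b"tessedit_char_whitelist", DIGITS))
```

It would be called after the API has been initialized and before recognition is run on the image.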
The most conservative way would be to use it in a new builder subclass in libtesseract/__init__.py, in the same way as for tesseract.py.
But I think it might be better to move this to image_to_string both in libtesseract/__init__.py and tesseract.py, with a new option, like it's done for choosing the language, since from what I understand builders should be more for choosing the format of the output.
I am not too familiar with GitHub, ctypes, or pyocr, so sorry if I'm misunderstanding the code or doing something wrong.
Thank you for your work on this package, Regards
PS: It looks like the C API also offers ways of getting confidence scores for words, which might be interesting to expose through a Builder.