openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Support for digit-only OCR #17

Closed: torre76 closed this issue 8 years ago

torre76 commented 10 years ago

Hello,

I am using your library together with Tesseract to recognize digit-only images. At first, Tesseract had trouble with some digits (for example, "0" was read as "D"), until I noticed that Tesseract has a parameter that tells it the image contains only digits. With it, recognition is nearly perfect (99%).

To activate this feature (that is, to add digits to the Tesseract command line), I subclassed TextBuilder this way:


class DigitBuilder(TextBuilder):
            """
                Specialization for Tesseract to use Digit Only recognition
            """

            def __init__(self, tesseract_layout=3):
                self.tesseract_configs = ["-psm", str(tesseract_layout), "digits"]
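
To make the idea concrete, here is a minimal, self-contained sketch of how such a builder's configs could end up on the Tesseract command line. It does not import pyocr: the TextBuilder below is a simplified stand-in, and build_command is a hypothetical helper, not pyocr's actual code. It also assumes the older "-psm" flag used above (newer Tesseract releases spell it "--psm").

```python
# Simplified stand-in for pyocr's TextBuilder (assumption, not the real class),
# so that this sketch runs without pyocr installed.
class TextBuilder:
    def __init__(self, tesseract_layout=3):
        self.tesseract_configs = ["-psm", str(tesseract_layout)]


class DigitBuilder(TextBuilder):
    """TextBuilder variant that also passes Tesseract's "digits" config."""

    def __init__(self, tesseract_layout=3):
        super().__init__(tesseract_layout)
        # "digits" is a Tesseract config file name, appended after the options.
        self.tesseract_configs = self.tesseract_configs + ["digits"]


def build_command(builder, input_path, output_base):
    """Hypothetical helper: assemble the Tesseract argument list for a builder."""
    return ["tesseract", input_path, output_base] + builder.tesseract_configs


# Placeholder file names, for illustration only.
print(" ".join(build_command(DigitBuilder(), "digits.png", "out")))
```

The only difference from a plain TextBuilder is the trailing "digits" entry, so the rest of the recognition flow would stay unchanged.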

I would like to submit a pull request for this builder, but I do not know how you manage the builders or whether Cuneiform has a similar feature.

If you can give me some hints, I will gladly help with this useful project. Regards

jflesch commented 10 years ago

(sorry for the late reply)

There is already a builder specific to Tesseract: CharBoxBuilder. It's in tesseract.py. For now, you can just add yours below this one. Just send me a pull request and I will integrate it (if I don't see any problem with it, of course). If at some point I figure out a way to do the same thing with Cuneiform, I will move it to builders.py.

Based on the snippet you gave here, I suggest you be careful regarding indentation (PEP8 says 4 spaces, not 12 :). If possible, please run the 'pep8' program on the files you modify before submitting your changes.

torre76 commented 10 years ago

Hi,

I will certainly make a pull request for that. Would you prefer a separate branch that you could merge after some checks?

Don't worry about PEP8: I use PyCharm, which is PEP8-compliant. The snippet I gave you was roughly copy-pasted and adapted. I will certainly run the pep8 command before pushing ;).

jflesch commented 10 years ago

Nah, no need for a separate branch (unless you want one). I will do the checks/review before accepting your pull request anyway :)

jflesch commented 8 years ago

f36f2492