openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

preserve_interword_spaces in tesseract #84

Open anilnaik1988 opened 7 years ago

anilnaik1988 commented 7 years ago

Hi team, I am currently using pyocr with Tesseract 3.05.01, and I get the Tesseract tool through pyocr.get_available_tools(). Is there any way to set preserve_interword_spaces for Tesseract through pyocr?

jflesch commented 7 years ago

Assuming you're using pyocr.tesseract and not pyocr.libtesseract, then yes, you can: make your own builder. See DigitBuilder and the other builders for reference. My suggestion: inherit from TextBuilder and, in the constructor, just after calling TextBuilder's constructor, set self.tesseract_flags and self.tesseract_configs as you need. Then pass your new builder to pyocr.tesseract.image_to_string() (aka pyocr.get_available_tools()[0].image_to_string()), and you should get the expected result.
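
For illustration, the suggestion above could look roughly like the sketch below. It is not taken from pyocr itself. The names PreserveSpacesMixin, SpacedTextBuilder and ocr_preserving_spaces are made up for this example. It also assumes that pyocr.tesseract hands the builder's tesseract_flags to the tesseract command line unchanged, so that the usual "-c name=value" syntax for config variables applies. The flags are added in a small mixin so that the same logic can be combined with TextBuilder or another builder.

```python
# Flag pair in the tesseract command-line form "-c name=value".
PRESERVE_SPACES_FLAGS = ["-c", "preserve_interword_spaces=1"]


class PreserveSpacesMixin:
    """Hypothetical helper: add the flags once the parent builder is set up."""

    def __init__(self, *args, **kwargs):
        # Let the real builder (e.g. TextBuilder) initialise its flags first.
        super().__init__(*args, **kwargs)
        # Build a new list rather than mutating the parent's list in place.
        self.tesseract_flags = list(self.tesseract_flags) + PRESERVE_SPACES_FLAGS


def ocr_preserving_spaces(image_path, lang="eng"):
    """Run OCR on image_path with the custom builder.

    Needs pyocr, Pillow and the tesseract binary; they are imported here
    so that the mixin above can be read and tried without them.
    """
    from PIL import Image
    import pyocr.builders
    import pyocr.tesseract

    # Inherit from TextBuilder, as suggested in the comment above.
    class SpacedTextBuilder(PreserveSpacesMixin, pyocr.builders.TextBuilder):
        pass

    return pyocr.tesseract.image_to_string(
        Image.open(image_path), lang=lang, builder=SpacedTextBuilder()
    )
```

Calling ocr_preserving_spaces("scan.png") (the file name is only a placeholder) should then return the recognised text. Whether the spacing is actually kept depends on the installed Tesseract version honouring that variable.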