openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Specify regions of text #67

Closed MrXu closed 7 years ago

MrXu commented 7 years ago

Hi, is there a way to specify the regions of text in pyocr? Currently, I am cropping out the text regions and giving them to pyocr one at a time. This helps avoid some inaccuracies in Tesseract's page-layout analysis, but it's very slow.

jflesch commented 7 years ago

Kind of. You can make your own builder object with the required Tesseract configuration, and pass it to tool.image_to_string(). See src/pyocr/builders.py:BaseBuilder for reference.

However, what is most likely to slow you down is actually not the cropping part, but running Tesseract on every chunk (fork() + exec() are slow). You may want to try with pyocr.libtesseract instead.
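A minimal sketch of how the two suggestions could be combined, under some assumptions: it still crops each region (it does not hand Tesseract a list of rectangles), it uses pyocr.builders.TextBuilder with tesseract_layout=6 (Tesseract's "single uniform block of text" page segmentation mode) so that page-layout analysis is skipped on each crop, and it creates a fresh builder per region in case the builder accumulates state between calls. The names ocr_regions, make_builder and demo are made up for illustration and are not part of pyocr.

```python
def ocr_regions(tool, image, boxes, make_builder, lang="eng"):
    """OCR each (left, upper, right, lower) box of a PIL image separately.

    `tool` is any pyocr tool module (e.g. pyocr.libtesseract).
    `make_builder` is a hypothetical zero-argument factory returning a new
    builder; a new one is made per region so no output carries over.
    """
    texts = []
    for box in boxes:
        region = image.crop(box)  # PIL.Image.Image.crop takes a 4-tuple
        builder = make_builder()
        texts.append(tool.image_to_string(region, lang=lang, builder=builder))
    return texts


def demo(path, boxes):
    """Illustrative usage; requires Pillow, pyocr and libtesseract."""
    # Imported here so ocr_regions() itself has no hard dependencies.
    from PIL import Image
    import pyocr.builders
    import pyocr.libtesseract

    def make_builder():
        # Assumption: layout 6 suits a pre-cropped block of text;
        # 7 (single text line) may fit one-line regions better.
        return pyocr.builders.TextBuilder(tesseract_layout=6)

    with Image.open(path) as image:
        return ocr_regions(pyocr.libtesseract, image, boxes, make_builder)
```

With pyocr.libtesseract as the tool, every region is recognized in-process, so the per-region fork() + exec() cost mentioned above should go away; whether layout 6 actually improves accuracy on your images is something to measure.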

MrXu commented 7 years ago

Hi @jflesch, thank you for the suggestion. Could you explain a bit more about "try with pyocr.libtesseract"? I found that the links for libtesseract and tesseract-ocr are the same in the readme.

jflesch commented 7 years ago

Instead of doing

import pyocr
tool = pyocr.get_available_tools()[0]

(which will use pyocr.tesseract)

Try:

import pyocr.libtesseract
tool = pyocr.libtesseract

MrXu commented 7 years ago

@jflesch sorry to reply to a closed issue. Are there any requirements on the OS or Tesseract version? I am getting an AssertionError from assert(g_libtesseract). As far as I know, libtesseract is the engine behind the tesseract-ocr project, so since I have built tesseract-ocr locally, it should be there, right?

jflesch commented 7 years ago

In Debian/Ubuntu, libtesseract is in a different package --> sudo apt install libtesseract3
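To catch a missing shared library before hitting the assertion, one option is to check availability first. This is a rough sketch, assuming each pyocr tool module exposes is_available() (the function pyocr.get_available_tools() relies on); first_available and pick_ocr_tool are hypothetical helper names, not pyocr functions.

```python
def first_available(tools):
    """Return the first tool whose is_available() is true, or None."""
    for tool in tools:
        if tool.is_available():
            return tool
    return None


def pick_ocr_tool():
    """Prefer the libtesseract binding, fall back to the tesseract executable."""
    # Imported here so first_available() works without pyocr installed.
    import pyocr.libtesseract
    import pyocr.tesseract

    tool = first_available([pyocr.libtesseract, pyocr.tesseract])
    if tool is None:
        raise RuntimeError(
            "Neither libtesseract nor the tesseract executable was found"
        )
    return tool
```

If pick_ocr_tool() ends up returning pyocr.tesseract, the shared library was not found in the places pyocr looks, which could also happen with a local build installed under a non-standard prefix.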