Closed MrXu closed 7 years ago
Kind of. You can make your own builder object with the required Tesseract configuration and pass it to tool.image_to_string(). See src/pyocr/builders.py:BaseBuilder for reference.
However, what is most likely to slow you down is not the cropping itself but launching a new Tesseract process for every chunk (fork() + exec() are slow). You may want to try pyocr.libtesseract instead.
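A minimal sketch of the region-by-region approach discussed here, assuming the text boxes are already known as (left, upper, right, lower) tuples. ocr_regions is a hypothetical helper, not part of pyocr; it relies only on a Pillow-style image.crop() and on pyocr's tool.image_to_string(image, lang=..., builder=...).

```python
def ocr_regions(tool, image, boxes, lang="eng", builder=None):
    """Run OCR on each box of `image` and return the results in order.

    `tool` is any object exposing pyocr's image_to_string() interface
    (for example the pyocr.libtesseract module); `image` is anything
    with a Pillow-style crop(box) method.
    """
    results = []
    for box in boxes:
        # box = (left, upper, right, lower), as expected by Image.crop()
        region = image.crop(box)
        results.append(tool.image_to_string(region, lang=lang, builder=builder))
    return results
```

With pyocr and Pillow installed, a call could look like ocr_regions(pyocr.libtesseract, Image.open("page.png"), boxes, builder=pyocr.builders.TextBuilder(tesseract_layout=7)). The file name is a placeholder, and layout 7 (treat each region as a single text line) is only an assumption about what the cropped regions contain.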
Hi @jflesch, thank you for the suggestion. Could you explain a bit more about "try with pyocr.libtesseract"? I found that the links for libtesseract and tesseract-ocr are the same in the readme.
Instead of doing
import pyocr
tool = pyocr.get_available_tools()[0]
(which will use pyocr.tesseract)
Try:
import pyocr.libtesseract
tool = pyocr.libtesseract
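If it is not certain that the library binding will load on every machine, one option is to fall back to the command-line wrapper. The sketch below assumes that both pyocr.libtesseract and pyocr.tesseract expose is_available(); pick_tool and default_tool are hypothetical helper names, not pyocr functions.

```python
def pick_tool(candidates):
    """Return the first candidate whose is_available() reports True."""
    for tool in candidates:
        if tool.is_available():
            return tool
    raise RuntimeError("no OCR tool available")


def default_tool():
    """Prefer the in-process binding, fall back to the tesseract executable."""
    # Imported lazily so this sketch can be loaded without pyocr installed.
    import pyocr.libtesseract
    import pyocr.tesseract
    return pick_tool([pyocr.libtesseract, pyocr.tesseract])
```

Listing pyocr.libtesseract first means the slower fork() + exec() path is only used when the shared library cannot be loaded.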
@jflesch sorry to reply to a closed issue. Are there any requirements on the OS or the Tesseract version? I am encountering an AssertionError for assert(g_libtesseract). As far as I know, libtesseract is the engine behind the tesseract-ocr project. Since I have built tesseract-ocr locally, it should be there, right?
On Debian/Ubuntu, libtesseract is in a separate package: sudo apt install libtesseract3
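A rough way to check whether a Tesseract shared library is visible to the dynamic loader at all, using only the standard library. This is a sketch: find_libtesseract is a hypothetical helper, and pyocr has its own lookup logic, so a hit here does not guarantee that pyocr.libtesseract will load it, and a locally built library outside the loader path may not be found.

```python
import ctypes.util


def find_libtesseract():
    """Return the loader's name or path for the tesseract library, or None."""
    return ctypes.util.find_library("tesseract")


if __name__ == "__main__":
    found = find_libtesseract()
    print(found if found else "libtesseract not found on the loader path")
```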
Hi, is there a way to specify the regions of text in pyocr? Currently, I am cropping out the text regions and giving them to pyocr one at a time. This helps avoid some inaccuracies in Tesseract's page-layout analysis, but it's very slow.