Closed Omnipresent closed 7 years ago
Hm, not sure it can be considered a Pyocr problem. To me it looks more like a Pillow (PIL.Image) limitation.
I believe the underlying problem is the Tesseract C API. I opened this bug, which was closed: https://github.com/tesseract-ocr/tesseract/issues/1138
Although the command line returns all the text, the C API only returns the last page (the first page in the case of pyocr).
Does pyocr use TextRenderer to return the text results from the image, or does it use TessBaseAPIGetUTF8Text?
When using Pyocr, the root problem is that the image has to be opened with Pillow (PIL.Image), and AFAIK, it doesn't support multipage TIFF files at all. If it did, Pyocr or you could simply send the pages one by one to Tesseract (shell or libtesseract).
PIL seems to support multipage tiffs
>>> tiffstack = Image.open('multipage.tiff')
>>> tiffstack.load()
<PixelAccess object at 0x7fc76bf1dab0>
>>> print(tiffstack.n_frames)
2
For the one-by-one approach, are you suggesting that PIL reads the multipage TIFF, and then we loop over each page and send each page to pyocr?
Oh wait, nevermind, it does support it
Yeah, I think the problem is the way Pyocr is calling the Tesseract C API. Perhaps we need to call a particular method in order to return the entire text.
Actually, since it does support it, calling Image.seek() can solve your problem easily
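import pyocr
import pyocr.builders
from PIL import Image

# `tool` and `lang` are assumed to be set up beforehand,
# e.g. tool = pyocr.get_available_tools()[0]
txt = ""
img = Image.open('multipage.tiff')
for frame in range(img.n_frames):
    img.seek(frame)  # select the page to OCR
    txt += tool.image_to_string(
        img,
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )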
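The loop above can also be wrapped in a small helper. This is only a sketch: `ocr_all_pages` and its `ocr_page` parameter are hypothetical names, not part of Pyocr, and the helper assumes only that the image exposes Pillow's `n_frames` and `seek()`. Unlike the snippet above, it joins pages with a separator so that text from consecutive pages does not run together.

```python
def ocr_all_pages(img, ocr_page, separator="\n\n"):
    """Run ocr_page on every frame of a multi-frame image and join the text.

    img      : object exposing Pillow's multi-frame interface (n_frames, seek)
    ocr_page : hypothetical callable taking the image and returning a string
    """
    # Fall back to a single frame if the object has no n_frames attribute.
    n_frames = getattr(img, "n_frames", 1)
    pages = []
    for frame in range(n_frames):
        img.seek(frame)  # make `frame` the current page
        pages.append(ocr_page(img))
    return separator.join(pages)
```

With Pyocr, `ocr_page` could be something like `lambda im: tool.image_to_string(im, lang=lang, builder=pyocr.builders.TextBuilder())`, with `img` coming from `Image.open('multipage.tiff')`.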
OK, great! That seems to work. I will try to integrate it into my app. As a side note, how does get_available_tools() in pyocr detect Tesseract? In my application, for some reason I can't execute the tesseract command line, but I do have /usr/local/lib/libtesseract.so.3.0.4. Will that be OK for pyocr?
how does get_available_tools() in pyocr detect Tesseract?
For the command line tool, it looks on your PATH (See https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/tesseract.py#L379 and https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/util.py#L25 ). For the library libtesseract, it tries to load one or two library names using the standard library loading mechanism ( See https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/libtesseract/tesseract_raw.py#L39 + http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html ).
So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4.
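The two lookups described above can be approximated with the standard library. This is a rough sketch of the idea, not Pyocr's actual code: the helper names are hypothetical, and the exact list of library names Pyocr tries is an assumption (the thread only confirms `libtesseract.so.3`).

```python
import ctypes
import shutil


def find_tesseract_cli(name="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def load_first_library(candidates):
    """Try each library name in turn; return (name, handle) or None."""
    for libname in candidates:
        try:
            # A bare name (no slash) is resolved by the dynamic linker,
            # so no specific directory is hard-coded here.
            return libname, ctypes.cdll.LoadLibrary(libname)
        except OSError:
            continue
    return None
```

Calling `load_first_library(["libtesseract.so.3"])` returns `None` when the dynamic linker cannot resolve that name, which is roughly the situation in which libtesseract would not be reported as available.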
I'm going to close this issue. If you still have problems with multipage TIFF, don't hesitate to comment here again, and I'll reopen it.
So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4
I'm discovering that libtesseract isn't found via get_available_tools() for me. I have /usr/local/lib/libtesseract.so.3.0.4. Do I also need a symbolic link /usr/local/lib/libtesseract.so.3 --> /usr/local/lib/libtesseract.so.3.0.4?
Is there any way I can override this setting?
My application is running on a PaaS and it might not be feasible for me to create a symlink. Is there a way to avoid a symlink? I already have the needed libtesseract.so.3.0.4
Actually, I have the symlink, but it is not in /usr/local/lib
vcap@63~$ ls -al /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3
lrwxrwxrwx 1 vcap vcap 21 Jan 7 2017 /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3 -> libtesseract.so.3.0.4
Is there a way to change its location in pyocr?
Again, when looking for libraries, Pyocr doesn't look in specific locations. It lets the dynamic linker do the job. You may want to have a look at the environment variable that defines where the dynamic linker looks for libraries: http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html
It does look for a specific library name, however. Currently, that name can't be changed without patching Pyocr.
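On Linux, the environment variable in question is typically LD_LIBRARY_PATH. As far as I know, the dynamic linker reads it when a process starts, so it generally has to be set before Python is launched rather than from inside the running script. A minimal sketch of building such an environment for a child process follows; `env_with_library_dir` is a hypothetical helper, and the directory is the one from this thread.

```python
import os


def env_with_library_dir(lib_dir, base_env=None):
    """Return a copy of the environment with lib_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    # Drop empty entries so we never produce a leading or trailing separator.
    parts = [p for p in current.split(os.pathsep) if p]
    if lib_dir not in parts:
        parts.insert(0, lib_dir)
    env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return env
```

The result can be passed as the `env` argument of `subprocess.run` when starting the application, for example with `env_with_library_dir('/home/vcap/app/.heroku/vendor/lib')`. On a PaaS, setting the variable in the platform's own configuration is probably the simpler route.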
Yeah, I was loading tesseract like that too (which worked):
libname = '/home/vcap/app/.heroku/vendor/lib/libtesseract.so.3'
self.tesseract = cdll.LoadLibrary(libname)
So I guess my dynamic linker must not be set up correctly. I'll take a look.
In a multipage tiff file, the results are returned only for the first page. This, however, works from the tesseract command line.
Here is an example of a multipage TIFF file: https://www.dropbox.com/s/qh72ec84su9zsj6/multipage.tiff?dl=0