In a multipage TIFF, results are returned only from the first page

openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab

https://gitlab.gnome.org/World/OpenPaperwork/pyocr


In a multipage TIFF, results are returned only from the first page #76

Closed Omnipresent closed 7 years ago

Omnipresent commented 7 years ago

In a multipage tiff file, the results are returned only for the first page. This, however, works from the tesseract command line.

Here is an example of a multipage TIFF file: https://www.dropbox.com/s/qh72ec84su9zsj6/multipage.tiff?dl=0

txt = tool.image_to_string(
    Image.open('multipage.tiff'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
print(txt)

shows

This is page one

jflesch commented 7 years ago

Hm, I'm not sure it can be considered a Pyocr problem. It looks to me more like a Pillow (PIL.Image) limitation.

Omnipresent commented 7 years ago

I believe the underlying problem is the Tesseract C API. I opened this bug, which was closed: https://github.com/tesseract-ocr/tesseract/issues/1138

Although the command line returns all the text, the C API only returns the last page (the first page in the case of pyocr).

Does pyocr use TextRenderer to return the text results from the image or does it use TessBaseAPIGetUTF8Text?

jflesch commented 7 years ago

When using Pyocr, the root problem is that the image has to be opened with Pillow (PIL.Image), and AFAIK, it doesn't support multipage TIFF files at all. If it did, Pyocr or you could simply send the pages one by one to Tesseract (shell or libtesseract).

Omnipresent commented 7 years ago

PIL seems to support multipage tiffs

 >>> tiffstack = Image.open('multipage.tiff')
 >>> tiffstack.load()
 <PixelAccess object at 0x7fc76bf1dab0>
 >>> print(tiffstack.n_frames)
 2

For the one-by-one approach, are you suggesting that PIL reads the multipage TIFF, and we then loop over the pages and send each one to pyocr?

jflesch commented 7 years ago

Oh wait, nevermind, it does support it

Omnipresent commented 7 years ago

Oh wait, nevermind, it does support it

Yeah, I think the problem is the way Pyocr calls the Tesseract C API. Perhaps we need to call a particular method in order to return the entire text.

jflesch commented 7 years ago

Actually, since it does support it, calling Image.seek() can solve your problem easily

jflesch commented 7 years ago

txt = ""
img = Image.open('multipage.tiff')
# OCR each page (frame) of the TIFF in turn and concatenate the text
for frame in range(img.n_frames):
    img.seek(frame)
    txt += tool.image_to_string(
        img,
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
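The same frame-by-frame idea can be sketched as a small reusable helper. This is only an illustration, not part of Pyocr: the names `ocr_all_frames`, `ocr_page`, and `separator` are hypothetical. It assumes the image object exposes `n_frames` and `seek()` as Pillow's multi-frame images do, and it takes the OCR call as a parameter so it does not depend on how `tool` was obtained.

```python
def ocr_all_frames(img, ocr_page, separator="\n"):
    """Run ocr_page on every frame of a multi-frame image and join the text.

    img      -- object exposing n_frames and seek(), e.g. a PIL image of a TIFF
    ocr_page -- callable taking the image (positioned on one frame), returning str
    """
    pages = []
    # Fall back to a single frame if the object has no n_frames attribute.
    for frame in range(getattr(img, "n_frames", 1)):
        img.seek(frame)  # position the image on the requested page
        pages.append(ocr_page(img))
    return separator.join(pages)
```

With Pyocr this might be called as `ocr_all_frames(Image.open('multipage.tiff'), lambda page: tool.image_to_string(page, lang=lang, builder=pyocr.builders.TextBuilder()))`. Joining with a separator keeps the last word of one page from running into the first word of the next.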

Omnipresent commented 7 years ago

OK, great! That seems to work. I will try to integrate it into my app. As a side note, how does get_available_tools() in pyocr detect Tesseract? In my application, for some reason I can't execute the tesseract command line, but I do have /usr/local/lib/libtesseract.so.3.0.4. Will that be OK for pyocr?

jflesch commented 7 years ago

how does get_available_tools() in pyocr detect Tesseract?

For the command line tool, it looks on your PATH (See https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/tesseract.py#L379 and https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/util.py#L25 ). For the library libtesseract, it tries to load one or two library names using the standard library loading mechanism ( See https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/libtesseract/tesseract_raw.py#L39 + http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html ).

So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4.
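To check outside Pyocr whether the standard loading mechanism can resolve the library, something like the following sketch may help. It is not Pyocr's own code: `find_loadable` is a hypothetical helper, and the default name `libtesseract.so.3` is taken from this thread (the exact names Pyocr tries are in the linked tesseract_raw.py and may differ by version).

```python
import ctypes


def find_loadable(names=("libtesseract.so.3",)):
    """Return the first name the dynamic linker can load, or None if none load."""
    for name in names:
        try:
            # A bare name (no slash) is resolved by the dynamic linker's
            # normal search path, not by a fixed directory.
            ctypes.cdll.LoadLibrary(name)
        except OSError:
            continue
        return name
    return None
```

If `find_loadable()` returns `None` in the environment where the application runs, the linker cannot see the library under that name, which would be consistent with get_available_tools() not reporting libtesseract.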

jflesch commented 7 years ago

I'm going to close this issue. If you still have problems with multipage TIFF, don't hesitate to comment here again, and I'll reopen it.

Omnipresent commented 7 years ago

So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4

I'm discovering that libtesseract isn't found via get_available_tools() for me. I have /usr/local/lib/libtesseract.so.3.0.4. Do I also need a symbolic link /usr/local/lib/libtesseract.so.3 --> /usr/local/lib/libtesseract.so.3.0.4?

Is there any way I can override this setting?

Omnipresent commented 7 years ago

My application is running on a PaaS and it might not be feasible for me to create a symlink. Is there a way to avoid a symlink? I already have the needed libtesseract.so.3.0.4

Omnipresent commented 7 years ago

Actually, I have the symlink, but it is not in /usr/local/lib

vcap@63~$ ls -al /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3
lrwxrwxrwx 1 vcap vcap 21 Jan  7  2017 /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3 -> libtesseract.so.3.0.4

Is there a way to change its location in pyocr?

jflesch commented 7 years ago

Again, when looking for libraries, Pyocr doesn't look in specific locations. It lets the dynamic linker do the job. You may want to have a look at the environment variable that defines where the dynamic linker looks for libraries: http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html

It does look for a specific library name however. Currently, it can't be changed without patching Pyocr.
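On Linux, the environment variable in question is `LD_LIBRARY_PATH`, and the dynamic linker reads it when the process starts, so it generally has to be set before Python is launched rather than from inside the running interpreter. A minimal sketch of building such an environment for a child process follows; `with_library_dir` is a hypothetical helper, and using the vendor directory from this thread is an assumption about the poster's setup.

```python
import os


def with_library_dir(env, directory, var="LD_LIBRARY_PATH"):
    """Return a copy of env with directory prepended to the linker search path."""
    new_env = dict(env)  # copy, so the caller's mapping is left untouched
    current = new_env.get(var, "")
    new_env[var] = directory + os.pathsep + current if current else directory
    return new_env
```

It could then be used as `subprocess.run([sys.executable, "app.py"], env=with_library_dir(os.environ, "/home/vcap/app/.heroku/vendor/lib"))`, where `app.py` stands in for the real entry point. On a PaaS, setting the variable in the platform's own configuration is likely simpler than relaunching.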

Omnipresent commented 7 years ago

Yeah, I was loading tesseract like that too (which worked):

libname = '/home/vcap/app/.heroku/vendor/lib/libtesseract.so.3'
self.tesseract = cdll.LoadLibrary(libname)

So I guess my dynamic linker is not finding that directory. I'll take a look.