openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Implement image_to_pdf for cli tesseract #79

Closed mskrip closed 6 years ago

jflesch commented 6 years ago

1) Hm, adding PDF support is a good idea, but why add a new function to the API instead of using the existing image_to_string() and passing it a new builder object?

You can have a look at pyocr.tesseract.CharBoxBuilder for instance.

An (untested) example:

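class PdfBuilder(builders.BaseBuilder):
    def __init__(self, textonly=False):
        file_ext = ['pdf']
        to = 1 if textonly else 0
        tess_flags = [
            "--psm", "1",
            "-c", "textonly_pdf={}".format(to)
        ]
        tess_conf = ["pdf"]
        cun_args = []  # nothing to pass to Cuneiform
        # Pass the settings to BaseBuilder so they are actually stored
        # (same pattern as CharBoxBuilder)
        super(PdfBuilder, self).__init__(file_ext, tess_flags, tess_conf,
                                         cun_args)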

    @staticmethod
    def read_file(file_descriptor):
        """
        In the case of this builder, we don't return the content. Just 
        """
        return file_descriptor.read()

    @staticmethod
    def write_file(file_descriptor, output):
        # needs: from io import UnsupportedOperation
        raise UnsupportedOperation("nop !")

Then, it could be used this way:

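from PIL import Image
from pyocr.tesseract import PdfBuilder
from pyocr import tesseract

# image_to_string() takes a PIL image; the builder is a keyword argument
pdf = tesseract.image_to_string(Image.open('toto.png'), builder=PdfBuilder())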

Since this builder is specific to Tesseract, put it in pyocr/tesseract.py instead of pyocr/builders.py, and it would be fine.

2) If I'm not mistaken, PDF support is only available in Tesseract >= 4, right? If so, it should be documented. Even better, an assert() on the output of tesseract.get_version() should be added in the builder (you can have a look at tesseract.can_detect_orientation() for an example).

3) If possible, it would be best to add at least one or two tests for this new format. It could avoid regressions later. However, since Tesseract probably never generates the same PDF twice, I'm not sure how it could be done in a convenient manner.
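The version guard suggested in point 2 could be sketched roughly as follows. This is only an illustration: it assumes tesseract.get_version() returns a (major, minor, update) tuple, the helper name require_min_version is hypothetical, and the minimum version is left as a parameter because the first release with PDF support is not settled in this thread.

```python
def require_min_version(version, minimum, feature="PDF output"):
    """Raise RuntimeError if `version` is older than `minimum`.

    Both arguments are (major, minor, update) tuples, the shape assumed
    here for the value returned by tesseract.get_version().
    """
    # Tuples compare element by element, so (3, 2, 0) < (3, 3, 0) < (4, 0, 0)
    if tuple(version) < tuple(minimum):
        raise RuntimeError(
            "{} requires Tesseract >= {}, found {}".format(
                feature,
                ".".join(str(part) for part in minimum),
                ".".join(str(part) for part in version),
            )
        )
```

Inside the builder this might be called as require_min_version(tesseract.get_version(), minimum), with the actual value of minimum chosen once the supported versions have been confirmed.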

mskrip commented 6 years ago
  1. I am not sure it is worth using a builder with image_to_string here at the cost of consistency with the libtesseract OCR tool, which makes switching between tools easier. If it is, I can rework it, no problem.

  2. To my knowledge, Tesseract has supported PDF creation since 3.03-rc1. The syntax should not have changed, although I have not tested it.

  3. I'll add some tests for confirming successful completion at least. Checking contents of the PDF would not be reliable.

jflesch commented 6 years ago

1) Darn, I forgot about libtesseract :/. Anyway, libtesseract needs fixing. It should be using builders as well. But in the case of libtesseract, it's much more complicated (it's a partially-different API that must be used, or PyOCR will have to generate the PDF file by itself .. :/). I will have to work on that. Regarding cli tesseract, please stick to builders as much as possible.

2) Hm, weird, tesseract-ocr 3.05 provided by Ubuntu doesn't seem to contain the config file 'pdf'. Meh, not really a problem anyway :)

3) Yep, could be a good thing. Please take care to check that the size of the output file is != 0 too. Last time I tried with Tesseract 4.x alpha, it returned a success exit code but generated empty PDF files.
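A test along these lines could avoid comparing PDF contents and only check that the output is non-empty and starts with the standard PDF header. This is a minimal sketch, not part of PyOCR; the helper name looks_like_pdf is hypothetical.

```python
import os


def looks_like_pdf(path):
    """Return True if `path` is a non-empty file starting with b"%PDF-"."""
    # Catches the "success exit code but empty file" case
    if os.path.getsize(path) == 0:
        return False
    # Every PDF file begins with the "%PDF-" marker followed by a version
    with open(path, "rb") as fd:
        return fd.read(5) == b"%PDF-"
```

A test case could then run the OCR on a sample image and assert looks_like_pdf() on the generated file, which stays stable even if Tesseract never produces byte-identical output.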

jflesch commented 6 years ago

ping ?

mskrip commented 6 years ago

Sorry too much school work, no idea when I'll have time to get to it :(

jflesch commented 6 years ago

No problem :) I'm going to close this pull request. Feel free to reopen one when you have time.