openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

[Libtesseract] Reduce calls to tesseract_raw.init() #89

Open wwqgtxx opened 6 years ago

wwqgtxx commented 6 years ago

When I call image_to_string() frequently, I find that the calls to tesseract_raw.init() use most of the CPU time (according to pstat). Reading the code of image_to_string(), I found that it calls init() to get a libtesseract handle on every call. My suggestion is to cache the libtesseract handle, either in a thread-local cache or in a class-based cache, and reuse it. I expect this would make programs run faster. Thanks.
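A minimal sketch of the thread-local cache suggested above, not code from pyocr. The helper names get_handle and release_handles are hypothetical, and init and cleanup are passed in as parameters; in real use they would presumably be tesseract_raw.init and tesseract_raw.cleanup.

```python
import threading

# One cache per thread, so a handle is never shared across threads.
_local = threading.local()


def get_handle(lang, init):
    """Return this thread's cached handle for `lang`, creating it on first use.

    `init` is assumed to accept a `lang` keyword, like tesseract_raw.init().
    """
    cache = getattr(_local, "handles", None)
    if cache is None:
        cache = _local.handles = {}
    if lang not in cache:
        cache[lang] = init(lang=lang)
    return cache[lang]


def release_handles(cleanup):
    """Free every handle cached by the calling thread.

    `cleanup` is assumed to take a single handle, like tesseract_raw.cleanup().
    """
    cache = getattr(_local, "handles", {})
    while cache:
        _, handle = cache.popitem()
        cleanup(handle)
```

With this shape init() runs once per thread and language instead of once per call, but the caller still has to decide when to call release_handles().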

jflesch commented 6 years ago

The problem here is that init() provides a handle that must be freed with cleanup(). And with Pyocr's current API, it's hard to figure out the best time to free it. Some programs may want to keep the same handle for as long as they are running, but others (like Paperwork, for instance) prefer to have it freed when it is no longer used.

So I think this change would imply changing the API in a non-backward-compatible way. The API is the same for all the modules, so it would have to be changed in all the others too.

wwqgtxx commented 6 years ago

My own patch adds an optional keyword argument to image_to_string():

-def image_to_string(image, lang=None, builder=None):
+def image_to_string(image, lang=None, builder=None, tesseract_raw_handle=None):
     if builder is None:
         builder = builders.TextBuilder()
-    handle = tesseract_raw.init(lang=lang)
+    if tesseract_raw_handle is None:
+        handle = tesseract_raw.init(lang=lang)
+    else:
+        handle = tesseract_raw_handle
 
     lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
     lvl_word = tesseract_raw.PageIteratorLevel.WORD
 
     try:
-        # XXX(Jflesch): Issue #51:
-        # Tesseract TessBaseAPIRecognize() may segfault when the target
-        # language is not available
-        clang = lang if lang else "eng"
-        for lang_item in clang.split("+"):
-            if lang_item not in tesseract_raw.get_available_languages(handle):
-                raise TesseractError(
-                    "no lang",
-                    "language {} is not available".format(lang_item)
-                )
+        if tesseract_raw_handle is None:
+            # XXX(Jflesch): Issue #51:
+            # Tesseract TessBaseAPIRecognize() may segfault when the target
+            # language is not available
+            clang = lang if lang else "eng"
+            for lang_item in clang.split("+"):
+                if lang_item not in tesseract_raw.get_available_languages(handle):
+                    raise TesseractError(
+                        "no lang",
+                        "language {} is not available".format(lang_item)
+                    )
 
         tesseract_raw.set_page_seg_mode(
             handle, builder.tesseract_layout
...
@@ -159,7 +163,8 @@ def image_to_string(image, lang=None, builder=None):
                 break
 
     finally:
-        tesseract_raw.cleanup(handle)
+        if tesseract_raw_handle is None:
+            tesseract_raw.cleanup(handle)
 
     return builder.get_output()

and I init and clean up the handle myself:

    from pyocr import builders, libtesseract

    # `images` is an iterable of PIL images prepared by the caller.
    tesseract_raw_handle = libtesseract.tesseract_raw.init("eng")
    try:
        for image in images:
            libtesseract.image_to_string(
                image,
                lang="eng",
                builder=builders.DigitBuilder(7),
                tesseract_raw_handle=tesseract_raw_handle
            )
    finally:
        libtesseract.tesseract_raw.cleanup(tesseract_raw_handle)

wwqgtxx commented 6 years ago

Maybe adding a new class-based API, such as an ImageToString class, is another way to solve this problem. We could use weakref.finalize to force a call to cleanup() when the ImageToString instance is garbage-collected, in case the user forgets to free the handle. Of course, telling users to write "with ImageToString() as i:" so that cleanup() is called in __exit__ would be the best way.
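The class-based idea could look roughly like the sketch below. It is an illustration only: the ImageToString class does not exist in pyocr, and the raw parameter is an assumption standing in for any object that exposes init(lang=...) and cleanup(handle), such as pyocr.libtesseract.tesseract_raw.

```python
import weakref


class ImageToString:
    """Hypothetical wrapper that owns one handle for its whole lifetime."""

    def __init__(self, raw, lang="eng"):
        # `raw` is assumed to expose init(lang=...) and cleanup(handle).
        handle = raw.init(lang=lang)
        self.handle = handle
        # Safety net: if close() is never called, cleanup still runs when the
        # instance is garbage-collected or at interpreter exit. The callback
        # must not reference `self`, or the instance would never be collected.
        self._finalizer = weakref.finalize(self, raw.cleanup, handle)

    def close(self):
        # A finalizer invokes its callback at most once, so calling close()
        # again (or being collected later) does not free the handle twice.
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the block
```

The with form frees the handle deterministically, while the finalizer only covers the case where the caller forgets to close it.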

jflesch commented 6 years ago

Interesting idea. But it is still a new API. So I'll consider it, but for the next major version (PyOCR2 :).

wwqgtxx commented 6 years ago

One note: before reusing the handle, we need to call TessBaseAPIClearAdaptiveClassifier, so that recognizing one picture does not affect the next through changes to Tesseract's internal state.
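A rough sketch of how that reset could be wired into a loop that reuses one handle. TessBaseAPIClearAdaptiveClassifier is part of the Tesseract C API, but the two helpers here are hypothetical, and whether pyocr's tesseract_raw already wraps that function is not assumed. The recognize callable stands in for whatever performs the OCR on a given handle.

```python
import ctypes


def bind_clear_adaptive_classifier(lib):
    """Declare the C signature on an already-loaded libtesseract.

    `lib` is assumed to be a ctypes.CDLL for libtesseract, loaded by the caller.
    """
    fn = lib.TessBaseAPIClearAdaptiveClassifier
    fn.argtypes = [ctypes.c_void_p]  # TessBaseAPI* handle
    fn.restype = None
    return fn


def recognize_all(handle, images, recognize, clear_classifier):
    """Run `recognize(handle, image)` on each image with a shared handle.

    The adaptive classifier is cleared before every image so that what was
    learned from one picture does not carry over to the next.
    """
    results = []
    for image in images:
        clear_classifier(handle)
        results.append(recognize(handle, image))
    return results
```

Clearing before the first image is redundant but harmless, and it keeps the loop body uniform.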