Open wwqgtxx opened 6 years ago
The problem here is that init() provides a handle that must be freed with cleanup(). And with PyOCR's current API, it's hard to figure out the best time to free it.
Some programs may want to keep the same handle for as long as they are running, but others (like Paperwork, for instance) prefer to have it freed as soon as it is no longer used.
So I think this change implies changing the API in a non-backward-compatible way. The API is the same for all the modules, so it would have to be changed in all the others too.
My own patch adds an optional keyword argument, tesseract_raw_handle, to image_to_string():
-def image_to_string(image, lang=None, builder=None):
+def image_to_string(image, lang=None, builder=None, tesseract_raw_handle=None):
     if builder is None:
         builder = builders.TextBuilder()
-    handle = tesseract_raw.init(lang=lang)
+    if tesseract_raw_handle is None:
+        handle = tesseract_raw.init(lang=lang)
+    else:
+        handle = tesseract_raw_handle

     lvl_line = tesseract_raw.PageIteratorLevel.TEXTLINE
     lvl_word = tesseract_raw.PageIteratorLevel.WORD

     try:
-        # XXX(Jflesch): Issue #51:
-        # Tesseract TessBaseAPIRecognize() may segfault when the target
-        # language is not available
-        clang = lang if lang else "eng"
-        for lang_item in clang.split("+"):
-            if lang_item not in tesseract_raw.get_available_languages(handle):
-                raise TesseractError(
-                    "no lang",
-                    "language {} is not available".format(lang_item)
-                )
+        if tesseract_raw_handle is None:
+            # XXX(Jflesch): Issue #51:
+            # Tesseract TessBaseAPIRecognize() may segfault when the target
+            # language is not available
+            clang = lang if lang else "eng"
+            for lang_item in clang.split("+"):
+                if lang_item not in tesseract_raw.get_available_languages(handle):
+                    raise TesseractError(
+                        "no lang",
+                        "language {} is not available".format(lang_item)
+                    )

         tesseract_raw.set_page_seg_mode(
             handle, builder.tesseract_layout
@@ -159,7 +163,8 @@ def image_to_string(image, lang=None, builder=None):
                     break

     finally:
-        tesseract_raw.cleanup(handle)
+        if tesseract_raw_handle is None:
+            tesseract_raw.cleanup(handle)

     return builder.get_output()
and I init and clean up the handle myself:
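from pyocr import builders, libtesseract

# `images` is an iterable of images defined elsewhere.
tesseract_raw_handle = libtesseract.tesseract_raw.init("eng")
try:
    for image in images:
        libtesseract.image_to_string(
            image,
            lang="eng",
            builder=builders.DigitBuilder(7),
            tesseract_raw_handle=tesseract_raw_handle
        )
finally:
    # Free the shared handle exactly once, after the whole batch.
    libtesseract.tesseract_raw.cleanup(tesseract_raw_handle)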
Maybe adding a new class-based API, such as an ImageToString class, is another way to solve this problem. We could use weakref.finalize to force a call to cleanup() when the ImageToString instance is garbage-collected, in case the user forgets to free the handle. Of course, telling users to write "with ImageToString() as i:" so that cleanup() is called in __exit__ would be the best way.
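A minimal sketch of that class-based idea, assuming init and cleanup callables shaped like tesseract_raw.init(lang=...) and tesseract_raw.cleanup(handle). Only the ImageToString name comes from the comment above; the constructor arguments, the close() method, and the handle attribute are hypothetical and not part of PyOCR.

```python
import weakref


class ImageToString:
    """Hypothetical wrapper that owns one handle and frees it exactly once."""

    def __init__(self, init, cleanup, lang="eng"):
        # init/cleanup are injected so the sketch does not depend on PyOCR;
        # with PyOCR they would be tesseract_raw.init and tesseract_raw.cleanup.
        self.handle = init(lang=lang)
        # finalize() runs cleanup(handle) when this object is garbage-collected
        # or at interpreter exit, unless close() already ran it. The callback
        # must not reference self, or the object would never be collected.
        self._finalizer = weakref.finalize(self, cleanup, self.handle)

    def close(self):
        # Calling a finalizer more than once is safe: only the first call
        # invokes cleanup.
        self._finalizer()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions
```

Used as "with ImageToString(init, cleanup) as ocr:", the handle is freed at the end of the block; if the user never closes it, the finalizer acts as a safety net.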
Interesting idea. But it's still a new API. So I'll consider it, but for the next major version (PyOCR2 :).
One more note: before reusing the handle, we need to call TessBaseAPIClearAdaptiveClassifier. Otherwise, recognizing one picture changes Tesseract's internal state, and that can affect the recognition of the next, different picture.
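A reuse loop with that reset step could look roughly like this. TessBaseAPIClearAdaptiveClassifier is the Tesseract C API function named above; recognize_all and its recognize and clear_classifier parameters are hypothetical names, injected here so the sketch runs without Tesseract installed.

```python
def recognize_all(images, handle, recognize, clear_classifier):
    """Run OCR on each image with one shared handle, resetting it between images."""
    results = []
    for image in images:
        # Drop what the adaptive classifier learned from the previous image so
        # it cannot influence this one. With ctypes this would be
        # lib.TessBaseAPIClearAdaptiveClassifier(handle), where lib is the
        # loaded libtesseract shared library (an assumption of this sketch).
        clear_classifier(handle)
        results.append(recognize(handle, image))
    return results
```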
When I call the image_to_string() function frequently, I find that the tesseract_raw.init() call uses most of the CPU time (according to pstat). Reading the code of image_to_string(), I found that it calls init() to get a new libtesseract handle on every call. My suggestion is to cache the libtesseract handle, either in a thread-local cache or in a class, so that it can be reused. I expect this would make programs run faster. Thanks.
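The thread-local variant of this suggestion might be sketched as follows. The helper names get_cached_handle and release_cached_handles are hypothetical, and init and cleanup stand in for tesseract_raw.init and tesseract_raw.cleanup so that the example does not require PyOCR.

```python
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def get_cached_handle(init, lang="eng"):
    """Return this thread's handle for lang, creating it on first use."""
    cache = getattr(_local, "handles", None)
    if cache is None:
        cache = _local.handles = {}
    if lang not in cache:
        # The expensive init() call happens once per thread and language.
        cache[lang] = init(lang=lang)
    return cache[lang]


def release_cached_handles(cleanup):
    """Free every handle cached by the calling thread."""
    cache = getattr(_local, "handles", {})
    while cache:
        _, handle = cache.popitem()
        cleanup(handle)
```

Each thread has to call release_cached_handles() itself before it exits, so the question of when to free the handle does not go away; it only moves to the caller.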