Closed mskrip closed 6 years ago
I am not sure whether it is worth using builder here with image_to_string
and not maintaining consistency with libtesseract
ocr tool for better tool switching. If so I can remake it no problem.
To my knowledge Tesseract is supporting PDF creation since 3.03-rc1
. Syntax should not have changed although I have not tested it.
I'll add some tests for confirming successful completion at least. Checking contents of the PDF would not be reliable.
1) Darn, I forgot about libtesseract :/. Anyway, libtesseract needs fixing. It should be using builders as well. But in the case of libtesseract, it's much more complicated (it's a partially-different API that must be used, or PyOCR will have to generate the PDF file by itself .. :/). I will have to work on that. Regarding cli tesseract, please stick to builders as much as possible.
2) Hm, weird, tesseract-ocr 3.05 provided by Ubuntu doesn't seem to contain the config file 'pdf'. Meh, not really a problem anyway :)
3) Yep, could be a good thing. Please take care of checking that the size of the output file is != 0 too. Last time I tried with Tesseract 4.x alpha, it returned a success error code but generated empty pdf files.
ping ?
Sorry too much school work, no idea when I'll have time to get to it :(
No problem :) I'm going to close this pull request. Feel free to reopen one when you have time.
1) Hm, adding pdf support is a good idea, but why add a new function to the API instead of using the existing
image_to_string()
and passing it a new builder object ?You can have a look at
pyocr.tesseract.CharBoxBuilder
for instance.An (untested) example:
Then, it could be used this way:
Since this builder is specific to Tesseract, put in
pyocr/tesseract.py
instead ofpyocr/builders.py
, and it would be fine.2) If I'm not mistaken, PDF support is only available since Tesseract >= 4, right ? If so, it should be documented. Even better, an
assert()
should be added in the builder regarding the output oftesseract.get_version()
(you can have a look attesseract.can_detect_orientation()
for an example).3) If possible, it would be best to add at least one or two tests regarding this new format. It could avoid regressions later. However, since Tesseract probably never generate twice the same PDF, I'm not sure how it could be done in a convenient manner.