jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

OCR engine executable path should be configurable #36

Open xelxebar opened 5 years ago

xelxebar commented 5 years ago

Overview

On Void Linux, the tesseract binary resides at /usr/bin/tesseract-ocr due to a naming conflict with the game Tesseract. It would be nice if the paths to the OCR engine could be explicitly specified, e.g. via a command line option, environment variable, or configuration file.
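
To make the request concrete, here is a minimal sketch of the kind of lookup I have in mind. It is not ocrodjvu code: the function name, the `OCRODJVU_TESSERACT` environment variable, and the candidate list are all hypothetical, and it assumes Python 3 for `shutil.which`.

```python
import os
import shutil


def find_engine_executable(cli_value=None,
                           env_var="OCRODJVU_TESSERACT",  # hypothetical variable name
                           candidates=("tesseract", "tesseract-ocr")):
    """Return the path of the OCR engine binary, or None if it cannot be found."""
    # An explicit setting wins: command-line value first, then the environment.
    # If an explicit name is given but cannot be resolved, report failure
    # instead of silently falling back to a different binary.
    for name in (cli_value, os.environ.get(env_var)):
        if name:
            return shutil.which(name)
    # Otherwise try the well-known names in order on PATH.
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None
```

With something like this, Void Linux would work either out of the box (through the second candidate name) or after passing an explicit name or path.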

Version Information

$ ocrodjvu --version
ocrodjvu 0.11
+ Python 2.7.16
+ subprocess32
+ python-djvulibre 0.8.4
+ lxml 4.3.3

$ lsb_release --all
LSB Version:    1.0
Distributor ID: VoidLinux
Description:    Void Linux
Release:    rolling
Codename:   void

Comments

For the moment, I am hacking around this issue by packaging ocrodjvu on my distro with the following patch:

--- a/lib/engines/tesseract.py
+++ b/lib/engines/tesseract.py
@@ -111,7 +111,7 @@
     image_format = image_io.TIFF
     needs_utf8_fix = True

-    executable = utils.property('tesseract')
+    executable = utils.property('tesseract-ocr')
     extra_args = utils.property([], shlex.split)
     use_hocr = utils.property(None, int)
     fix_html = utils.property(0, int)

jwilk commented 5 years ago

It's not documented at the moment, but you can specify the executable via command line with:

-X executable=tesseract-ocr

xelxebar commented 5 years ago

Oh! Nice. Thanks for the quick feedback. Are there any gotchas? If it's a reasonably stable option, would be nice to put it in the docs.

jwilk commented 5 years ago

I considered using the Tesseract API (maybe through tesserocr) instead of the CLI, which would render the executable setting meaningless. But realistically, a switch to the API is unlikely to happen in the foreseeable future.

Yes, -X executable=… (and other -X goodies) should be documented.
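
Until that documentation exists, the general idea behind `-X key=value` can be pictured with the sketch below. It is an illustration only, not ocrodjvu's actual implementation. The `EngineOptions` class and its `apply` method are hypothetical. The option names, defaults, and converters are copied from the patch context above, on the assumption that `utils.property(default, converter)` pairs a default with a function that parses the raw string.

```python
import copy
import shlex


class EngineOptions:
    """Hypothetical stand-in for an engine's tunable settings."""

    # name -> (default value, converter applied to the raw string)
    SCHEMA = {
        "executable": ("tesseract", str),
        "extra_args": ([], shlex.split),
        "use_hocr": (None, int),
        "fix_html": (0, int),
    }

    def __init__(self):
        for name, (default, _convert) in self.SCHEMA.items():
            # Copy so that mutable defaults are not shared between instances.
            setattr(self, name, copy.copy(default))

    def apply(self, assignment):
        """Apply one 'key=value' string, as it would follow -X."""
        key, sep, raw = assignment.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % assignment)
        if key not in self.SCHEMA:
            raise ValueError("unknown engine option: %r" % key)
        _default, convert = self.SCHEMA[key]
        setattr(self, key, convert(raw))


opts = EngineOptions()
opts.apply("executable=tesseract-ocr")
opts.apply("extra_args=--foo 'bar baz'")  # placeholder arguments, not real Tesseract flags
print(opts.executable)  # tesseract-ocr
print(opts.extra_args)  # ['--foo', 'bar baz']
```

Rejecting unknown keys up front means a misspelled option fails loudly rather than being ignored.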