jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

[debian] ocrodjvu: error: OCR engine (ocropus) was not found #20

Closed: ghost closed this issue 5 years ago

ghost commented 7 years ago

On Debian stable I ran aptitude install ocrodjvu, and it installed without error. Then I simply tried to run it without doing anything fancy:

ocrodjvu --in-place my_raster_doc.djvu

and it gives:

ocrodjvu: error: OCR engine (ocropus) was not found

So then I ran apt-file search ocropus which gives ocrodjvu: /usr/share/ocrodjvu/lib/engines/ocropus.py. This file exists, yet ocrodjvu isn't finding it.

ghost commented 7 years ago

Silly workaround: Adding the optional switch --engine= makes it work (even though ocrodjvu already knows the engines available). So you can write:

ocrodjvu --engine="$(ocrodjvu --list-engines | sed -ne 1p)"

In my case it was using "cuneiform" in that arg that made it work.

IMO, ocrodjvu needs to default to an engine from the list of engines that it already knows exist. Or at a bare minimum, the error message should instruct the user to supply the --engine option.

jwilk commented 7 years ago

Thanks for the bug report.

> I ran apt-file search ocropus which gives ocrodjvu: /usr/share/ocrodjvu/lib/engines/ocropus.py. This file exists, yet ocrodjvu isn't finding it.

The file apt-file finds is just some glue code that would let ocrodjvu use OCRopus if it was installed. The last stable release of Debian that shipped with OCRopus was squeeze.
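To illustrate the difference between the glue module and the engine itself, here is a minimal sketch that checks whether an engine's executable can be found on PATH. The helper names are hypothetical, and the sketch assumes the engines are invoked as commands named like the ones you pass in (for example tesseract or cuneiform); how ocrodjvu itself probes for engines is not shown in this thread.

```shell
# have_engine_binary: succeed if the named command can be found on PATH.
# (Hypothetical helper; not part of ocrodjvu.)
have_engine_binary() {
    command -v "$1" >/dev/null 2>&1
}

# report_engine_binaries: print one "name: found" or "name: not found"
# line for each command name given as an argument.
report_engine_binaries() {
    for name in "$@"; do
        if have_engine_binary "$name"; then
            printf '%s: found\n' "$name"
        else
            printf '%s: not found\n' "$name"
        fi
    done
}

# Usage (assumed executable names):
#   report_engine_binaries tesseract cuneiform
```

A "not found" line here only says the executable is missing from PATH; the glue file under /usr/share/ocrodjvu/lib/engines/ can still be present.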

OCRopus was the first OCR engine supported by ocrodjvu, and it remained the default for a very long time. (Arguably, too long.)

I changed the default to Tesseract in ocrodjvu 0.8.

In a way, this is already fixed in Debian unstable: it ships ocrodjvu 0.10.1, and the package has Recommends: tesseract-ocr. (APT installs recommends by default.)

> ocrodjvu --engine="$(ocrodjvu --list-engines | sed -ne 1p)"

This does the trick when you have exactly one supported OCR engine installed and it happens to be the one you want to use. The --list-engines option doesn't guarantee any particular order of items. (AFAICS they are always sorted alphabetically, but that's an implementation accident, not by design.)
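For scripts that should not depend on the output order, one option is to pick from an explicit preference list instead of taking the first line. The following is only a sketch: pick_engine is a hypothetical helper, the preference order is an example, and it assumes --list-engines prints one engine name per line (as the sed -ne 1p workaround implies).

```shell
# pick_engine: read available engine names (one per line) on stdin and
# print the first argument that appears in that list as a whole line.
# Returns 1 if none of the preferred engines is available.
# (Hypothetical helper; not part of ocrodjvu.)
pick_engine() {
    available=$(cat)
    for candidate in "$@"; do
        # -F: fixed string, -x: match the whole line, -q: no output
        if printf '%s\n' "$available" | grep -Fqx -- "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage (example preference order):
#   engine=$(ocrodjvu --list-engines | pick_engine tesseract cuneiform) &&
#       ocrodjvu --engine="$engine" --in-place my_raster_doc.djvu
```

With this approach, installing an additional engine changes the result only if that engine ranks higher in the script's own preference list.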

> IMO, ocrodjvu needs to detect available engines and default to one that exists.

That would mean that merely installing a new OCR engine could change the default, possibly breaking the user's scripts. I'd strongly dislike this kind of non-determinism.

> the error message should instruct the user to supply the --engine option.

That's a good idea, yes.

ghost commented 7 years ago

Thanks for the reply.

> This does the trick when you have exactly one supported OCR engine installed...

The | sed -ne 1p takes only the first line, in case there are more.

> That would mean that merely installing a new OCR engine could change the default, possibly breaking the user's scripts. I'd strongly dislike this kind of non-determinism.

OTOH, if someone were to install a new engine, it could entail the maintenance burden of having to update their scripts. In any case, script writers can choose their poison (hard-code the engine, or use the sed hack).

I would be more concerned with users experiencing ocrodjvu for the first time. Debian users want their tools to just work out of the box, while script writers are more prepared to dig into the options. It's an immediate disappointment when a new package is installed and it pukes right away. Impatient users will just remove the package without investigating.

Possible solutions:

Just a brainstorm. I don't care myself because now I'm aware of the situation.

jwilk commented 5 years ago

> the error message should instruct the user to supply the --engine option.

This is now implemented in 0.11. If you don't have the default engine installed, you get:

ocrodjvu: error: OCR engine (tesseract) was not found; use -e/--engine to use another engine

Users of Debian stretch (or a later version) should normally never see this, because ocrodjvu recommends the default engine (tesseract-ocr).