jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

-X extra_args='-psm 1' option #40

Closed derrikF closed 3 years ago

derrikF commented 3 years ago

The changelog entry for ocrodjvu 0.7.15 says: "Tesseract: make it possible to pass the -psm option in order to customize layout analysis. For example, to enable OSD, use: -X extra_args='-psm 1'"

An error occurs when I try to use this feature in ocrodjvu 0.12:

ocrodjvu -e tesseract --language rus -X extra_args='-psm 7' --page=163 --in-place


Exception while processing page 163:
Traceback (most recent call last):
  File "/usr/local/share/ocrodjvu/lib/cli/ocrodjvu.py", line 478, in page_thread
    result = self.process_page(page)
  File "/usr/local/share/ocrodjvu/lib/cli/ocrodjvu.py", line 451, in process_page
    result = self._engine.recognize(pfile, language=self._options.language, details=self._options.details, uax29=self._options.uax29)
  File "/usr/local/share/ocrodjvu/lib/engines/tesseract.py", line 284, in recognize
    return f(image, language, details=details, uax29=uax29)
  File "/usr/local/share/ocrodjvu/lib/engines/tesseract.py", line 247, in recognize_hocr
    _wait_for_worker(worker)
  File "/usr/local/share/ocrodjvu/lib/engines/tesseract.py", line 68, in _wait_for_worker
    worker.wait()
  File "/usr/local/share/ocrodjvu/lib/ipc.py", line 129, in wait
    raise CalledProcessError(return_code, self.__command)
CalledProcessError: Command 'tesseract' returned non-zero exit status 1.
Intermediate files were left in the '/tmp/ocrodjvu.iFDyKV' directory.
ocrodjvu 0.12
+ Python 2.7.18
+ subprocess32
+ python-djvulibre 0.8
+ lxml 4.5.0

So should this option work or not?

jsbien commented 3 years ago

There is a typo in your command. It should be extra_args='--psm 7'.

derrikF commented 3 years ago

Thanks. I copied the option from the changelog and didn't think there might be a typo in it.

jwilk commented 2 years ago

FWIW, -psm (with a single hyphen) was correct when that changelog entry was written. The option name was changed in Tesseract only later on:
https://github.com/tesseract-ocr/tesseract/commit/92d981b93a4c54f6727681968451e7de72cc8b69
https://github.com/tesseract-ocr/tesseract/commit/ee201e1f4fa277a4b2ecd751a45d3bf1eba6dfdb
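For anyone scripting around this, here is a minimal sketch of picking the spelling of the option from the installed Tesseract version. The helper name psm_extra_args and the default cutoff of 4.0 are assumptions for illustration, not part of ocrodjvu or Tesseract; the thread only establishes that the option was renamed at some point after the 0.7.15 changelog entry, so check what your own Tesseract build accepts and adjust the cutoff.

```python
import re


def psm_extra_args(version_banner, mode, double_hyphen_since=(4, 0)):
    """Build a value for ocrodjvu's -X extra_args=... option.

    version_banner: text containing the Tesseract version,
        for example the first line printed by `tesseract --version`.
    mode: page segmentation mode number (e.g. 1 for OSD, 7 for a single line).
    double_hyphen_since: ASSUMED (major, minor) version from which the
        double-hyphen spelling is used; this cutoff is a guess, not a
        documented fact, so verify it against your installation.
    """
    match = re.search(r"(\d+)\.(\d+)", version_banner)
    if match is None:
        raise ValueError("no version number found in %r" % (version_banner,))
    version = (int(match.group(1)), int(match.group(2)))
    # Older releases used "-psm"; newer ones use "--psm".
    flag = "--psm" if version >= double_hyphen_since else "-psm"
    return "{} {}".format(flag, mode)
```

The returned string can then be passed as -X extra_args='...' on the ocrodjvu command line, as in the command from the original report.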