openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

pyocr with latest Tesseract fails with pyocr.error.TesseractError: "Error, unknown command line argument '-psm'\n") #99

Open ddddavidmartin opened 6 years ago

ddddavidmartin commented 6 years ago

Good day,

I'm using pyocr through Paperless on an Ubuntu setup with the tesseract-ocr PPA [0]. With the latest Tesseract version [1], pyocr throws an error.

[0]

cat /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr-artful.list
deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu artful main

[1]

tesseract --version
tesseract 4.0.0-beta.1-302-g3aa9
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.3.0

Traceback:

littlebig@littlebig:~/Dev/paperless$ python3 /home/littlebig/Dev/paperless/src/manage.py document_consumer
Starting document consumer at /home/littlebig/paperless_consumption_dir with inotify
Parsers available: RasterisedDocumentParser
Consuming /home/littlebig/paperless_consumption_dir/BRW90CDB68D60F5_000798.pdf
Processing sheet #1: /tmp/paperless/paperless-b5bgnwtm/convert-0000.pnm -> /tmp/paperless/paperless-b5bgnwtm/convert-0000.unpaper.pnm
[pgm_pipe @ 0x55cbcbdfb980] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55cbcbe00140] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55cbcbe00140] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 290, in image_to_string
    return ocr.image_to_string(f, lang=lang)
  File "/home/littlebig/.local/lib/python3.6/site-packages/pyocr/tesseract.py", line 367, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/littlebig/Dev/paperless/src/manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 98, in handle
    self.loop_inotify(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 131, in loop_inotify
    self.loop_step(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 123, in loop_step
    self.file_consumer.consume_new_files()
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 107, in consume_new_files
    if not self.try_consume_file(file):
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 145, in try_consume_file
    date = parsed_document.get_date()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
    text = self.get_text()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
    self._text = self._get_ocr(images)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
    raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
    r = pool.map(image_to_string, itertools.product(imgs, [lang]))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
littlebig@littlebig:~/Dev/paperless$

Has anyone else come across this? Thanks!

ddddavidmartin commented 6 years ago

Having a look through the pyocr sources this stands out to me:

src/pyocr/builders.py
307-        file_ext = ["txt"]
308:        tess_flags = ["-psm", str(tesseract_layout)]
309-        cun_args = ["-f", "text"]
--
564-        file_ext = ["html", "hocr"]
565:        tess_flags = ["-psm", str(tesseract_layout)]
566-        tess_conf = ["hocr"]
--
640-        file_ext = ["html", "hocr"]
641:        tess_flags = ["-psm", str(tesseract_layout)]
642-        tess_conf = ["hocr"]

Does pyocr just use -psm instead of --psm as the parameter? I'm wondering whether that is not accepted anymore now.

ddddavidmartin commented 6 years ago

> Does pyocr just use -psm instead of --psm as the parameter? I'm wondering whether that is not accepted anymore now.

It looks like this is the problem. I changed the options passed in builders.py to use --psm instead of -psm, and it works fine now. I might create a pull request for this, though I'm not sure whether the change has any other implications.

The commit in question in tesseract is the following: https://github.com/tesseract-ocr/tesseract/commit/ee201e1f4fa277a4b2ecd751a45d3bf1eba6dfdb
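
For illustration, choosing the flag based on the installed Tesseract version could be sketched as follows. This is only a minimal sketch and not the change made in pyocr: the helper names `parse_tesseract_version`, `psm_flag` and `build_tess_flags` are hypothetical, and it assumes that releases before 4.x accept `-psm` while 4.x and later accept only `--psm`, which is consistent with the error above.

```python
import re


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from the output of `tesseract --version`.

    Hypothetical helper: only the leading "tesseract X.Y[.Z]" part is parsed;
    suffixes such as "-beta.1-302-g3aa9" are ignored.
    """
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError("unrecognised version output: %r" % version_output)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def psm_flag(version):
    """Return the page segmentation mode flag for a version tuple.

    Assumption: releases before 4.x take "-psm", 4.x and later take "--psm".
    """
    return "--psm" if version[0] >= 4 else "-psm"


def build_tess_flags(version_output, tesseract_layout):
    """Build the flag list in the same shape as tess_flags in builders.py."""
    version = parse_tesseract_version(version_output)
    return [psm_flag(version), str(tesseract_layout)]
```

With the version string from [1] and a layout value of 3, `build_tess_flags` returns `["--psm", "3"]`; with a 3.x version string it returns `["-psm", "3"]`.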

simonm3 commented 6 years ago

I also came across this today. I note that -psm is used not just in builders.py but also in tesseract.py.
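
To check for any remaining occurrences across the source tree, a small script along these lines could be used. It is a rough sketch: the function name `find_single_dash_psm` is made up for this example, and it only looks for the quoted literal `"-psm"` or `'-psm'` in `.py` files, so flags built up in other ways would be missed.

```python
import re
from pathlib import Path

# Matches a quoted single-dash -psm literal ("-psm" or '-psm') but not "--psm".
SINGLE_DASH_PSM = re.compile(r"""["']-psm["']""")


def find_single_dash_psm(root):
    """Return (path, line number, stripped line) for each hard-coded -psm literal."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SINGLE_DASH_PSM.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Running `find_single_dash_psm("src/pyocr")` from a checkout lists each file and line that still passes the old flag.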

jflesch commented 6 years ago

https://github.com/openpaperwork/pyocr/pull/100

ddddavidmartin commented 6 years ago

I haven't had a chance yet to work out the circular import statements that I introduced in https://github.com/ddddavidmartin/pyocr/tree/update_deprecated_psm_option_string. If anyone wants to step in, feel free to give it a go.

For now, a quick and dirty fix is to just apply https://github.com/openpaperwork/pyocr/pull/100/commits/c136838b46cf49f06ac1dc5f2f9bc16232c11213.