jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

Tesseract: 3.02: Malformed hOCR document: character zones intermixed with non-character zones #8

Open jwilk opened 10 years ago

jwilk commented 10 years ago

Issue reported by anonymous at Bitbucket:

Thank you very much for ocrodjvu. I am using ocrodjvu with the options --engine=tesseract -l deu. Versions are:

tesseract: 3.02
ocrodjvu: 0.7.16

With the attached page I get the following exception:

/usr/share/ocrodjvu/lib/hocr.py:435: EncodingWarning: byte 0x10 in position 25317: control character
  contents = utils.sanitize_utf8(contents)
Exception while processing page 1:
Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 418, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 401, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 271, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 473, in extract_text
    scan_result = scan(doc.find('/body'), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 374, in scan
    for zone in _scan(node, settings, settings.page_size):
  File "/usr/share/ocrodjvu/lib/hocr.py", line 239, in _scan
    return get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 257, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 198, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 285, in _scan
    raise errors.MalformedHocr("character zones intermixed with non-character zones")
MalformedHocr: Malformed hOCR document: character zones intermixed with non-character zones

Attachment: t-p-086.pgm.djvu.zip
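
For readers unfamiliar with the error message, here is a toy sketch of the kind of consistency check it suggests: a node's children should be either all character zones or all non-character zones. This is not ocrodjvu's code. The `MalformedHocr` class here is a stand-in, and `check_children` and the `"char"`/`"word"` labels are hypothetical names chosen for illustration.

```python
class MalformedHocr(Exception):
    """Stand-in for the error named in the traceback (not ocrodjvu's class)."""


def check_children(kinds):
    """Validate the zone-type labels of one node's children.

    `kinds` is a list of labels such as ["char", "char"] or ["word", "word"].
    Raise MalformedHocr if character and non-character zones are mixed.
    """
    has_char = any(k == "char" for k in kinds)
    has_other = any(k != "char" for k in kinds)
    if has_char and has_other:
        raise MalformedHocr("character zones intermixed with non-character zones")
    return kinds


check_children(["char", "char", "char"])  # accepted: only character zones
check_children(["word", "word"])          # accepted: only non-character zones

try:
    check_children(["char", "word"])      # rejected: mixed siblings
except MalformedHocr as exc:
    message = str(exc)
```

Under this reading, the exception would mean that Tesseract's hOCR output placed a character-level element next to a higher-level element under the same parent on this page.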

jwilk commented 10 years ago

I can't reproduce it here. :-( Could you try upgrading Tesseract to 3.02.02 and see if it helps?
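
Since the EncodingWarning above reports a control character (byte 0x10) in the hOCR output, one way to narrow things down is to list where such bytes occur in the raw hOCR. The following is a minimal Python 3 sketch. `find_control_bytes` and the sample data are hypothetical, and whether the control character is actually related to the MalformedHocr error is not established in this thread.

```python
def find_control_bytes(data: bytes):
    """Return (offset, byte) pairs for C0 control bytes in `data`.

    Tab, line feed and carriage return are treated as harmless whitespace.
    """
    allowed = {0x09, 0x0A, 0x0D}
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b not in allowed]


# Made-up hOCR fragment containing a stray 0x10 byte, for illustration only.
sample = b"<span class='ocr_line'>Stra\x10e</span>\n"
hits = find_control_bytes(sample)

for offset, byte in hits:
    print("byte 0x%02x at position %d" % (byte, offset))
```

To apply it to a real file, read the file in binary mode and pass the bytes to the function; the reported offsets can then be compared with the position given in the warning.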

jwilk commented 9 years ago

Ping?