jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

crashes on non-UTF-8 file identifiers #4

Closed by jwilk 11 years ago

jwilk commented 11 years ago

Issue reported by GStager at Bitbucket:

stager@stager-laptop:~/massocr$ ~/ocrodjvu-0.7.15/ocrodjvu --in-place -e tesseract -t words --html5 --clear-text -lrus+eng 9af500e27db4351d7391f463c0e3f017.djvu
Processing '9af500e27db4351d7391f463c0e3f017.djvu':
Intermediate files were left in the '/tmp/ocrodjvu.z1KiDA' directory.
Traceback (most recent call last):
  File "/home/stager1/ocrodjvu-0.7.15/ocrodjvu", line 7, in <module>
    _.main(sys.argv)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 533, in main
    context.process(options.path, options.pages)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 515, in process
    self._process(*args, **kwargs)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 471, in _process
    file_id = page.file.id.encode(system_encoding)
  File "decode.pyx", line 840, in djvu.decode.File.id.__get__ (djvu/decode.c:7605)
  File "common.pxi", line 128, in djvu.decode.decode_utf8 (djvu/decode.c:2802)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 0: invalid continuation byte


jwilk commented 11 years ago

Thanks for the bug report.

This is really a problem with the DjVu file in question. Its page identifiers are not in UTF-8, but in some legacy 8-bit encoding instead.

I'll add a work-around in ocrodjvu for this, but you should fix the DjVu file. You can do that by converting it to indirect (with djvm), and then perhaps back to bundled.
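
For illustration only, a tolerant decoding step could look roughly like the sketch below. This is not the actual change made in ocrodjvu. The function name `decode_file_id` is hypothetical, and the sketch assumes the identifier is available as raw bytes, which is not how python-djvulibre exposes it in the traceback above (there the decoding fails inside `File.id` itself).

```python
def decode_file_id(raw: bytes) -> str:
    """Decode a DjVu file identifier, tolerating non-UTF-8 bytes.

    Hypothetical helper: a sketch of one possible workaround,
    not the code that ocrodjvu actually uses.
    """
    try:
        # Well-formed documents store identifiers as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Legacy 8-bit identifier: the real encoding is unknown, so
        # substitute U+FFFD for undecodable bytes instead of crashing.
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 0xf2 followed by an ASCII byte is invalid UTF-8, as in the report.
    print(decode_file_id(b"\xf2est.djvu"))
```

Replacing undecodable bytes loses information, so a sketch like this is only suitable for display or for naming intermediate files; repairing the DjVu file remains the proper fix.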

jwilk commented 11 years ago

Fixed in 9a11f6a5101f75257d90a6fcedbe583bed72e366.

jwilk commented 11 years ago

Fixed in 0.7.16.