jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

Non-ASCII filenames cause UnicodeEncodeError #23

derrikF closed this issue 6 years ago

derrikF commented 6 years ago

Please add support for Cyrillic characters in file paths. I constantly get this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-14: ordinal not in range(128)

jwilk commented 6 years ago

Non-ASCII filenames used to work in the past, but I accidentally broke them in 0.9.1 (6ad242f742c99ecc17e2622e8ec17870388698c8).

Minimal reproducer:

$ ocrodjvu --dry-run $(printf '\320\220')
Intermediate files were left in the '/tmp/ocrodjvu.V92tY_' directory.
Traceback (most recent call last):
  File ".../ocrodjvu", line 7, in <module>
    _.main(sys.argv)
  File ".../cli/ocrodjvu.py", line 563, in main
    context.process(options.path, options.pages)
  File ".../cli/ocrodjvu.py", line 545, in process
    self._process(*args, **kwargs)
  File ".../cli/ocrodjvu.py", line 465, in _process
    logger.info('Processing {path}:'.format(path=utils.smart_repr(path, system_encoding)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0410' in position 1: ordinal not in range(128)

Thanks for the bug report. I'll get this fixed soon.
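
The traceback above is consistent with a Unicode value being implicitly encoded with the ASCII codec while the log message is formatted. The following is a minimal sketch of that class of failure, not the project's code. It uses Python 3 and an explicit `encode('ascii')` to stand in for the implicit conversion, and `quoted` and `ascii_failure` are hypothetical helpers introduced only for illustration (`quoted` is not `utils.smart_repr`). It also checks that the reproducer's `printf '\320\220'` bytes are the UTF-8 encoding of U+0410.

```python
# The reproducer passes the two octal bytes 0o320 0o220 as the filename.
raw = bytes([0o320, 0o220])
name = raw.decode('utf-8')  # U+0410, CYRILLIC CAPITAL LETTER A


def quoted(text):
    """Hypothetical stand-in for a repr-like helper: wrap text in quotes."""
    return "'" + text + "'"


def ascii_failure(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc
    return None


err = ascii_failure(quoted(name))
# The failing character sits at index 1 because index 0 is the opening quote.
print(ascii(name), err.encoding, err.start, err.reason)
```

An ASCII-only path passes through `ascii_failure` without an error, which would explain why the problem only shows up with non-ASCII filenames such as Cyrillic ones.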

derrikF commented 6 years ago

I just checked, and it seems this error no longer occurs:

~ $ python '/home/derrik/ocrodjvu-master/ocrodjvu' -e tesseract --language rus+eng --in-place "/my_data/Сканы/тест.djvu"
Processing '/my_data/Сканы/тест.djvu':
- Page # 1
tesseract: Tesseract Open Source OCR Engine v4.0.0-beta.3-199-gba757 with Leptonica
tesseract: Page 1
tesseract: Detected 13 diacritics

jwilk commented 6 years ago

Fixed in 0.10.4 (a5a6902a778460f33e7aa0ab1cf79113752d58d5).