derrikF closed this issue 6 years ago
Non-ASCII filenames used to work, but I accidentally broke them in 0.9.1 (6ad242f742c99ecc17e2622e8ec17870388698c8).
Minimal reproducer:
$ ocrodjvu --dry-run $(printf '\320\220')
Intermediate files were left in the '/tmp/ocrodjvu.V92tY_' directory.
Traceback (most recent call last):
  File ".../ocrodjvu", line 7, in <module>
    _.main(sys.argv)
  File ".../cli/ocrodjvu.py", line 563, in main
    context.process(options.path, options.pages)
  File ".../cli/ocrodjvu.py", line 545, in process
    self._process(*args, **kwargs)
  File ".../cli/ocrodjvu.py", line 465, in _process
    logger.info('Processing {path}:'.format(path=utils.smart_repr(path, system_encoding)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0410' in position 1: ordinal not in range(128)
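For anyone who wants to see the failure mode in isolation, here is a minimal sketch in Python 3. It only illustrates the general problem (a strict ASCII encode of text containing U+0410); it is not the ocrodjvu code path, which ran under Python 2 judging by the u'\u0410' in the traceback. The helpers `ascii_failure` and `ascii_safe` are hypothetical names made up for this example, the quoted form of the path is an assumption, and `backslashreplace` is just one possible way to keep a log message ASCII-only, not necessarily how ocrodjvu handles it.

```python
# printf '\320\220' in the reproducer emits the bytes 0xD0 0x90,
# which are the UTF-8 encoding of U+0410 (CYRILLIC CAPITAL LETTER A).
raw_name = b"\320\220"
name = raw_name.decode("utf-8")

# Assumption: the text being logged is a quoted form of the path.
quoted = "'" + name + "'"


def ascii_failure(text):
    """Hypothetical helper: return the UnicodeEncodeError raised by a strict
    ASCII encode of text, or None if text is pure ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


def ascii_safe(text):
    """Hypothetical helper: escape non-ASCII characters instead of failing."""
    return text.encode("ascii", "backslashreplace").decode("ascii")


if __name__ == "__main__":
    err = ascii_failure(quoted)
    # Show where the strict encode gave up, then the escaped alternative.
    print(err.reason, "at position", err.start)
    print(ascii_safe(quoted))
```

With the quoted form, the offending character sits at index 1, right after the opening quote, which is consistent with "position 1" in the traceback above.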
Thanks for the bug report. I'll get this fixed soon.
I just checked, and it seems this error no longer occurs:
~ $ python '/home/derrik/ocrodjvu-master/ocrodjvu' -e tesseract --language rus+eng --in-place "/my_data/Сканы/тест.djvu"
Processing '/my_data/Сканы/тест.djvu':
- Page # 1
tesseract: Tesseract Open Source OCR Engine v4.0.0-beta.3-199-gba757 with Leptonica
tesseract: Page 1
tesseract: Detected 13 diacritics
Fixed in 0.10.4 (a5a6902a778460f33e7aa0ab1cf79113752d58d5).
Please add support for Cyrillic characters in file paths. I constantly get this error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-14: ordinal not in range(128)