recent python-ruffus error

sagittarius06 commented 7 years ago

On Arch Linux, ocrmypdf recently stopped working.

$ ocrmypdf -l fra scan0196.pdf out.pdf
ERROR - Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    register_cleanup, touch_files_only)
  File "/usr/lib/python3.6/site-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "/usr/lib/python3.6/site-packages/ocrmypdf/pipeline.py", line 497, in ocr_tesseract_hocr
    log=log
  File "/usr/lib/python3.6/site-packages/ocrmypdf/exec/tesseract.py", line 232, in generate_hocr
    universal_newlines=True, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 405, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 836, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1533, in _communicate
    self.stdout.errors)
  File "/usr/lib/python3.6/subprocess.py", line 735, in _translate_newlines
    data = data.decode(encoding, errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155: invalid start byte

jbarlow83 commented 7 years ago

It seems that tesseract printed an invalid character to its standard output. Maybe this is a tesseract 3.05 issue as that was just released. Please send the file if possible.
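For illustration, the last frame of the traceback can be reproduced without Tesseract at all. The sketch below is only an assumption about the shape of the problem: the byte string is made up, not Tesseract's real output. Byte 0x84 has the bit pattern 10xxxxxx, which UTF-8 reserves for continuation bytes, so it can never start a character.

```python
# Hypothetical output; the real Tesseract message is not shown in this thread.
raw = b"Tesseract says: \x84"

try:
    # Strict decoding, as subprocess does when universal_newlines=True.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# A lenient decode keeps the readable part and substitutes U+FFFD
# (the replacement character) for each undecodable byte.
text = raw.decode("utf-8", errors="replace")
print(text)
```

The lenient decode is one possible way to keep the rest of the output readable; whether that is the right choice depends on what the caller does with the text.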

sagittarius06 commented 7 years ago

Here is an example that fails, scanned on my HP 8600: scan0197.pdf

For info:

$ unpaper -version
6.1
$ tesseract --version
tesseract 3.05.00
 leptonica-1.74
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.11 : libwebp 0.5.2

sagittarius06 commented 7 years ago

Issue resolved with latest update.

It seems it was because of the tesseract language files:

pkgInstallDateLister --explicit
tesseract-data-deu-1:3.04.00-1 2017-03-07 11:30:05
tesseract-data-eng-1:3.04.00-1 2017-03-07 11:30:06
tesseract-data-fra-1:3.04.00-1 2017-03-07 11:30:06

rennefJ commented 6 years ago

I am seeing exactly the same issue. I am using the Homebrew version on macOS 10.13.1.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 155: invalid start byte

I haven't used it in some time, so I am not sure when it stopped working. The ocrmypdf version is 5.4.3 and the tesseract version is 3.05.01. If it really is the tesseract data files, I am using the most recent ones for the 3.05 release.

Any suggestions on how to fix it are appreciated.

jbarlow83 commented 6 years ago

Do you have a file and command line that demonstrates the issue? Or do all files and arguments seem to fail?

Can you run tesseract on an image on its own?
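One way to try that check from Python is sketched below. It assumes `tesseract` is on the PATH and uses the documented `tesseract imagename stdout -l lang` form; the helper names `build_tesseract_command` and `run_tesseract_alone` are hypothetical and not part of OCRmyPDF. Output is captured as raw bytes, so a bad byte cannot raise a decode error during the check.

```python
import shutil
import subprocess


def build_tesseract_command(image, lang):
    """Build the argument list for a standalone Tesseract run (hypothetical helper)."""
    # "stdout" as the output base asks Tesseract to write the text to stdout.
    return ["tesseract", image, "stdout", "-l", lang]


def run_tesseract_alone(image, lang):
    """Run Tesseract directly and return (returncode, stdout_bytes, stderr_bytes)."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract was not found on PATH")
    proc = subprocess.run(
        build_tesseract_command(image, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # No decoding here: the caller can inspect the raw bytes for invalid UTF-8.
    return proc.returncode, proc.stdout, proc.stderr
```

For example, `run_tesseract_alone("page.png", "deu")` on a single page image would show whether the problem appears without ocrmypdf in the loop; `page.png` is a placeholder file name.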

rennefJ commented 6 years ago

Thank you for asking these questions. I was able to solve the issue. The error only occurred when selecting German as the language. It turns out I was using the German tessdata file for the 4.0 branch instead of the 3.05 branch. This was just my mistake: since I don't know how to download the German language file automatically, I always download it manually.

jbarlow83 commented 6 years ago

@rennefJ I added a change to v5.4.4 that should print a helpful error instead of suppressing the error from tesseract. Could you test it again with v5.4.4 and let me know what happens? I was not able to replicate it exactly by replacing the 3.05 tessdata with 4.00.
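A rough sketch of what such a change could look like follows. This is an assumption for illustration, not the actual v5.4.4 code: the function name `run_and_decode`, the logger name, and the log wording are all hypothetical. The idea is to capture raw bytes, attempt a strict UTF-8 decode, and fall back to a lenient decode with a clear error message instead of letting `UnicodeDecodeError` propagate.

```python
import logging
import subprocess
import sys

log = logging.getLogger("ocr_wrapper")  # hypothetical logger name


def run_and_decode(args, timeout=None):
    """Run a command and return its combined output as text, tolerating bad bytes."""
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        timeout=timeout,
    )
    try:
        return proc.stdout.decode("utf-8")
    except UnicodeDecodeError:
        # Report the likely cause, then keep going with a lenient decode.
        log.error(
            "command output was not valid UTF-8; "
            "the language data may not match the installed program version"
        )
        return proc.stdout.decode("utf-8", errors="replace")


# Demo with a child Python process that emits the same invalid byte (0x84).
demo = run_and_decode(
    [sys.executable, "-c", "import sys; sys.stdout.buffer.write(b'ok \\x84')"]
)
print(repr(demo))
```

The trade-off in this sketch is that the run continues with a replacement character in the text, so the error message is the only sign that something was wrong with the output.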

rennefJ commented 6 years ago

@jbarlow83 I updated to v5.4.4 and ran it with the wrong tessdata file again. It now prints 100k lines of text to the console. The first line is the following error message:

ERROR - 1: [tesseract] command line output was not utf-8. This usually means Tesseract's language packs do not match the installed version of Tesseract.

The rest are INFO-level messages. I have attached the console output as a compressed text file: ocymypdf_testlog.txt.zip

In the end it creates the output PDF, which it did not do before, and writes:

INFO - Output file is a PDF/A-2B (as expected)

The first line is a useful error message, but it could get drowned out by all the other output.

ocrmypdf / OCRmyPDF

recent python-ruffus error #140