ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

--sidecar creates a blank .txt file with no text #194

Closed dev-code-davis closed 7 years ago

dev-code-davis commented 7 years ago

Hi, as mentioned in the documentation, I tried the --sidecar option to create a .txt file containing only the OCR text, since I'm only interested in getting the OCR text and not a PDF with an OCR text layer.

However, when using

ocrmypdf --sidecar test.txt test.pdf test_out.pdf

I just get a PDF with an OCR layer added and an empty .txt file.

Am I doing something incorrectly?

jbarlow83 commented 7 years ago

Was any text added to the PDF?

Check in your PDF viewer if you can select text and paste it somewhere.

I probably won't be able to help much further unless you include the input file.

dev-code-davis commented 7 years ago

Thanks for your quick reply. Unfortunately I can't share the original PDFs, as the client has marked them confidential. However, I tested with a sample PDF I found online and got the same result: the PDF gets OCR'ed and the text can be easily copied, but the .txt file is still blank.

Source I used: http://solutions.weblite.ca/pdfocrx/scansmpl.pdf

PDF output: https://www.dropbox.com/s/ke49hm4mv2pem0c/scansmpl_out.pdf?dl=0
.txt output: blank

OS: "Ubuntu 16.04.2 LTS"
Tesseract:
tesseract 3.04.01
 leptonica-1.73
  libgif 5.1.2 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.0

EDIT: ocrmypdf --version: 5.4.1

dev-code-davis commented 7 years ago

By the way, I just noticed that when the .txt file is first created, the OSX Finder shows the text file icon as if it had content (lines can be seen), but after a split second the icon shows as blank again. P.S. I'm using an NFS share between OSX and the Ubuntu virtual machine.

EDIT: The icon change seems to be regular OSX behaviour when a new file is created and should be ignored.

jbarlow83 commented 7 years ago

Regression in v5.4.1, fixed for v5.4.2 (which I just pushed now, so it should be available on PyPI in ~15 minutes)

dev-code-davis commented 7 years ago

@jbarlow83 Thanks for the fix. The command works from the command line. However, I'm trying to understand whether the following problem is somehow related (which is why I'm not creating a new ticket).

I'm trying to execute the command from a PHP script. For example, the following command works correctly and produces an OCR'ed PDF file.

$c = ('ocrmypdf input.pdf output.pdf 2>&1');
exec($c, $output);
print_r($output);

However, when I try to use the --sidecar argument, I just get a blank text file WITHOUT any PDF (as well as an error).

Code:

$c = ('ocrmypdf -l lav --sidecar text_output.txt input.pdf output.pdf --pdf-renderer tesseract --output-type pdf 2>&1');
exec($c, $output);
print_r($output);

Error:

Array
(
    [0] => ERROR - Traceback (most recent call last):
    [1] => File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
    [2] => register_cleanup, touch_files_only)
    [3] => File "/usr/local/lib/python3.5/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
    [4] => ret_val = user_defined_work_func(*params)
    [5] => File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pipeline.py", line 968, in merge_sidecars
    [6] => write_pages(out)
    [7] => File "/usr/local/lib/python3.5/dist-packages/ocrmypdf/pipeline.py", line 949, in write_pages
    [8] => txt = in_.read()
    [9] => File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    [10] => return codecs.ascii_decode(input, self.errors)[0]
    [11] => UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 9: ordinal not in range(128)
)

jbarlow83 commented 7 years ago

You'll need to ensure that the locale is properly configured for Python 3. It appears that PHP runs the command in an ASCII locale, so Python falls back to the ASCII codec when it reads the sidecar text. Try env LANG=C.UTF-8 ocrmypdf....
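To illustrate why the locale matters, here is a minimal sketch that reproduces the same kind of failure outside ocrmypdf. The sample string and the helper name decode_sidecar are made up for illustration and are not taken from ocrmypdf's code. The sketch only assumes that the OCR text is stored as UTF-8 bytes and that Python decodes it with whatever encoding the locale suggests.

```python
# Hypothetical sample: UTF-8 encoded text with a non-ASCII letter.
# "ā" is encoded as the two bytes 0xc4 0x81, and it sits at byte offset 9.
sample = "Latvijas āboli".encode("utf-8")


def decode_sidecar(raw: bytes, encoding: str) -> str:
    """Decode raw OCR text with an explicit encoding (illustrative helper)."""
    return raw.decode(encoding)


# Under an ASCII locale, Python effectively does this and fails.
try:
    decode_sidecar(sample, "ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# Under a UTF-8 locale (for example LANG=C.UTF-8), the same bytes decode fine.
utf8_text = decode_sidecar(sample, "utf-8")

print(ascii_error)
print(utf8_text)
```

The first print shows a UnicodeDecodeError message about byte 0xc4, like the one in the traceback above; the second shows the text decoded correctly.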

Regarding what you're trying to do, there is a third-party ocrmypdf-web program that creates an HTTP API for ocrmypdf, so that might be an option.

jbarlow83 commented 7 years ago

As of v5.4.3 ocrmypdf should refuse to work in an ASCII locale.
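A check of this kind could look roughly like the sketch below. This is not ocrmypdf's actual implementation: the function names is_ascii_encoding and refuse_ascii_locale are hypothetical, and the sketch assumes that inspecting Python's preferred encoding is enough to detect an ASCII locale.

```python
import codecs
import locale


def is_ascii_encoding(name: str) -> bool:
    """Return True if the encoding name is an alias of ASCII."""
    try:
        # codecs.lookup normalizes aliases such as "ANSI_X3.4-1968".
        return codecs.lookup(name).name == "ascii"
    except LookupError:
        # Unknown encoding names are treated as "not ASCII" here.
        return False


def refuse_ascii_locale() -> None:
    """Exit early if the process would decode text as ASCII (hypothetical guard)."""
    preferred = locale.getpreferredencoding(False)
    if is_ascii_encoding(preferred):
        raise SystemExit(
            "ASCII locale detected; set a UTF-8 locale, e.g. LANG=C.UTF-8"
        )
```

Calling refuse_ascii_locale() at program startup would turn the confusing mid-pipeline traceback into a clear message before any work is done.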