Closed hurbeana closed 3 years ago
This works for me with the latest version of ocrmypdf, so please try the latest version.
Sorry for responding so late.
I have now retried everything by installing the latest version of ocrmypdf (11.3.2.post9+g4fc7d6d9) via pip from the repo and the issue still persists. Here is the output:
ocrmypdf 11.3.2.post9+g4fc7d6d9
Running: ['tesseract', '--list-langs']
The 'hocr' PDF renderer is known to cause problems with one or more of the languages in your document. Use --pdf-renderer auto (the default) to avoid this issue.
Running: ['unpaper', '--version']
Found unpaper 6.1
Running: ['tesseract', '--version']
Found tesseract 4.0.0
Running: ['gs', '--version']
Found gs 9.27
pikepdf mmap disabled
os.symlink(/data/problem_page.pdf, /tmp/com.github.ocrmypdf.k4i2m2hx/origin)
os.symlink(/tmp/com.github.ocrmypdf.k4i2m2hx/origin, /tmp/com.github.ocrmypdf.k4i2m2hx/origin.pdf)
pikepdf mmap disabled
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 230.05page/s]
Using Tesseract OpenMP thread limit 1
pikepdf mmap disabled
1 page already has text! - rasterizing text and running OCR anyway
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.164090x150.164090', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.k4i2m2hx/origin.pdf']
1 Rotating output by 0
1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/com.github.ocrmypdf.k4i2m2hx/000001_rasterize_preview.jpg', 'stdout']
1 page is facing ⇩, confidence 1.87 - confidence too low to rotate
1 Rasterize with png16m, rotation 0
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r150.164090x150.164090', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.k4i2m2hx/origin.pdf']
1 Rotating output by 0
1 Running: ['unpaper', '-v', '--dpi', '150.16409', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpkq9q__px/input.pnm', '/tmp/tmpkq9q__px/output.ppm']
1 stdout/stderr = unpaper: error: unable to open file /tmp/tmpkq9q__px/input.pnm: wrong stream
Try 'man unpaper' for more information.
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
-------------------------------------------------------------------------------
Processing sheet #1: /tmp/tmpkq9q__px/input.pnm -> /tmp/tmpkq9q__px/output.ppm
OCR: 0%| | 0.0/1.0 [00:02<?, ?page/s]
An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 189, in exec_page_sync
ocr_image, preprocess_out = make_intermediate_images(
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 158, in make_intermediate_images
ocr_image = preprocess(
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 105, in preprocess
image = preprocess_clean(image, page_context)
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_pipeline.py", line 478, in preprocess_clean
unpaper.clean(input_file, output_file, dpi.x, page_context.options.unpaper_args)
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_exec/unpaper.py", line 123, in clean
run(input_file, output_file, dpi, unpaper_args)
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_exec/unpaper.py", line 82, in run
external_run(
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/subprocess.py", line 68, in run
proc = subprocess_run(args, env=env, **kwargs)
File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['unpaper', '-v', '--dpi', '150.16409', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpkq9q__px/input.pnm', '/tmp/tmpkq9q__px/output.ppm']' returned non-zero exit status 1.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
exec_concurrent(context)
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
exec_progress_pool(
File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_concurrent.py", line 112, in exec_progress_pool
result = results.next()
File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 865, in next
raise value
subprocess.CalledProcessError: Command '['unpaper', '-v', '--dpi', '150.16409', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpkq9q__px/input.pnm', '/tmp/tmpkq9q__px/output.ppm']' returned non-zero exit status 1.
Temporary working files retained at:
/tmp/com.github.ocrmypdf.k4i2m2hx
There seems to be some issue with calling unpaper. I have also tried building unpaper from source, to no avail.
Can you provide a test file?
There is one in the issue under "Example file". The problem_page.pdf.
Try this patch
diff --git a/src/ocrmypdf/_exec/unpaper.py b/src/ocrmypdf/_exec/unpaper.py
index e17ebb1..3619229 100644
--- a/src/ocrmypdf/_exec/unpaper.py
+++ b/src/ocrmypdf/_exec/unpaper.py
@@ -79,6 +79,9 @@ def run(input_file, output_file, dpi, mode_args):
# This should ensure that a user cannot clobber some other file with
# their unpaper arguments (whether intentionally or otherwise)
args_unpaper.extend([os.fspath(input_pnm), os.fspath(output_pnm)])
+ from shutil import copy
+
+ copy(input_pnm, input_file.with_suffix(input_pnm.suffix))
external_run(
args_unpaper,
close_fds=True,
And check if the .pnm file is readable in an image editor. It should be the only .pnm file.
Then try passing it to unpaper
unpaper -v --dpi 150.16409 --layout none --mask-scan-size 100 --no-border-align --no-mask-center --no-grayfilter --no-blackfilter --no-deskew file_from_above.pnm output.pnm
Can unpaper process the file?
Based on when "wrong stream" can occur from unpaper, it's likely your unpaper is miscompiled.
https://github.com/unpaper/unpaper/blob/a88ed4e47d84a4831ad1828b89e98d289676ecfc/file.c#L56
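As a complement to opening the .pnm in an image editor, the header can also be checked programmatically. The following is only a minimal sketch: `read_pnm_header` is a hypothetical helper, not part of ocrmypdf or unpaper, and it validates only the Netpbm magic number and the dimensions, not the pixel data.

```python
from pathlib import Path

# Magic numbers of the Netpbm family (PBM/PGM/PPM, ASCII and binary variants).
PNM_MAGICS = {b"P1", b"P2", b"P3", b"P4", b"P5", b"P6"}


def read_pnm_header(path):
    """Return (magic, width, height) for a Netpbm file, or raise ValueError."""
    data = Path(path).read_bytes()[:512]  # the header sits at the start of the file
    tokens = []
    pos = 0
    while len(tokens) < 3 and pos < len(data):
        byte = data[pos:pos + 1]
        if byte.isspace():
            pos += 1
        elif byte == b"#":
            # Header comments run to the end of the line.
            end = data.find(b"\n", pos)
            pos = len(data) if end == -1 else end + 1
        else:
            start = pos
            while pos < len(data) and not data[pos:pos + 1].isspace():
                pos += 1
            tokens.append(data[start:pos])
    if len(tokens) < 3 or tokens[0] not in PNM_MAGICS:
        raise ValueError(f"{path}: not a recognizable PNM header")
    # int() raises ValueError if the size fields are not decimal numbers.
    return tokens[0].decode("ascii"), int(tokens[1]), int(tokens[2])
```

If this returns sensible dimensions for the copied file but unpaper still reports "wrong stream", the file itself is probably not the culprit.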
Did the patch, the pnm seems fine and opens (using sxiv). Unpaper indeed fails when passing it the file:
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
-------------------------------------------------------------------------------
Processing sheet #1: 000001_pp_deskew.pnm -> output.pnm
unpaper: error: unable to open file 000001_pp_deskew.pnm: wrong stream
Try 'man unpaper' for more information.
I'm not quite sure why it would be miscompiled, because I have just pulled the latest stable release for Debian buster from the official repos. Interesting. I can try compiling unpaper from source and check back!
So I went through and tried building unpaper from source
apt remove unpaper
wget https://www.flameeyes.com/files/unpaper-6.1.tar.xz
tar -xf unpaper-6.1.tar.xz
cd unpaper-6.1
apt install gcc make automake pkg-config libavformat-dev libavcodec-dev libavutil-dev
./configure
make
make install
ln -s /usr/local/bin/unpaper /usr/bin/unpaper
unpaper -v --dpi 150.16409 --layout none --mask-scan-size 100 --no-border-align --no-mask-center --no-grayfilter --no-blackfilter --no-deskew 000001_pp_deskew.pnm output.pnm
unpaper --version reports the expected 6.1, but even the unpaper built from source fails in the same way. At least we know the issue is possibly with unpaper.
Output from ./configure:
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... no
checking whether make supports nested variables... no
checking whether make supports nested variables... (cached) no
checking whether to enable maintainer-specific portions of Makefiles... yes
checking for style of include used by make... none
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... none
checking for gcc option to accept ISO C99... none needed
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking minix/config.h usability... no
checking minix/config.h presence... no
checking for minix/config.h... no
checking whether it is safe to define __EXTENSIONS__... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
checking for library containing sqrt... -lm
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for LIBAV... yes
checking for xsltproc... xsltproc
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: executing depfiles commands
Nothing too bad, maybe I'm missing something here?
While building, there are some interesting warnings about the codec, which is on the line you pointed out:
file.c:68:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
if (s->streams[0]->codec->codec_type != AVMEDIA_TYPE_VIDEO)
^~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
AVCodecContext *codec;
^~~~~
file.c:71:5: warning: ‘avcodec_copy_context’ is deprecated [-Wdeprecated-declarations]
ret = avcodec_copy_context(avctx, s->streams[0]->codec);
^~~
In file included from file.c:28:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:4178:5: note: declared here
int avcodec_copy_context(AVCodecContext *dest, const AVCodecContext *src);
^~~~~~~~~~~~~~~~~~~~
file.c:71:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
ret = avcodec_copy_context(avctx, s->streams[0]->codec);
^~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
AVCodecContext *codec;
^~~~~
file.c:96:5: warning: ‘avcodec_decode_video2’ is deprecated [-Wdeprecated-declarations]
ret = avcodec_decode_video2(avctx, frame, &got_frame, &pkt);
^~~
In file included from file.c:28:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:4771:5: note: declared here
int avcodec_decode_video2(AVCodecContext *avctx, AVFrame *picture,
^~~~~~~~~~~~~~~~~~~~~
file.c: In function ‘saveImage’:
file.c:159:5: warning: ‘filename’ is deprecated [-Wdeprecated-declarations]
snprintf(out_ctx->filename, sizeof(out_ctx->filename), "%s", filename);
^~~~~~~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:1417:10: note: declared here
char filename[1024];
^~~~~~~~
file.c:159:5: warning: ‘filename’ is deprecated [-Wdeprecated-declarations]
snprintf(out_ctx->filename, sizeof(out_ctx->filename), "%s", filename);
^~~~~~~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:1417:10: note: declared here
char filename[1024];
^~~~~~~~
file.c:196:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
codec_ctx = video_st->codec;
^~~~~~~~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
AVCodecContext *codec;
^~~~~
file.c:224:5: warning: ‘avcodec_encode_video2’ is deprecated [-Wdeprecated-declarations]
ret = avcodec_encode_video2(video_st->codec, &pkt, image, &got_packet);
^~~
In file included from file.c:28:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:5402:5: note: declared here
int avcodec_encode_video2(AVCodecContext *avctx, AVPacket *avpkt,
^~~~~~~~~~~~~~~~~~~~~
file.c:224:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
ret = avcodec_encode_video2(video_st->codec, &pkt, image, &got_packet);
^~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
AVCodecContext *codec;
^~~~~
file.c:217:5: warning: ignoring return value of ‘avformat_write_header’, declared with attribute warn_unused_result [-Wunused-result]
avformat_write_header(out_ctx, NULL);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CC unpaper-imageprocess.o
CC unpaper-parse.o
CC unpaper-tools.o
CC unpaper-unpaper.o
unpaper.c: In function ‘main’:
unpaper.c:947:5: warning: ‘avcodec_register_all’ is deprecated [-Wdeprecated-declarations]
avcodec_register_all();
^~~~~~~~~~~~~~~~~~~~
In file included from unpaper.c:31:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:4102:6: note: declared here
void avcodec_register_all(void);
^~~~~~~~~~~~~~~~~~~~
unpaper.c:948:5: warning: ‘av_register_all’ is deprecated [-Wdeprecated-declarations]
av_register_all();
^~~~~~~~~~~~~~~
In file included from unpaper.c:32:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:2043:6: note: declared here
void av_register_all(void);
^~~~~~~~~~~~~~~
CCLD unpaper
I think my libavformat library could be too new, given the deprecation warnings I'm getting while building unpaper (especially the warning: ‘codec’ is deprecated one).
I believe that the size of the image may be involved somehow. It seems like older libavformat is rejecting images that are not an appropriate power of 2, or something like that. The image produced by your file works out to 1241x1754. If I resize the image to 256x256 it works without issue on Debian buster. But simple changes like 1240x1754 do not work; I was not able to discern what libavformat expects. In any event, it seems like this is a libavformat bug in the version that shipped with Debian buster that has since been resolved.
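To narrow down which dimensions trigger the failure, synthetic images of the sizes mentioned above can be generated and passed to unpaper by hand. This is a rough sketch under assumptions: `write_blank_ppm` and `make_test_images` are hypothetical helpers, and a blank page may not reproduce the failure the same way the real scan does.

```python
from pathlib import Path

# Sizes taken from the comment above: one that worked, two that did not.
SIZES = [(256, 256), (1240, 1754), (1241, 1754)]


def write_blank_ppm(path, width, height, value=255):
    """Write a binary (P6) PPM filled with one gray level; return bytes written."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if not 0 <= value <= 255:
        raise ValueError("value must be in 0..255")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    pixels = bytes([value]) * (width * height * 3)  # 3 bytes (RGB) per pixel
    Path(path).write_bytes(header + pixels)
    return len(header) + len(pixels)


def make_test_images(directory):
    """Create one blank PPM per entry in SIZES and return the file paths."""
    paths = []
    for width, height in SIZES:
        path = Path(directory) / f"blank_{width}x{height}.ppm"
        write_blank_ppm(path, width, height)
        paths.append(path)
    return paths
```

Each generated file can then be given to the same unpaper command line used earlier in this thread to see which sizes are rejected.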
Has this issue: ffmpeg 4.1.6, libavformat 58.20.100 (observed by testing Debian buster docker image)
Issue fixed: ffmpeg 4.3.1, libavformat 58.45.100 (observed on macOS 10.14)
If you are able to upgrade to a newer Debian, that will likely resolve your issue.
Perhaps you could report this issue to unpaper's github? Some defensive coding against bad versions may be in order.
@Flameeyes pinging you to consider my findings here.
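One possible shape for such defensive coding, sketched in Python for illustration: compare the detected libavformat version against the two data points observed above. The names `classify_libavformat` and `parse_version` are hypothetical, and treating everything at or below 58.20.100 as affected and everything at or above 58.45.100 as fixed is an assumption, since only those two versions were actually tested.

```python
import re

# The only two versions observed in this thread.
KNOWN_BAD = (58, 20, 100)   # ffmpeg 4.1.6, Debian buster
KNOWN_GOOD = (58, 45, 100)  # ffmpeg 4.3.1, macOS 10.14


def parse_version(text):
    """Extract the first major.minor.micro triple; spaces after dots are tolerated."""
    match = re.search(r"(\d+)\.\s*(\d+)\.\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def classify_libavformat(text):
    """Classify a libavformat version string against the observed data points."""
    version = parse_version(text)
    if version <= KNOWN_BAD:
        return "likely-affected"  # assumption: older versions share the bug
    if version >= KNOWN_GOOD:
        return "likely-fixed"     # assumption: the fix has not regressed
    return "unknown"              # between the two tested versions
```

Versions between the two observations are deliberately reported as unknown rather than guessed.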
Aside:
These build warnings should IMO be fixed, but they are not indicative of trouble. The same warnings appear in the Debian buster build: https://buildd.debian.org/status/fetch.php?pkg=unpaper&arch=amd64&ver=6.1-2%2Bb2&stamp=1532015166&raw=0
Debian's packaging does not run unpaper's test suite. If it builds, they accept it. So unfortunately unpaper's inclusion in Debian is no guarantee of it working correctly.
Alright thank you, seems to be the problem. I'm currently using a miniconda3 docker image, which has Debian Buster, but I should be able to switch to some other image and install python. I'll forward this issue to the unpaper repo, see if there is something that can be done to maybe alleviate or at least soften the problem and I'll also note the warnings for the buster build! Thank you so much for your help, I'll report back when I can confirm that using a newer image fixes the issue!
So, just to make sure, I have tested this whole thing with an alpine linux docker image, using the unpaper provided by the world repo.
docker run --rm -it frolvlad/alpine-miniconda3 /bin/ash
/ # cat /etc/alpine-release
3.12.0
/ # apk add unpaper
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/49) Installing sdl2 (2.0.12-r1)
....
(49/49) Installing unpaper (6.1-r1)
Executing busybox-1.31.1-r16.trigger
Executing glibc-bin-2.32-r0.trigger
/usr/glibc-compat/sbin/ldconfig: /usr/glibc-compat/lib/ld-linux-x86-64.so.2 is not a symbolic link
OK: 75 MiB in 66 packages
/ # unpaper --version
6.1
/ # cd root
~ # ls
000001_pp_deskew.pnm
~ # unpaper -v --dpi 150.16409 --layout none --mask-scan-size 100 --no-border-align --no-mask-center --no-grayfilter --no-blackfilter --no-deskew 000001_pp_deskew.pnm output.pnm
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
-------------------------------------------------------------------------------
Processing sheet #1: 000001_pp_deskew.pnm -> output.pnm
[ppm_pipe @ 0x56079f691a00] Stream #0: not enough frames to estimate rate; consider increasing probesize
input-file for sheet 1: 000001_pp_deskew.pnm
output-file for sheet 1: output.pnm
sheet size: 1241x1754
...
noise-filter ... deleted 82 clusters.
blur-filter... deleted 0 pixels.
writing output.
[image2 @ 0x56079f692800] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x56079f692800] Encoder did not produce proper pts, making some up.
The output file is fine and openable in an image viewer such as sxiv (although it looks the same, but that doesn't matter, since there is not much to fix in this page of the document).
Closing since this appears to be a third-party issue in older versions of an unpaper dependency.
Describe the bug
When running ocrmypdf on a specific document with "--remove-background", unpaper fails to process the PDF. This may be due to there being no text or almost no text on this specific page, although other pages have text.
To Reproduce
This outputs:
Example file
I have attached a PDF with the problematic page to this issue. Due to privacy restrictions I cannot provide the full PDF, but I have tested ocrmypdf with this specific PDF and the same issue still persists, even with the single page, as can be seen in the output. Here is the file problem_page.pdf.
Expected behavior
The PDF should get preprocessed and OCRed like other PDFs, but unpaper breaks when using the "--remove-background" option for ocrmypdf. When removing this option the PDF gets OCRed as expected.
System
8.0.1+dfsg
docker build with apt-get ocrmypdf