ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

unpaper: error: unable to open file: wrong stream #665

Closed hurbeana closed 3 years ago

hurbeana commented 3 years ago

Describe the bug

When running ocrmypdf on a specific document with "--remove-background", unpaper fails to process the PDF. This may be because there is little or no text on this specific page, although other pages have text.

To Reproduce

/usr/bin/ocrmypdf -v1 -k --pdf-renderer hocr -l custom_lang -r --remove-background -d -c -f /tmp/testfiles/slimy-olivine-tarantula/problem_page.pdf /tmp/testfiles/slimy-olivine-tarantula/problem_page.pdf.gen.pdf

This outputs:

 DEBUG - ocrmypdf 8.0.1+dfsg
  DEBUG - tesseract 4.0.0
  DEBUG - qpdf 8.4.0
  DEBUG - gs 9.27
WARNING - The 'hocr' PDF renderer is known to cause problems with one or more of the languages in your document.  Use --pdf-renderer auto (the default) to avoid this issue.
  DEBUG - os.symlink(/tmp/testfiles/slimy-olivine-tarantula/problem_page.pdf, /tmp/com.github.ocrmypdf.ydvwnl0r/origin)

________________________________________
Tasks which will be run:

Task enters queue = 'ocrmypdf._pipeline.triage' 
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.ydvwnl0r/origin, /tmp/com.github.ocrmypdf.ydvwnl0r/origin.pdf)
Completed Task = 'ocrmypdf._pipeline.triage' 
Task enters queue = 'ocrmypdf._pipeline.repair_and_parse_pdf' 
  DEBUG - <PdfInfo('...'), page count=1>
Completed Task = 'ocrmypdf._pipeline.repair_and_parse_pdf' 
Task enters queue = 'ocrmypdf._pipeline.marker_pages' 
Task enters queue = 'ocrmypdf._pipeline.generate_postscript_stub' 
Completed Task = 'ocrmypdf._pipeline.marker_pages' 
Task enters queue = 'ocrmypdf._pipeline.ocr_or_skip' 
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.ydvwnl0r/000001.marker.pdf, /tmp/com.github.ocrmypdf.ydvwnl0r/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf._pipeline.generate_postscript_stub' 
Completed Task = 'ocrmypdf._pipeline.ocr_or_skip' 
Task enters queue = 'ocrmypdf._pipeline.rasterize_preview' 
  DEBUG - ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150x150', '-o', '/tmp/tmptqmmf8r9', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.ydvwnl0r/000001.ocr.page.pdf']
  DEBUG - 
  DEBUG - Ghostscript: resize output image (1240, 1752) -> (1241, 1754)
Completed Task = 'ocrmypdf._pipeline.rasterize_preview' 
Task enters queue = 'ocrmypdf._pipeline.orient_page' 
   INFO -    1: page is facing ⇩, confidence 1.41 - confidence too low to rotate
  DEBUG - os.symlink(/tmp/com.github.ocrmypdf.ydvwnl0r/000001.ocr.page.pdf, /tmp/com.github.ocrmypdf.ydvwnl0r/000001.ocr.oriented.pdf)
Completed Task = 'ocrmypdf._pipeline.orient_page' 
Task enters queue = 'ocrmypdf._pipeline.rasterize_with_ghostscript' 
  DEBUG - Rasterize 000001.ocr.oriented.pdf with png16m
  DEBUG - ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r150x150', '-o', '/tmp/tmprsvxdzxy', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.ydvwnl0r/000001.ocr.oriented.pdf']
  DEBUG - 
  DEBUG - Ghostscript: resize output image (1240, 1752) -> (1241, 1754)
  DEBUG - Rotating output by 0
Completed Task = 'ocrmypdf._pipeline.rasterize_with_ghostscript' 
Task enters queue = 'ocrmypdf._pipeline.preprocess_remove_background' 
Completed Task = 'ocrmypdf._pipeline.preprocess_remove_background' 
Task enters queue = 'ocrmypdf._pipeline.preprocess_deskew' 
Completed Task = 'ocrmypdf._pipeline.preprocess_deskew' 
Task enters queue = 'ocrmypdf._pipeline.preprocess_clean' 
  DEBUG - unpaper: error: unable to open file /tmp/tmph059pj8d.ppm: wrong stream
Try 'man unpaper' for more information.
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: /tmp/tmph059pj8d.ppm -> /tmp/tmpzkq8djaj.ppm

  DEBUG - 

Original exception:

    Exception #1
      'builtins.FileNotFoundError([Errno 2] No such file or directory: '/tmp/tmpzkq8djaj.ppm')' raised in ...
       Task = def ocrmypdf._pipeline.preprocess_clean(...):
       Job  = [.../000001.pp-deskew.png -> .../000001.pp-clean.png, <LoggingProxy>, <ocrmypdf._jobcontext.JobContext>]

    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/ocrmypdf/exec/unpaper.py", line 82, in run
        raise e from e
      File "/usr/lib/python3/dist-packages/ocrmypdf/exec/unpaper.py", line 78, in run
        args_unpaper, close_fds=True, universal_newlines=True, stderr=STDOUT
      File "/usr/lib/python3.7/subprocess.py", line 395, in check_output
        **kwargs).stdout
      File "/usr/lib/python3.7/subprocess.py", line 487, in run
        output=stdout, stderr=stderr)
    subprocess.CalledProcessError: Command '['unpaper', '-v', '--dpi', '150.16409036860878', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmph059pj8d.ppm', '/tmp/tmpzkq8djaj.ppm']' returned non-zero exit status 1.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
        register_cleanup, touch_files_only)
      File "/usr/lib/python3/dist-packages/ruffus/task.py", line 544, in job_wrapper_io_files
        ret_val = user_defined_work_func(*params)
      File "/usr/lib/python3/dist-packages/ocrmypdf/_pipeline.py", line 569, in preprocess_clean
        unpaper.clean(input_file, output_file, dpi, log)
      File "/usr/lib/python3/dist-packages/ocrmypdf/exec/unpaper.py", line 104, in clean
        '--no-deskew',  # don't deskew
      File "/usr/lib/python3/dist-packages/ocrmypdf/exec/unpaper.py", line 86, in run
        Image.open(output_pnm.name).save(output_file, dpi=(dpi, dpi))
      File "/usr/lib/python3.7/tempfile.py", line 639, in __exit__
        self.close()
      File "/usr/lib/python3.7/tempfile.py", line 646, in close
        self._closer.close()
      File "/usr/lib/python3.7/tempfile.py", line 583, in close
        unlink(self.name)
    FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpzkq8djaj.ppm'

Example file

I have attached a PDF containing the problematic page to this issue. Due to privacy restrictions I cannot provide the full PDF, but I have tested ocrmypdf with this single page on its own and the same issue persists, as can be seen in the output. Here is the file: problem_page.pdf.

Expected behavior

The PDF should be preprocessed and OCRed like other PDFs, but unpaper breaks when the "--remove-background" option is passed to ocrmypdf. Without this option, the PDF is OCRed as expected.
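As a stopgap, the observation that the run succeeds once "--remove-background" is dropped could be automated with a small wrapper. This is only a sketch: `drop_flag` and `run_with_fallback` are hypothetical helper names, not part of ocrmypdf, and the wrapper assumes the flag is the only thing that needs to change between attempts.

```python
import subprocess


def drop_flag(args, flag):
    """Return a copy of args with every occurrence of a bare flag removed."""
    return [arg for arg in args if arg != flag]


def run_with_fallback(args, flag="--remove-background", runner=subprocess.run):
    """Run the command; if it exits non-zero, retry once without `flag`.

    Returns the argument list that finally succeeded. `runner` is
    injectable so the logic can be exercised without invoking ocrmypdf.
    """
    try:
        runner(args, check=True)
        return list(args)
    except subprocess.CalledProcessError:
        fallback = drop_flag(args, flag)
        # A second failure propagates to the caller unchanged.
        runner(fallback, check=True)
        return fallback
```

Calling `run_with_fallback(["ocrmypdf", "--remove-background", "-c", "in.pdf", "out.pdf"])` would retry without the flag on failure; the file names here are placeholders.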

System

jbarlow83 commented 3 years ago

This works for me with the latest version of ocrmypdf, so please try the latest version.

hurbeana commented 3 years ago

Sorry for responding so late.

I have now retried everything by installing the latest version of ocrmypdf (11.3.2.post9+g4fc7d6d9) via pip from the repo and the issue still persists. Here is the output:

ocrmypdf 11.3.2.post9+g4fc7d6d9
Running: ['tesseract', '--list-langs']
The 'hocr' PDF renderer is known to cause problems with one or more of the languages in your document.  Use --pdf-renderer auto (the default) to avoid this issue.
Running: ['unpaper', '--version']
Found unpaper 6.1
Running: ['tesseract', '--version']
Found tesseract 4.0.0
Running: ['gs', '--version']
Found gs 9.27
pikepdf mmap disabled
os.symlink(/data/problem_page.pdf, /tmp/com.github.ocrmypdf.k4i2m2hx/origin)
os.symlink(/tmp/com.github.ocrmypdf.k4i2m2hx/origin, /tmp/com.github.ocrmypdf.k4i2m2hx/origin.pdf)
pikepdf mmap disabled                                                                                                                      
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 230.05page/s]
Using Tesseract OpenMP thread limit 1
pikepdf mmap disabled                                                                                                                      
    1 page already has text! - rasterizing text and running OCR anyway                                                                     
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.164090x150.164090', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.k4i2m2hx/origin.pdf']
    1 Rotating output by 0                                                                                                                 
    1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/com.github.ocrmypdf.k4i2m2hx/000001_rasterize_preview.jpg', 'stdout']        
    1 page is facing ⇩, confidence 1.87 - confidence too low to rotate                                                                     
    1 Rasterize with png16m, rotation 0                                                                                                    
    1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r150.164090x150.164090', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.k4i2m2hx/origin.pdf']
    1 Rotating output by 0                                                                                                                 
    1 Running: ['unpaper', '-v', '--dpi', '150.16409', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpkq9q__px/input.pnm', '/tmp/tmpkq9q__px/output.ppm']
    1 stdout/stderr = unpaper: error: unable to open file /tmp/tmpkq9q__px/input.pnm: wrong stream                                         
Try 'man unpaper' for more information.
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: /tmp/tmpkq9q__px/input.pnm -> /tmp/tmpkq9q__px/output.ppm

OCR:   0%|                                                                                                     | 0.0/1.0 [00:02<?, ?page/s]
An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 189, in exec_page_sync
    ocr_image, preprocess_out = make_intermediate_images(
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 158, in make_intermediate_images
    ocr_image = preprocess(
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 105, in preprocess
    image = preprocess_clean(image, page_context)
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_pipeline.py", line 478, in preprocess_clean
    unpaper.clean(input_file, output_file, dpi.x, page_context.options.unpaper_args)
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_exec/unpaper.py", line 123, in clean
    run(input_file, output_file, dpi, unpaper_args)
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_exec/unpaper.py", line 82, in run
    external_run(
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/subprocess.py", line 68, in run
    proc = subprocess_run(args, env=env, **kwargs)
  File "/opt/conda/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['unpaper', '-v', '--dpi', '150.16409', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpkq9q__px/input.pnm', '/tmp/tmpkq9q__px/output.ppm']' returned non-zero exit status 1.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
    exec_concurrent(context)
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
    exec_progress_pool(
  File "/opt/conda/lib/python3.8/site-packages/ocrmypdf/_concurrent.py", line 112, in exec_progress_pool
    result = results.next()
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 865, in next
    raise value
subprocess.CalledProcessError: Command '['unpaper', '-v', '--dpi', '150.16409', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpkq9q__px/input.pnm', '/tmp/tmpkq9q__px/output.ppm']' returned non-zero exit status 1.
Temporary working files retained at:
/tmp/com.github.ocrmypdf.k4i2m2hx

There seems to be some issue with calling unpaper. I have also tried building unpaper from source, to no avail.

jbarlow83 commented 3 years ago

Can you provide a test file?

hurbeana commented 3 years ago

There is one in the issue under "Example file". The problem_page.pdf.

jbarlow83 commented 3 years ago

Try this patch

diff --git a/src/ocrmypdf/_exec/unpaper.py b/src/ocrmypdf/_exec/unpaper.py
index e17ebb1..3619229 100644
--- a/src/ocrmypdf/_exec/unpaper.py
+++ b/src/ocrmypdf/_exec/unpaper.py
@@ -79,6 +79,9 @@ def run(input_file, output_file, dpi, mode_args):
         # This should ensure that a user cannot clobber some other file with
         # their unpaper arguments (whether intentionally or otherwise)
         args_unpaper.extend([os.fspath(input_pnm), os.fspath(output_pnm)])
+        from shutil import copy
+
+        copy(input_pnm, input_file.with_suffix(input_pnm.suffix))
         external_run(
             args_unpaper,
             close_fds=True,

And check if the .pnm file is readable in an image editor. It should be the only .pnm file.
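If no image editor is at hand, a rough readability check is to parse the file's header. The sketch below assumes the dumped file is a binary PGM or PPM (magic number P5 or P6); `read_pnm_header` is a hypothetical helper, not an ocrmypdf or unpaper function, and it only validates the header, not the pixel data.

```python
def read_pnm_header(data: bytes):
    """Parse the header of a binary PGM/PPM image held in memory.

    Returns (magic, width, height, maxval). Raises ValueError if the
    header is truncated, malformed, or not P5/P6.
    """
    tokens = []
    pos = 0
    while len(tokens) < 4:
        # Skip whitespace between header tokens.
        while pos < len(data) and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= len(data):
            raise ValueError("truncated PNM header")
        if data[pos:pos + 1] == b"#":
            # A '#' starts a comment that runs to the end of the line.
            end = data.find(b"\n", pos)
            if end == -1:
                raise ValueError("unterminated comment in PNM header")
            pos = end + 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos].decode("ascii", errors="replace"))
    magic, width, height, maxval = tokens
    if magic not in ("P5", "P6"):
        raise ValueError("not a binary PGM/PPM file: %r" % magic)
    # int() raises ValueError on non-numeric fields.
    return magic, int(width), int(height), int(maxval)
```

In practice the first few hundred bytes of the file are enough, for example `read_pnm_header(open(path, "rb").read(512))` with `path` pointing at the copied .pnm.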

Then try passing it to unpaper

unpaper -v --dpi 150.16409 --layout none --mask-scan-size 100 --no-border-align --no-mask-center --no-grayfilter --no-blackfilter --no-deskew file_from_above.pnm output.pnm

Can unpaper process the file?

Based on when "wrong stream" can occur from unpaper, it's likely your unpaper is miscompiled. https://github.com/unpaper/unpaper/blob/a88ed4e47d84a4831ad1828b89e98d289676ecfc/file.c#L56
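The manual test above can also be scripted, which makes it easier to repeat across machines or container images. This is a minimal sketch: `try_unpaper` and `classify_unpaper_output` are hypothetical names, the flags are copied from the command shown in this thread, and matching on the string "wrong stream" is an assumption based on the error text observed here.

```python
import subprocess


def classify_unpaper_output(returncode: int, output: str) -> str:
    """Map an unpaper exit status and its combined output to a label."""
    if returncode == 0:
        return "ok"
    if "wrong stream" in output:
        return "wrong-stream"
    return "other-failure"


def try_unpaper(input_pnm: str, output_pnm: str) -> str:
    """Run unpaper with the arguments used in this issue and classify the result."""
    args = [
        "unpaper", "-v", "--dpi", "150.16409", "--layout", "none",
        "--mask-scan-size", "100", "--no-border-align", "--no-mask-center",
        "--no-grayfilter", "--no-blackfilter", "--no-deskew",
        input_pnm, output_pnm,
    ]
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # unpaper's error text may land on either stream
        universal_newlines=True,
    )
    return classify_unpaper_output(proc.returncode, proc.stdout)
```

A result of "wrong-stream" from `try_unpaper("file_from_above.pnm", "output.pnm")` would point at unpaper or its libraries rather than at ocrmypdf.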

hurbeana commented 3 years ago

I applied the patch; the .pnm seems fine and opens (using sxiv). unpaper indeed fails when passed the file:

unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: 000001_pp_deskew.pnm -> output.pnm
unpaper: error: unable to open file 000001_pp_deskew.pnm: wrong stream
Try 'man unpaper' for more information.

I'm not quite sure why it would be miscompiled, because I just pulled the latest stable release for Debian buster from the official repos. Interesting. I can try compiling unpaper from source and check back!

hurbeana commented 3 years ago

So I went through and tried building unpaper from source

apt remove unpaper
wget https://www.flameeyes.com/files/unpaper-6.1.tar.xz
tar -xf unpaper-6.1.tar.xz
cd unpaper-6.1
apt install gcc make automake pkg-config libavformat-dev libavcodec-dev libavutil-dev
./configure
make
make install
ln -s /usr/local/bin/unpaper /usr/bin/unpaper
unpaper -v --dpi 150.16409 --layout none --mask-scan-size 100 --no-border-align --no-mask-center --no-grayfilter --no-blackfilter --no-deskew 000001_pp_deskew.pnm output.pnm

unpaper --version reports the expected 6.1, but even unpaper built from source doesn't work. At least we know the issue is possibly with unpaper.

Output from ./configure:

checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... no
checking for mawk... mawk
checking whether make sets $(MAKE)... no
checking whether make supports nested variables... no
checking whether make supports nested variables... (cached) no
checking whether to enable maintainer-specific portions of Makefiles... yes
checking for style of include used by make... none
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking dependency style of gcc... none
checking for gcc option to accept ISO C99... none needed
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking minix/config.h usability... no
checking minix/config.h presence... no
checking for minix/config.h... no
checking whether it is safe to define __EXTENSIONS__... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
checking for library containing sqrt... -lm
checking for pkg-config... /usr/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for LIBAV... yes
checking for xsltproc... xsltproc
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: executing depfiles commands

Nothing too bad; maybe I'm missing something here?

While building, there are some interesting warnings about the codec, which is on the line you pointed out:

file.c:68:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
     if (s->streams[0]->codec->codec_type != AVMEDIA_TYPE_VIDEO)
     ^~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
     AVCodecContext *codec;
                     ^~~~~
file.c:71:5: warning: ‘avcodec_copy_context’ is deprecated [-Wdeprecated-declarations]
     ret = avcodec_copy_context(avctx, s->streams[0]->codec);
     ^~~
In file included from file.c:28:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:4178:5: note: declared here
 int avcodec_copy_context(AVCodecContext *dest, const AVCodecContext *src);
     ^~~~~~~~~~~~~~~~~~~~
file.c:71:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
     ret = avcodec_copy_context(avctx, s->streams[0]->codec);
     ^~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
     AVCodecContext *codec;
                     ^~~~~
file.c:96:5: warning: ‘avcodec_decode_video2’ is deprecated [-Wdeprecated-declarations]
     ret = avcodec_decode_video2(avctx, frame, &got_frame, &pkt);
     ^~~
In file included from file.c:28:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:4771:5: note: declared here
 int avcodec_decode_video2(AVCodecContext *avctx, AVFrame *picture,
     ^~~~~~~~~~~~~~~~~~~~~
file.c: In function ‘saveImage’:
file.c:159:5: warning: ‘filename’ is deprecated [-Wdeprecated-declarations]
     snprintf(out_ctx->filename, sizeof(out_ctx->filename), "%s", filename);
     ^~~~~~~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:1417:10: note: declared here
     char filename[1024];
          ^~~~~~~~
file.c:159:5: warning: ‘filename’ is deprecated [-Wdeprecated-declarations]
     snprintf(out_ctx->filename, sizeof(out_ctx->filename), "%s", filename);
     ^~~~~~~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:1417:10: note: declared here
     char filename[1024];
          ^~~~~~~~
file.c:196:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
     codec_ctx = video_st->codec;
     ^~~~~~~~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
     AVCodecContext *codec;
                     ^~~~~
file.c:224:5: warning: ‘avcodec_encode_video2’ is deprecated [-Wdeprecated-declarations]
     ret = avcodec_encode_video2(video_st->codec, &pkt, image, &got_packet);
     ^~~
In file included from file.c:28:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:5402:5: note: declared here
 int avcodec_encode_video2(AVCodecContext *avctx, AVPacket *avpkt,
     ^~~~~~~~~~~~~~~~~~~~~
file.c:224:5: warning: ‘codec’ is deprecated [-Wdeprecated-declarations]
     ret = avcodec_encode_video2(video_st->codec, &pkt, image, &got_packet);
     ^~~
In file included from file.c:29:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:878:21: note: declared here
     AVCodecContext *codec;
                     ^~~~~
file.c:217:5: warning: ignoring return value of ‘avformat_write_header’, declared with attribute warn_unused_result [-Wunused-result]
     avformat_write_header(out_ctx, NULL);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  CC       unpaper-imageprocess.o
  CC       unpaper-parse.o
  CC       unpaper-tools.o
  CC       unpaper-unpaper.o
unpaper.c: In function ‘main’:
unpaper.c:947:5: warning: ‘avcodec_register_all’ is deprecated [-Wdeprecated-declarations]
     avcodec_register_all();
     ^~~~~~~~~~~~~~~~~~~~
In file included from unpaper.c:31:
/usr/include/x86_64-linux-gnu/libavcodec/avcodec.h:4102:6: note: declared here
 void avcodec_register_all(void);
      ^~~~~~~~~~~~~~~~~~~~
unpaper.c:948:5: warning: ‘av_register_all’ is deprecated [-Wdeprecated-declarations]
     av_register_all();
     ^~~~~~~~~~~~~~~
In file included from unpaper.c:32:
/usr/include/x86_64-linux-gnu/libavformat/avformat.h:2043:6: note: declared here
 void av_register_all(void);
      ^~~~~~~~~~~~~~~
  CCLD     unpaper

Could it be that my libavformat library is too new, given the deprecation warnings (especially the "warning: ‘codec’ is deprecated" one) I'm getting while building unpaper?

jbarlow83 commented 3 years ago

I believe that the size of the image may be involved somehow. It seems like older libavformat rejects images whose dimensions are not an appropriate power of 2, or something like that. The image produced by your file works out to 1241x1754. If I resize the image to 256x256, it works without issue on Debian buster. But simple changes like 1240x1754 do not work; I was not able to discern what libavformat expects. In any event, it seems like this is a libavformat bug in the version that shipped with Debian buster that has since been resolved.

Has this issue: ffmpeg 4.1.6, libavformat 58.20.100 (observed by testing Debian buster docker image)

Issue fixed: ffmpeg 4.3.1, libavformat 58.45.100 (observed on macOS 10.14)

If you are able to upgrade to a newer Debian, that will likely resolve your issue.

Perhaps you could report this issue to unpaper's github? Some defensive coding against bad versions may be in order.
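One possible shape for such defensive coding is a version gate built from the two data points in this thread. This is a sketch under stated assumptions: `libavformat_status` is a hypothetical helper, only 58.20.100 (fails) and 58.45.100 (works) were actually observed, treating everything at or below 58.20.100 as suspect is an assumption, and versions in between are reported as unknown because nobody tested them here.

```python
def parse_version(text: str):
    """Turn a dotted version such as '58.20.100' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# The only two versions observed in this thread.
KNOWN_BAD = parse_version("58.20.100")   # Debian buster, ffmpeg 4.1.6
KNOWN_GOOD = parse_version("58.45.100")  # macOS 10.14, ffmpeg 4.3.1


def libavformat_status(version: str) -> str:
    """Classify a libavformat version against the observations above."""
    parsed = parse_version(version)
    if parsed <= KNOWN_BAD:
        # Assumption: anything at or below the failing version is also affected.
        return "suspect"
    if parsed >= KNOWN_GOOD:
        return "likely-ok"
    # Untested range between the two observations.
    return "unknown"
```

A caller could use the result to warn before invoking unpaper or to skip the cleaning step, rather than failing partway through the pipeline.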

@Flameeyes pinging you to consider my findings here.


Aside:

These build warnings IMO should be fixed, but they are not indicative of trouble. The same warnings appear in the Debian buster build: https://buildd.debian.org/status/fetch.php?pkg=unpaper&arch=amd64&ver=6.1-2%2Bb2&stamp=1532015166&raw=0

Debian's packaging does not run unpaper's test suite. If it builds, they accept it. So unfortunately unpaper's inclusion in Debian is no guarantee of it working correctly.

hurbeana commented 3 years ago

Alright, thank you; that seems to be the problem. I'm currently using a miniconda3 docker image, which is based on Debian buster, but I should be able to switch to some other image and install Python. I'll forward this issue to the unpaper repo to see if something can be done to alleviate or at least soften the problem, and I'll also note the warnings for the buster build. Thank you so much for your help! I'll report back when I can confirm that using a newer image fixes the issue.

hurbeana commented 3 years ago

So, just to make sure, I have tested this whole thing with an alpine linux docker image, using the unpaper provided by the world repo.

docker run --rm -it frolvlad/alpine-miniconda3 /bin/ash
/ # cat /etc/alpine-release
3.12.0
/ # apk add unpaper
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/49) Installing sdl2 (2.0.12-r1)
....
(49/49) Installing unpaper (6.1-r1)
Executing busybox-1.31.1-r16.trigger
Executing glibc-bin-2.32-r0.trigger
/usr/glibc-compat/sbin/ldconfig: /usr/glibc-compat/lib/ld-linux-x86-64.so.2 is not a symbolic link

OK: 75 MiB in 66 packages
/ # unpaper --version
6.1
/ # cd root
~ # ls
000001_pp_deskew.pnm
~ # unpaper -v --dpi 150.16409 --layout none --mask-scan-size 100 --no-border-align --no-mask-center --no-grayfilter --no-blackfilter --no-deskew 000001_pp_deskew.pnm output.pnm
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

-------------------------------------------------------------------------------
Processing sheet #1: 000001_pp_deskew.pnm -> output.pnm
[ppm_pipe @ 0x56079f691a00] Stream #0: not enough frames to estimate rate; consider increasing probesize
input-file for sheet 1: 000001_pp_deskew.pnm
output-file for sheet 1: output.pnm
sheet size: 1241x1754
...
noise-filter ... deleted 82 clusters.
blur-filter... deleted 0 pixels.
writing output.
[image2 @ 0x56079f692800] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x56079f692800] Encoder did not produce proper pts, making some up.

The output file is fine and openable in an image viewer such as sxiv (although it looks the same, but that doesn't matter, since there is not much to fix in this page of the document).

jbarlow83 commented 3 years ago

Closing since this appears to be a third-party bug in older versions of an unpaper dependency.