ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Increase max image size? or avoid DecompressionBombWarning #230

Open trueimage opened 6 years ago

trueimage commented 6 years ago

I'm trying to process a file that has a page measuring 102 inches x 32 inches. Every page in this file already has a text layer, so they should all be skipped. Here is my command line:

C:\Users\user>docker run -v /c/Users/user/ocr/bad:/home/docker ocrmypdf --deskew --rotate-pages --clean --skip-text bad4.pdf bad4ocr.pdf

Here is the output, along with the error I get:

   INFO -    4: page already has text! – skipping all processing on this page
   INFO -    1: page already has text! – skipping all processing on this page
   INFO -    2: page already has text! – skipping all processing on this page
   INFO -    3: page already has text! – skipping all processing on this page
   INFO -    1: page is facing ⇧, confidence 3.48 - no change
   INFO -    2: page is facing ⇧, confidence 17.62 - rotation appears correct
   INFO -    3: page is facing ⇧, confidence 13.72 - no change
/usr/lib/python3/dist-packages/PIL/Image.py:2371: DecompressionBombWarning: Image size (1600000000 pixels) exceeds limit of 128000000 pixels, could be decompression bomb DOS attack.
  DecompressionBombWarning)
  ERROR - Error occurred while running this command:
(Command '['tesseract', '-l', 'osd', '--psm', '0', '/tmp/com.github.ocrmypdf.hchlgqxw/000004.skip.preview.jpg', 'stdout']' died with <Signals.SIGKILL: 9>.)

jbarlow83 commented 6 years ago

My guess is that the image is wider than 32k pixels. Leptonica (Tesseract's image library) does not support JPEGs wider than 32k pixels, since they are not standard. Please check whether this is the case with a program like pdfimages -list, which reports on the images embedded in a PDF.

If so, the workaround would be to use some other program to convert the oversized image to JPEG2000.
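
The check suggested above can be scripted. The following is a minimal sketch, not part of OCRmyPDF: it assumes the usual column layout of pdfimages -list (page, num, type, width, height, ...), treats "32k" as a threshold of 32,000 pixels, and uses made-up sample rows. The helper names find_wide_images and list_images are hypothetical.

```python
import subprocess

# Assumption: "32k pixels" is read loosely as 32,000; adjust if the real limit differs.
MAX_WIDTH = 32_000

# Made-up sample in the layout that `pdfimages -list` prints (illustrative only).
SAMPLE_LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2550  3300  gray    1   8  jpeg   no         7  0   300   300  412K 5.0%
   4     1 image   71400 22400  rgb     3   8  jpeg   no        21  0   700   700 58.2M 1.2%
"""


def find_wide_images(listing, max_width=MAX_WIDTH):
    """Return (page, num, width, height) for every image wider than max_width."""
    wide = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; this skips the header and the dashed rule.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        page, num = int(fields[0]), int(fields[1])
        width, height = int(fields[3]), int(fields[4])
        if width > max_width:
            wide.append((page, num, width, height))
    return wide


def list_images(pdf_path):
    """Run `pdfimages -list` (from poppler-utils) on a PDF and return its text output."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


wide_images = find_wide_images(SAMPLE_LISTING)
```

On a real file, find_wide_images(list_images("bad4.pdf")) would report any image that exceeds the assumed width limit, along with the page it appears on.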

0x326 commented 4 years ago

I get a similar issue with this publication (PDF). The existing OCR in the PDF is messed up, so I'm trying to redo it. Here's my command line:

$ docker run -v "$(pwd):/pwd" -it jbarlow83/ocrmypdf /pwd/in.pdf /pwd/out.pdf --force-ocr
Scan: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00,  3.75page/s]

   INFO - Start processing 2 pages concurrently
OCR:   0%|                                                                                                                                                      | 0.0/20.0 [00:00<?, ?page/s]

   INFO -    2: page already has text! - rasterizing text and running OCR anyway
OCR:   0%|                                                                                                                                                      | 0.0/20.0 [00:00<?, ?page/s]

   INFO -    1: page already has text! - rasterizing text and running OCR anyway
OCR:   0%|                                                                                                                                                      | 0.0/20.0 [00:00<?, ?page/s]

   INFO -    3: page already has text! - rasterizing text and running OCR anyway
OCR:   0%|                                                                                                                                                      | 0.0/20.0 [00:07<?, ?page/s]

  ERROR - An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_sync.py", line 102, in exec_page_sync
    remove_vectors=False,
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_pipeline.py", line 446, in rasterize
    filter_vector=remove_vectors,
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/exec/ghostscript.py", line 207, in rasterize_pdf
    with Image.open(BytesIO(p.stdout)) as im:
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2847, in open
    im = _open_core(fp, filename, prefix)
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2834, in _open_core
    _decompression_bomb_check(im.size)
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2759, in _decompression_bomb_check
    "could be decompression bomb DOS attack." % (pixels, 2 * MAX_IMAGE_PIXELS)
PIL.Image.DecompressionBombError: Image size (1744934400 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_sync.py", line 391, in run_pipeline
    exec_concurrent(context)
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_sync.py", line 281, in exec_concurrent
    page_result = results.next()
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
PIL.Image.DecompressionBombError: Image size (1744934400 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.

jbarlow83 commented 4 years ago

Use the parameter --max-image-mpixels, with a value in megapixels large enough to cover the page, to confirm that you really want an image of that size.

The real issue here, though, is that a small bit of color on a mostly black-and-white page causes the entire page to be 'promoted' to color. I don't have the technology to segment a page into colorspace regions and compress each one appropriately. If you eliminate that color image, the whole thing will compress much better.
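
To pick a value for --max-image-mpixels, it may help to work through the numbers in the traceback above. The sketch below is an illustration, not OCRmyPDF code. It mirrors the check visible in the traceback, where Pillow raises an error above 2 * MAX_IMAGE_PIXELS and, in current versions, only warns above MAX_IMAGE_PIXELS. The base limit of 128,000,000 is inferred from the logged limit of 256,000,000. That the option's value simply has to be at least the page's megapixel count is an assumption, so check the OCRmyPDF documentation for its exact meaning. The function names are hypothetical.

```python
import math

# Inferred from the log: the error limit of 256000000 equals 2 * MAX_IMAGE_PIXELS.
MAX_IMAGE_PIXELS = 128_000_000


def bomb_check(pixels, max_image_pixels=MAX_IMAGE_PIXELS):
    """Classify a pixel count the way Pillow's decompression bomb check does."""
    if pixels > 2 * max_image_pixels:
        return "error"    # Pillow raises DecompressionBombError
    if pixels > max_image_pixels:
        return "warning"  # Pillow emits DecompressionBombWarning
    return "ok"


def mpixels_needed(pixels):
    """Smallest whole number of megapixels that covers the image."""
    return math.ceil(pixels / 1_000_000)


page_pixels = 1_744_934_400  # pixel count reported in the traceback
status = bomb_check(page_pixels)
needed = mpixels_needed(page_pixels)
```

Under these assumptions the page in the traceback is classified as an error, and a value of at least 1745 would cover it.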