Open trueimage opened 6 years ago
My guess is that it's wider than 32k pixels, and Leptonica (Tesseract's image library) does not support JPEGs wider than 32k pixels since they are not standard. Please check whether this is the case with a program like pdfimages -list, which reports on the images embedded in a PDF.
If so, the workaround would be to use some other program to convert the oversized image to JPEG2000.
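The check suggested above could be scripted roughly as follows. This is only a sketch: it assumes the usual poppler column layout for pdfimages -list (page, num, type, width, height, ...), it treats "32k" as 32,000 pixels, and the function names are made up for illustration.

```python
import subprocess

# Assumption: "32k" from the comment above is taken as 32,000 pixels.
# Adjust this if the real Leptonica limit turns out to be different.
WIDTH_LIMIT = 32_000


def parse_pdfimages_list(text):
    """Return (page, width, height) for each image row of `pdfimages -list` output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric page number; the header and the
        # dashed separator line do not, so they are skipped here.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        # Assumed column order: page, num, type, width, height, ...
        if not (fields[3].isdigit() and fields[4].isdigit()):
            continue
        rows.append((int(fields[0]), int(fields[3]), int(fields[4])))
    return rows


def oversized_images(text, limit=WIDTH_LIMIT):
    """Return the rows whose width exceeds the limit."""
    return [row for row in parse_pdfimages_list(text) if row[1] > limit]


def list_pdf_images(pdf_path):
    """Run poppler's pdfimages on a PDF and return its listing as text."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling oversized_images(list_pdf_images("in.pdf")) would then list any page whose embedded image is wider than the assumed limit.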
I get a similar issue with this publication (PDF). The OCR is messed up in the PDF so I'm trying to redo it. Here's my command line:
$ docker run -v "$(pwd):/pwd" -it jbarlow83/ocrmypdf /pwd/in.pdf /pwd/out.pdf --force-ocr
Scan: 0%| | 0/20 [00:00<?, ?page/s]
Scan: 5%|███████▎ | 1/20 [00:00<00:03, 5.45page/s]
Scan: 10%|██████████████▌ | 2/20 [00:00<00:03, 5.40page/s]
Scan: 15%|█████████████████████▊ | 3/20 [00:00<00:03, 5.11page/s]
Scan: 20%|█████████████████████████████ | 4/20 [00:00<00:03, 4.65page/s]
Scan: 25%|████████████████████████████████████▎ | 5/20 [00:01<00:03, 4.46page/s]
Scan: 30%|███████████████████████████████████████████▌ | 6/20 [00:01<00:03, 3.61page/s]
Scan: 35%|██████████████████████████████████████████████████▊ | 7/20 [00:01<00:03, 3.50page/s]
Scan: 40%|██████████████████████████████████████████████████████████ | 8/20 [00:02<00:03, 3.55page/s]
Scan: 45%|█████████████████████████████████████████████████████████████████▎ | 9/20 [00:02<00:03, 3.66page/s]
Scan: 50%|████████████████████████████████████████████████████████████████████████ | 10/20 [00:02<00:02, 3.74page/s]
Scan: 55%|███████████████████████████████████████████████████████████████████████████████▏ | 11/20 [00:02<00:02, 3.79page/s]
Scan: 60%|██████████████████████████████████████████████████████████████████████████████████████▍ | 12/20 [00:03<00:02, 3.36page/s]
Scan: 65%|█████████████████████████████████████████████████████████████████████████████████████████████▌ | 13/20 [00:03<00:01, 3.63page/s]
Scan: 70%|████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 14/20 [00:03<00:01, 3.58page/s]
Scan: 75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 15/20 [00:04<00:01, 3.46page/s]
Scan: 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 16/20 [00:04<00:01, 3.24page/s]
Scan: 85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 17/20 [00:04<00:00, 3.45page/s]
Scan: 90%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 18/20 [00:04<00:00, 3.46page/s]
Scan: 95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 19/20 [00:05<00:00, 3.81page/s]
Scan: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00, 4.05page/s]
Scan: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:05<00:00, 3.75page/s]
INFO - Start processing 2 pages concurrently
OCR: 0%| | 0.0/20.0 [00:00<?, ?page/s]
INFO - 2: page already has text! - rasterizing text and running OCR anyway
OCR: 0%| | 0.0/20.0 [00:00<?, ?page/s]
INFO - 1: page already has text! - rasterizing text and running OCR anyway
OCR: 0%| | 0.0/20.0 [00:00<?, ?page/s]
INFO - 3: page already has text! - rasterizing text and running OCR anyway
OCR: 0%| | 0.0/20.0 [00:07<?, ?page/s]
OCR: 0%| | 0.0/20.0 [00:07<?, ?page/s]
ERROR - An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_sync.py", line 102, in exec_page_sync
    remove_vectors=False,
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_pipeline.py", line 446, in rasterize
    filter_vector=remove_vectors,
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/exec/ghostscript.py", line 207, in rasterize_pdf
    with Image.open(BytesIO(p.stdout)) as im:
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2847, in open
    im = _open_core(fp, filename, prefix)
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2834, in _open_core
    _decompression_bomb_check(im.size)
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2759, in _decompression_bomb_check
    "could be decompression bomb DOS attack." % (pixels, 2 * MAX_IMAGE_PIXELS)
PIL.Image.DecompressionBombError: Image size (1744934400 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_sync.py", line 391, in run_pipeline
    exec_concurrent(context)
  File "/usr/local/lib/python3.7/dist-packages/ocrmypdf/_sync.py", line 281, in exec_concurrent
    page_result = results.next()
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
PIL.Image.DecompressionBombError: Image size (1744934400 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
Use the parameter --max-image-mpixels to confirm that you really want an image of that size.
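To put numbers on the error above, here is a rough sketch of the arithmetic. The traceback shows Pillow raising the error once the pixel count exceeds 2 * MAX_IMAGE_PIXELS. The sketch assumes that --max-image-mpixels N sets MAX_IMAGE_PIXELS to N million pixels; that mapping is an assumption, although it is consistent with the 256000000 limit in the log if the value in effect was 128. The function name is hypothetical.

```python
MEGAPIXEL = 1_000_000


def min_max_image_mpixels(image_pixels, error_factor=2):
    """Smallest whole-megapixel setting that still admits an image of this size.

    Assumption: the tool sets MAX_IMAGE_PIXELS = setting * 1,000,000, and the
    error fires only when pixels > error_factor * MAX_IMAGE_PIXELS, as in the
    traceback's "2 * MAX_IMAGE_PIXELS".
    """
    # Ceiling division without floats: -(-a // b)
    needed_pixels = -(-image_pixels // error_factor)
    return -(-needed_pixels // MEGAPIXEL)


# Pixel count reported in the DecompressionBombError above.
page_pixels = 1_744_934_400

print(min_max_image_mpixels(page_pixels))      # setting needed to avoid the error
print(min_max_image_mpixels(page_pixels, 1))   # setting needed to stay under 1x the limit
```

Under those assumptions, a value of 873 or more would let the 1744934400-pixel page through, and 1745 would also keep it under the 1x threshold at which Pillow only issues a warning.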
The real issue here, though, is that there's a small bit of color on a mostly black-and-white page, which causes the entire page to be 'promoted' to color. I don't have the technology to segment a page into colorspace regions and apply appropriate compression to each one individually. If you eliminate that color image, the whole thing will compress much better.
I'm trying to process a file that has a page measuring 102 inches x 32 inches. Every page in this file already has a text layer, so they should all be skipped. Here is my command line:
C:\Users\user>docker run -v /c/Users/user/ocr/bad:/home/docker ocrmypdf --deskew --rotate-pages --clean --skip-text bad4.pdf bad4ocr.pdf
Here is the output, along with the error I get: