jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

DecompressionBombError #1428

Open rototom opened 2 years ago

rototom commented 2 years ago

With one specific document I get the following:

2021-03-24 IMG_20210324_151242.pdf: Error while consuming document 2021-03-24 IMG_20210324_151242.pdf: DecompressionBombError: Image size (491520000 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack. : Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 247, in parse
raise NoTextFoundException(
paperless_tesseract.parsers.NoTextFoundException: No text was found in the original document

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 276, in parse
ocrmypdf.ocr(**args)
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/api.py", line 340, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
exec_concurrent(context, executor)
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
executor(
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
self._execute(
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 132, in _execute
for result in results:
File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 870, in next
raise value
File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 184, in exec_page_sync
rasterize_preview_out = rasterize_preview(page_context.origin, page_context)
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 338, in rasterize_preview
page_context.plugin_manager.hook.rasterize_pdf_page(
File "/usr/local/lib/python3.9/site-packages/pluggy/hooks.py", line 286, in __call__
return self._hookexec(self, self.get_hookimpls(), kwargs)
File "/usr/local/lib/python3.9/site-packages/pluggy/manager.py", line 93, in _hookexec
return self._inner_hookexec(hook, methods, kwargs)
File "/usr/local/lib/python3.9/site-packages/pluggy/manager.py", line 84, in <lambda>
self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
File "/usr/local/lib/python3.9/site-packages/pluggy/callers.py", line 208, in _multicall
return outcome.get_result()
File "/usr/local/lib/python3.9/site-packages/pluggy/callers.py", line 80, in get_result
raise ex[1].with_traceback(ex[2])
File "/usr/local/lib/python3.9/site-packages/pluggy/callers.py", line 187, in _multicall
res = hook_impl.function(*args)
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 68, in rasterize_pdf_page
ghostscript.rasterize_pdf(
File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_exec/ghostscript.py", line 124, in rasterize_pdf
with Image.open(BytesIO(p.stdout)) as im:
File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 3009, in open
im = _open_core(fp, filename, prefix, formats)
File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 2996, in _open_core
_decompression_bomb_check(im.size)
File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 2905, in _decompression_bomb_check
raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (491520000 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/asgiref/sync.py", line 288, in main_wrap
raise exc_info[1]
File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
document_parser.parse(self.path, mime_type, self.filename)
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in parse
raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: DecompressionBombError: Image size (491520000 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/django_q/cluster.py", line 432, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 74, in consume_file
document = Consumer().try_consume_file(
File "/usr/src/paperless/src/documents/consumer.py", line 266, in try_consume_file
self._fail(
File "/usr/src/paperless/src/documents/consumer.py", line 70, in _fail
raise ConsumerError(f"{self.filename}: {log_message or message}")
documents.consumer.ConsumerError: 2021-03-24 IMG_20210324_151242.pdf: Error while consuming document 2021-03-24 IMG_20210324_151242.pdf: DecompressionBombError: Image size (491520000 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
CoconutMacaroon commented 2 years ago

It says what the issue is: DecompressionBombError: Image size (491520000 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack. It is warning you that the image is very large and could have been crafted to crash whatever opens it. See this Wikipedia article. If you're really sure that your image of a whopping 491520000 pixels is not malicious, see this Stack Overflow question for how to fix it.
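For reference, the check behind this message can be approximated in a few lines. This is a standalone sketch, not Pillow's actual code: `classify_image_size` is a made-up name, and it assumes Pillow's documented behaviour of warning above `Image.MAX_IMAGE_PIXELS` and raising `DecompressionBombError` above twice that value. The default of 128,000,000 is inferred from the 256,000,000 limit quoted in the traceback.

```python
def classify_image_size(width, height, max_image_pixels=128_000_000):
    """Hypothetical helper mirroring Pillow's decompression-bomb thresholds."""
    if max_image_pixels is None:
        return "ok"  # a limit of None disables the check entirely
    pixels = max(1, width) * max(1, height)
    if pixels > 2 * max_image_pixels:
        return "error"    # Pillow raises DecompressionBombError in this case
    if pixels > max_image_pixels:
        return "warning"  # Pillow only emits DecompressionBombWarning here
    return "ok"


# 19200 x 25600 is one factorisation of the 491520000 pixels in the report;
# the real image dimensions are not given in the traceback.
print(classify_image_size(19200, 25600))
```

The point of the sketch is that the error depends only on the pixel count of the rasterised page, not on the size of the PDF file on disk.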

rototom commented 2 years ago

It's just a perfectly normal one-page scan. I do these scans on a daily basis and have never had this problem...

dvonessen commented 2 years ago

Hi! I am also encountering this problem with several PDF files that were taken with an Android app called Scanbot. The files are relatively small, around 4-5 MB, but Paperless-NG doesn't process them because of the DecompressionBombError exception. I tried the fix from the Stack Overflow question that @CoconutMacaroon linked and added the lines to the parsers.py files of paperless_tesseract and paperless_text.
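If editing `parsers.py` has no effect, one possible reason is that the traceback shows the check firing inside OCRmyPDF, which appears to configure Pillow's limit itself. OCRmyPDF documents a `--max-image-mpixels` option. The sketch below assumes the installed version also accepts it as the `max_image_mpixels` keyword of `ocrmypdf.ocr`, which has not been verified against the Paperless-NG image. `with_pixel_limit` is a hypothetical helper, and the argument dictionary is a shortened placeholder.

```python
import math


def with_pixel_limit(ocr_args, required_pixels):
    """Return a copy of the OCRmyPDF arguments with a raised pixel limit.

    Hypothetical helper: `max_image_mpixels` is assumed to be accepted by
    ocrmypdf.ocr() in the installed OCRmyPDF version.
    """
    new_args = dict(ocr_args)  # leave the caller's dictionary untouched
    # The option is expressed in megapixels; round up so the image fits.
    new_args["max_image_mpixels"] = math.ceil(required_pixels / 1_000_000)
    return new_args


# Placeholder arguments; a real call would pass the full set Paperless builds.
args = {"input_file": "input.pdf", "language": "eng"}
print(with_pixel_limit(args, 491_520_000))
```

Raising the limit trades safety for convenience, so it only makes sense for documents from trusted sources.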

Maybe @jonaswinkler can help or assist, if the project is still active.

Coder67844678 commented 2 years ago

Hej :)

Has anyone found a fix for this? I have the same problem.

kg333 commented 2 years ago

Same issue here. Unless my power company is unintentionally delivering malware with my monthly statement, it seems likely that Paperless-NG is getting a false positive. Unfortunately I'm unable to provide the file since it has my address.

Here are the logs. Notably, there's an OCR detection failure prior to the DecompressionBombError.

```
[2022-05-25 09:50:05,049] [DEBUG] [paperless.management.consumer] Waiting for file /usr/src/paperless/src/../consume/20220525_aes.pdf to remain unmodified
[2022-05-25 09:50:10,056] [INFO] [paperless.management.consumer] Adding /usr/src/paperless/src/../consume/20220525_aes.pdf to the task queue.
[2022-05-25 09:50:10,183] [INFO] [paperless.consumer] Consuming 20220525_aes.pdf
[2022-05-25 09:50:10,198] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2022-05-25 09:50:10,207] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2022-05-25 09:50:10,212] [DEBUG] [paperless.consumer] Parsing 20220525_aes.pdf...
[2022-05-25 09:50:10,791] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /usr/src/paperless/src/../consume/20220525_aes.pdf
[2022-05-25 09:50:10,898] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/20220525_aes.pdf', 'output_file': '/tmp/paperless/paperless-c1t5nbd_/archive.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-c1t5nbd_/sidecar.txt'}
[2022-05-25 09:50:17,946] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2022-05-25 09:50:18,014] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-c1t5nbd_/archive.pdf
[2022-05-25 09:50:18,014] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2022-05-25 09:50:18,014] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/usr/src/paperless/src/../consume/20220525_aes.pdf', 'output_file': '/tmp/paperless/paperless-c1t5nbd_/archive-fallback.pdf', 'use_threads': True, 'jobs': 2, 'language': 'eng', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-c1t5nbd_/sidecar-fallback.txt'}
[2022-05-25 09:51:16,535] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-c1t5nbd_
[2022-05-25 09:51:16,546] [ERROR] [paperless.consumer] Error while consuming document 20220525_aes.pdf: DecompressionBombError: Image size (472007070 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 247, in parse
    raise NoTextFoundException(
paperless_tesseract.parsers.NoTextFoundException: No text was found in the original document

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 276, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/api.py", line 340, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 374, in run_pipeline
    exec_concurrent(context, executor)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 271, in exec_concurrent
    executor(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_concurrent.py", line 82, in __call__
    self._execute(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 132, in _execute
    for result in results:
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/usr/local/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_sync.py", line 184, in exec_page_sync
    rasterize_preview_out = rasterize_preview(page_context.origin, page_context)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_pipeline.py", line 338, in rasterize_preview
    page_context.plugin_manager.hook.rasterize_pdf_page(
  File "/usr/local/lib/python3.9/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/local/lib/python3.9/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/local/lib/python3.9/site-packages/pluggy/manager.py", line 84, in <lambda>
    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
  File "/usr/local/lib/python3.9/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/local/lib/python3.9/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.9/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 68, in rasterize_pdf_page
    ghostscript.rasterize_pdf(
  File "/usr/local/lib/python3.9/site-packages/ocrmypdf/_exec/ghostscript.py", line 124, in rasterize_pdf
    with Image.open(BytesIO(p.stdout)) as im:
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 3009, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 2996, in _open_core
    _decompression_bomb_check(im.size)
  File "/usr/local/lib/python3.9/site-packages/PIL/Image.py", line 2905, in _decompression_bomb_check
    raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (472007070 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/consumer.py", line 248, in try_consume_file
    document_parser.parse(self.path, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 286, in parse
    raise ParseError(f"{e.__class__.__name__}: {str(e)}")
documents.parsers.ParseError: DecompressionBombError: Image size (472007070 pixels) exceeds limit of 256000000 pixels, could be decompression bomb DOS attack.
```

EDIT: One additional detail - running the problem PDF through a PDF print utility to regenerate it results in a file that Paperless-NG can successfully process. I used the "Microsoft Print to PDF" option in Windows.
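One plausible explanation, not confirmed anywhere in this thread, is that the affected PDFs declare a very large page size, so rasterising a page at OCR resolution yields a huge image even though the file itself is small. Reprinting would then normalise the page size. The arithmetic can be sketched as follows. The 300 DPI figure and both page sizes are illustrative assumptions, and `rasterized_pixels` is a made-up helper.

```python
def rasterized_pixels(width_pt, height_pt, dpi):
    """Pixel count of a PDF page rasterised at `dpi` (1 pt = 1/72 inch)."""
    width_px = round(width_pt / 72 * dpi)
    height_px = round(height_pt / 72 * dpi)
    return width_px * height_px


LIMIT = 256_000_000  # limit quoted in the error message

# A4 page (595 x 842 pt) at an assumed 300 DPI: far below the limit.
a4 = rasterized_pixels(595, 842, 300)

# Hypothetical oversized page, chosen so that the result matches the
# 491520000 pixels from the first report; the real page size is unknown.
big = rasterized_pixels(4608, 6144, 300)

print(a4, a4 > LIMIT)
print(big, big > LIMIT)
```

If this is the cause, checking the page dimensions of a failing file in any PDF viewer should show a page far larger than A4 or Letter.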