jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

[BUG] Colors in PDF are partly false after import #947

Open holzhannes opened 3 years ago

holzhannes commented 3 years ago

Describe the bug: After importing an invoice, some parts of the PDF are shown in red in Paperless, although they are green and grey in the original PDF.

To Reproduce: Upload the example file and you will see the colors change from green to red. To produce the example file, I opened the original file in Affinity Publisher, extracted the error zone as .svg, and created a new PDF from it. The error still occurs with that file.

Expected behavior: PDF colors should not change.

Screenshots: in-paperless, original

Relevant information

jonaswinkler commented 3 years ago

  1. Try this with OCRmyPDF standalone.
  2. If the error persists, report this to OCRmyPDF.

holzhannes commented 3 years ago

I tested the example file on my local machine with OCRmyPDF (12.0.0), but the error doesn't occur.

rYR79435 commented 3 years ago

I can reproduce the error on my installation.

Relevant information

Log

[2021-05-04 18:12:59,459] [INFO] [paperless.consumer] Consuming error-zone.pdf
[2021-05-04 18:12:59,461] [DEBUG] [paperless.consumer] Detected mime type: application/pdf
[2021-05-04 18:12:59,473] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser
[2021-05-04 18:12:59,476] [DEBUG] [paperless.consumer] Parsing error-zone.pdf...
[2021-05-04 18:12:59,757] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-xgpguedh
[2021-05-04 18:12:59,898] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-xgpguedh', 'output_file': '/tmp/paperless/paperless-zz10sst0/archive.pdf', 'use_threads': True, 'jobs': '2', 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-zz10sst0/sidecar.txt'}
[2021-05-04 18:13:00,910] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.
[2021-05-04 18:13:01,118] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-zz10sst0/archive.pdf
[2021-05-04 18:13:01,118] [WARNING] [paperless.parsing.tesseract] Encountered an error while running OCR: No text was found in the original document. Attempting force OCR to get the text.
[2021-05-04 18:13:01,119] [DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-xgpguedh', 'output_file': '/tmp/paperless/paperless-zz10sst0/archive-fallback.pdf', 'use_threads': True, 'jobs': '2', 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'force_ocr': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-zz10sst0/sidecar-fallback.txt'}
[2021-05-04 18:14:25,667] [DEBUG] [paperless.parsing.tesseract] Using text from sidecar file
[2021-05-04 18:14:25,668] [WARNING] [paperless.parsing.tesseract] No text was found in /tmp/paperless/paperless-upload-xgpguedh, the content will be empty.
[2021-05-04 18:14:25,668] [DEBUG] [paperless.consumer] Generating thumbnail for error-zone.pdf...
[2021-05-04 18:14:25,673] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-zz10sst0/archive.pdf[0] /tmp/paperless/paperless-zz10sst0/convert.png
[2021-05-04 18:14:31,710] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-zz10sst0/convert.png -out /tmp/paperless/paperless-zz10sst0/thumb_optipng.png
[2021-05-04 18:14:34,329] [DEBUG] [paperless.consumer] Saving record to database
[2021-05-04 18:14:34,341] [INFO] [paperless.handlers] Assigning correspondent REDACTED error-zone
[2021-05-04 18:14:34,345] [INFO] [paperless.handlers] Assigning document type REDACTED to 2021-05-04 REDACTED error-zone
[2021-05-04 18:14:34,348] [INFO] [paperless.handlers] Tagging "2021-05-04 REDACTED error-zone" with "REDACTED"
[2021-05-04 18:14:34,386] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-xgpguedh
[2021-05-04 18:14:34,397] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-zz10sst0
[2021-05-04 18:14:34,398] [INFO] [paperless.consumer] Document 2021-05-04 REDACTED error-zone consumption finished
[2021-05-04 18:15:05,450] [WARNING] [paperless.parsing.tesseract] Error while reading metadata {http://ns.adobe.com/pdfx/1.3/}trapped: false. Error: 'http://ns.adobe.com/pdfx/1.3/'
[2021-05-04 18:15:48,996] [DEBUG] [paperless.handlers] Deleted file /usr/src/paperless/src/../media/documents/originals/2021/REDACTED/error-zone.pdf.
[2021-05-04 18:15:48,996] [DEBUG] [paperless.handlers] Deleted file /usr/src/paperless/src/../media/documents/archive/2021/REDACTED/error-zone.pdf.
[2021-05-04 18:15:48,997] [DEBUG] [paperless.handlers] Deleted file /usr/src/paperless/src/../media/documents/thumbnails/0000149.png.

The last 4 log lines seem to come from me opening the document (edit view) and then deleting it.
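
The log shows the exact arguments Paperless hands to OCRmyPDF, including `clean`, `deskew`, and the `force_ocr` fallback. A standalone test may only match the Paperless result if it uses the same options. The following is a minimal sketch, not part of Paperless, that turns a logged "Calling OCRmyPDF with args" line into an equivalent command line. The helper names and the key-to-flag mapping are assumptions made for illustration, so check the flags against `ocrmypdf --help` for the installed version.

```python
import ast

# Assumed mapping from the keyword arguments in the log to OCRmyPDF CLI flags.
# Verify against `ocrmypdf --help`; flags can differ between versions.
VALUE_FLAGS = {
    "language": "--language",
    "output_type": "--output-type",
    "jobs": "--jobs",
    "rotate_pages_threshold": "--rotate-pages-threshold",
    "sidecar": "--sidecar",
}
SWITCH_FLAGS = {
    "skip_text": "--skip-text",
    "force_ocr": "--force-ocr",
    "clean": "--clean",
    "deskew": "--deskew",
    "rotate_pages": "--rotate-pages",
}

MARKER = "Calling OCRmyPDF with args: "


def args_from_log_line(line):
    """Return the argument dict that follows MARKER in a debug log line."""
    _, _, tail = line.partition(MARKER)
    if not tail:
        raise ValueError("not an OCRmyPDF invocation line")
    # The log prints a plain Python dict literal, so literal_eval is sufficient.
    return ast.literal_eval(tail.strip())


def to_command(args):
    """Build an argv list for the ocrmypdf CLI from the logged arguments.

    Keys without an entry in the two tables above (for example
    'use_threads' and 'progress_bar') are deliberately ignored here.
    """
    cmd = ["ocrmypdf"]
    for key, flag in VALUE_FLAGS.items():
        if key in args:
            cmd += [flag, str(args[key])]
    for key, flag in SWITCH_FLAGS.items():
        if args.get(key):  # only emit switches that are set to True
            cmd.append(flag)
    cmd += [args["input_file"], args["output_file"]]
    return cmd


if __name__ == "__main__":
    # Shortened example line; in.pdf and out.pdf are placeholder paths.
    example = (
        "[DEBUG] [paperless.parsing.tesseract] Fallback: Calling OCRmyPDF "
        "with args: {'input_file': 'in.pdf', 'output_file': 'out.pdf', "
        "'language': 'deu', 'output_type': 'pdfa', 'force_ocr': True, "
        "'clean': True, 'deskew': True}"
    )
    print(" ".join(to_command(args_from_log_line(example))))
```

Running the resulting command once for the first call (with `skip_text`) and once for the fallback call (with `force_ocr`) should show which of the two invocations, if either, changes the colors outside of Paperless.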