[BUG] Text deleted from specific PDF

MateEke commented 3 years ago

Describe the bug I have a PDF document which becomes empty after I upload it. I don't see any problem with the original file, and there are no errors in the log. The PDF holds some sensitive data; I can send it in a private message.

To Reproduce

Upload specific file
Open document
See error

Expected behavior The pdf doesn't become empty.

Webserver logs

[2021-04-08 08:15:32,855] [INFO] [paperless.consumer] Consuming Edigital_eSzámla_17465900_2021-04-08-08-03-29.pdf

[2021-04-08 08:15:32,860] [DEBUG] [paperless.consumer] Detected mime type: application/pdf

[2021-04-08 08:15:32,913] [DEBUG] [paperless.consumer] Parser: RasterisedDocumentParser

[2021-04-08 08:15:32,919] [DEBUG] [paperless.consumer] Parsing Edigital_eSzámla_17465900_2021-04-08-08-03-29.pdf...

[2021-04-08 08:15:33,181] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-upload-shy1ykjq

[2021-04-08 08:15:33,537] [DEBUG] [paperless.parsing.tesseract] Calling OCRmyPDF with args: {'input_file': '/tmp/paperless/paperless-upload-shy1ykjq', 'output_file': '/tmp/paperless/paperless-rbwszbsn/archive.pdf', 'use_threads': True, 'jobs': 4, 'language': 'hun+eng', 'output_type': 'pdfa', 'progress_bar': False, 'skip_text': True, 'clean': True, 'deskew': True, 'rotate_pages': True, 'rotate_pages_threshold': 12.0, 'sidecar': '/tmp/paperless/paperless-rbwszbsn/sidecar.txt'}

[2021-04-08 08:15:35,833] [DEBUG] [paperless.parsing.tesseract] Incomplete sidecar file: discarding.

[2021-04-08 08:15:35,850] [DEBUG] [paperless.parsing.tesseract] Extracted text from PDF file /tmp/paperless/paperless-rbwszbsn/archive.pdf

[2021-04-08 08:15:35,851] [DEBUG] [paperless.consumer] Generating thumbnail for Edigital_eSzámla_17465900_2021-04-08-08-03-29.pdf...

[2021-04-08 08:15:35,857] [DEBUG] [paperless.parsing] Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /tmp/paperless/paperless-rbwszbsn/archive.pdf[0] /tmp/paperless/paperless-rbwszbsn/convert.png

[2021-04-08 08:15:37,048] [DEBUG] [paperless.parsing.tesseract] Execute: optipng -silent -o5 /tmp/paperless/paperless-rbwszbsn/convert.png -out /tmp/paperless/paperless-rbwszbsn/thumb_optipng.png

[2021-04-08 08:15:46,742] [DEBUG] [paperless.consumer] Saving record to database

[2021-04-08 08:15:46,876] [INFO] [paperless.handlers] Assigning correspondent Whirlpool to 2021-04-08 Edigital_eSzámla_17465900_2021-04-08-08-03-29

[2021-04-08 08:15:46,881] [INFO] [paperless.handlers] Assigning document type Contract to 2021-04-08 Whirlpool Edigital_eSzámla_17465900_2021-04-08-08-03-29

[2021-04-08 08:15:46,888] [INFO] [paperless.handlers] Tagging "2021-04-08 Whirlpool Edigital_eSzámla_17465900_2021-04-08-08-03-29" with "[redacted]"

[2021-04-08 08:15:46,985] [DEBUG] [paperless.consumer] Deleting file /tmp/paperless/paperless-upload-shy1ykjq

[2021-04-08 08:15:46,993] [DEBUG] [paperless.parsing.tesseract] Deleting directory /tmp/paperless/paperless-rbwszbsn

[2021-04-08 08:15:46,993] [INFO] [paperless.consumer] Document 2021-04-08 Whirlpool Edigital_eSzámla_17465900_2021-04-08-08-03-29 consumption finished

Relevant information

Host OS of the machine running paperless: Ubuntu 20.04
Browser chrome
Version 1.4.0
Installation method: docker
Any configuration changes you made in docker-compose.yml, docker-compose.env or paperless.conf.

Spoker commented 3 years ago

I have a similar issue with one type of PDF. If I open the original, it's fine, but the PDF/A version (I think) that paperless makes is empty.

jonaswinkler commented 3 years ago

Try using OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) on the original PDF document. You can install that with PIP, or use Docker. If this also removes the text, it's an issue with OCRmyPDF, and you should report the issue there.
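
One way to tell whether a given output really lost its text is to compare how much text can be extracted before and after. A minimal sketch, assuming `pdftotext` from poppler-utils is available (it is not mentioned in this thread); `text_bytes` is a hypothetical helper name:

```shell
# Hypothetical helper: count the non-whitespace bytes read from stdin.
# A result of 0 means no text at all was extracted.
text_bytes() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# Possible usage (file names are placeholders):
#   pdftotext original.pdf - | text_bytes
#   pdftotext output.pdf - | text_bytes
```

The count is in bytes rather than characters, so accented letters in UTF-8 count more than once. That is enough to tell "empty" from "not empty", but not to judge encoding problems.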

MateEke commented 3 years ago

I have tried with this command:

ekemate@ragnar-pop-os:~$ ocrmypdf -l hun+eng --output-type pdfa --skip-text --clean ~/Downloads/Edigital_eSzámla_17465900_2021-04-08-08-03-29.pdf test.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.16page/s]
This PDF has a fillable form. Chances are it is a pure digital document that does not need OCR.
Use the option --force-ocr to produce an image of the form and all filled form fields. The output PDF will be 'flattened' and will no longer be fillable.
Using Tesseract OpenMP thread limit 3
    1 skipping all processing on this page                                                                                                                                                                                                    
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 149.77page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

The output has some strange characters (I don't know why; edit: I think the character Ő causes it), but it is definitely not empty.

jonaswinkler commented 3 years ago

Please try with "--deskew --rotate-pages" in addition as well.

MateEke commented 3 years ago

The result is the same as before.

jonaswinkler commented 3 years ago

Thanks. I'll see what I can do.

jonaswinkler commented 3 years ago

Alright, so what I did is the following:

Make the files available in the docker container. I just moved them into the export folder that gets mounted to /usr/src/paperless/export.
Start paperless.
docker-compose exec webserver /bin/bash gets me a shell in the container.
cd to the export directory.
ocrmypdf input.pdf output.pdf --skip-text yields the following:

Postprocessing...
GPL Ghostscript 9.53.3 (2020-10-01)                                                                                                                                                                                                           
Copyright (C) 2020 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Can't find (or can't open) font file /usr/share/ghostscript/9.53.3/Resource/Font//usr/share/g.
Can't find (or can't open) font file Arial-BoldMT.
Can't find (or can't open) font file /usr/share/ghostscript/9.53.3/Resource/Font//usr/share/g.
Can't find (or can't open) font file Arial-BoldMT.
Querying operating system for font files...
Can't find (or can't open) font file /usr/share/ghostscript/9.53.3/Resource/Font//usr/share/g.
Can't find (or can't open) font file Arial-BoldMT.
Didn't find this font on the system!
Substituting font Helvetica-Bold for Arial-BoldMT.
Loading NimbusSans-Bold font from /usr/share/ghostscript/9.53.3/Resource/Font/NimbusSans-Bold... 5015192 3517120 2996192 1593499 4 done.
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
The fallback CID font "CIDFallBack" is not provided.  Finally attempting to use ArtifexBullet.

 Error reading a content stream. The page may be incomplete.                                                                                                                                                                                  
               Output may be incorrect.

 Error: File did not complete the page properly and may be damaged.                                                                                                                                                                           
               Output may be incorrect.
GPL Ghostscript 9.53.3: Annotation set to non-printing,
 not permitted in PDF/A, annotation will not be present in output file

Apparently some fonts are missing, but I can't figure out which package is required here.
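
To see at a glance which fonts Ghostscript could not resolve, the relevant warnings can be filtered out of the log. A minimal sketch, assuming the message wording of Ghostscript 9.53.3 shown above (other versions may phrase it differently); `missing_fonts` is a hypothetical helper name, not part of OCRmyPDF or Ghostscript:

```shell
# Hypothetical helper: list font names that Ghostscript substituted or
# could not find. Reads a log on stdin and prints one name per line.
# The patterns match the 9.53.3 messages quoted in this thread.
missing_fonts() {
    sed -n \
        -e 's/^Can.t find CID font "\(.*\)"\.[[:space:]]*$/\1/p' \
        -e 's/^Substituting font .* for \(.*\)\.[[:space:]]*$/\1/p' \
    | sort -u
}

# Possible usage, assuming the messages arrive on stderr:
#   ocrmypdf input.pdf output.pdf --skip-text 2>&1 | missing_fonts
```

For the log above this would print Arial and Arial-BoldMT, which are the names a substitution entry would have to cover.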

MateEke commented 3 years ago

I think it can be solved with font substitution, without installing any additional packages. I had to substitute the Arial CID font with the already installed Liberation font alternative, as described in the Ghostscript manual. (Simply installing the Arial font wasn't enough, and it added a lot of unnecessary bloat to the image.)

Here is what I did: I have created a new Dockerfile based on the original:

FROM jonaswinkler/paperless-ng:latest

RUN echo "/Arial << /FileType /TrueType /Path (/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf) /SubfontID 0 /CSI [(Identity) 6] >> ;" >> /usr/share/ghostscript/9.53.3/Resource/Init/cidfmap

After that running the same command as you:

root@4eef1a430e6c:/usr/src/paperless/src# ocrmypdf input.pdf output.pdf --skip-text
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.28page/s]
This PDF has a fillable form. Chances are it is a pure digital document that does not need OCR.
Use the option --force-ocr to produce an image of the form and all filled form fields. The output PDF will be 'flattened' and will no longer be fillable.
    1 skipping all processing on this page                                                                                                                                                                                                    
OCR: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00,  9.27page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.56page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

Now I only have a problem with the encoding of some characters (e.g. Ő). I still don't know exactly how to fix it, but I think the /CSI [(Identity) 6] part of the inserted command is the relevant bit here.
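
For experimenting with different /CSI values, the cidfmap record can be generated instead of typed by hand. A minimal sketch that only reproduces the record format used in the Dockerfile above; `cidfmap_entry` is a hypothetical helper name, and which supplement number is right for a given font is not something this sketch decides:

```shell
# Hypothetical helper: print one cidfmap substitution record.
#   $1 = CID font name requested by the PDF (e.g. Arial)
#   $2 = path to the TrueType file to use instead
#   $3 = number placed after (Identity) in /CSI
cidfmap_entry() {
    printf '/%s << /FileType /TrueType /Path (%s) /SubfontID 0 /CSI [(Identity) %s] >> ;\n' \
        "$1" "$2" "$3"
}

# Possible usage:
#   cidfmap_entry Arial /usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf 6
```

The output can then be appended to the cidfmap file in the same way as in the Dockerfile.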

MateEke commented 3 years ago

OK, I have experimented with it a little further: for 100% correct encoding you have to install the Arial font. The 100% correct Dockerfile:

FROM jonaswinkler/paperless-ng:latest

RUN sed -i'.bak' 's/$/ contrib/' /etc/apt/sources.list
RUN apt-get update && apt-get install -y --no-install-recommends ttf-mscorefonts-installer
RUN echo "/Arial <</FileType /TrueType /Path (/usr/share/fonts/truetype/msttcorefonts/arial.ttf) /SubfontID 0 /CSI [(Identity) 0] >> ;" >> /usr/share/ghostscript/9.53.3/Resource/Init/cidfmap
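
Because the record is appended with >>, running the same command again in an already patched container would add it a second time. A minimal sketch of a guard against that, in case the fix is tried out in a running container rather than in an image build; `append_once` is a hypothetical helper name:

```shell
# Hypothetical helper: append a line to a file unless an identical
# line is already present.
#   $1 = the exact line to add, $2 = the target file
append_once() {
    line=$1
    file=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Possible usage, with the record and path taken from the Dockerfile above:
#   append_once "/Arial << ... >> ;" /usr/share/ghostscript/9.53.3/Resource/Init/cidfmap
```

In a Dockerfile build the guard is not needed, since every build starts from the unmodified base image.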

jonaswinkler / paperless-ng

[BUG] Text deleted from specific PDF #881