Open MateEke opened 3 years ago
I have a similar issue with one type of PDF. The original opens fine, but the PDF/A version that paperless makes (I think) is empty.
Try using OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF) on the original PDF document. You can install it with pip, or use Docker. If this also removes the text, it's an issue with OCRmyPDF, and you should report it there.
I have tried with this command:
ekemate@ragnar-pop-os:~$ ocrmypdf -l hun+eng --output-type pdfa --skip-text --clean ~/Downloads/Edigital_eSzámla_17465900_2021-04-08-08-03-29.pdf test.pdf
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.16page/s]
This PDF has a fillable form. Chances are it is a pure digital document that does not need OCR.
Use the option --force-ocr to produce an image of the form and all filled form fields. The output PDF will be 'flattened' and will no longer be fillable.
Using Tesseract OpenMP thread limit 3
1 skipping all processing on this page
OCR: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 149.77page/s]
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)
The output has some strange characters (I don't know why; edit: I think the character Ő causes it), but it is definitely not empty.
Please also try adding "--deskew --rotate-pages".
The result is the same as before.
Thanks. I'll see what I can do.
Alright, so what I did is the following:
docker-compose exec webserver /bin/bash
gets me a shell in the container.
ocrmypdf input.pdf output.pdf --skip-text
yields the following:
Postprocessing...
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Can't find (or can't open) font file /usr/share/ghostscript/9.53.3/Resource/Font//usr/share/g.
Can't find (or can't open) font file Arial-BoldMT.
Can't find (or can't open) font file /usr/share/ghostscript/9.53.3/Resource/Font//usr/share/g.
Can't find (or can't open) font file Arial-BoldMT.
Querying operating system for font files...
Can't find (or can't open) font file /usr/share/ghostscript/9.53.3/Resource/Font//usr/share/g.
Can't find (or can't open) font file Arial-BoldMT.
Didn't find this font on the system!
Substituting font Helvetica-Bold for Arial-BoldMT.
Loading NimbusSans-Bold font from /usr/share/ghostscript/9.53.3/Resource/Font/NimbusSans-Bold... 5015192 3517120 2996192 1593499 4 done.
Can't find CID font "Arial".
Attempting to substitute CID font /Adobe-Identity for /Arial, see doc/Use.htm#CIDFontSubstitution.
The substitute CID font "Adobe-Identity" is not provided either. attempting to use fallback CIDFont.See doc/Use.htm#CIDFontSubstitution.
The fallback CID font "CIDFallBack" is not provided. Finally attempting to use ArtifexBullet.
Error reading a content stream. The page may be incomplete.
Output may be incorrect.
Error: File did not complete the page properly and may be damaged.
Output may be incorrect.
GPL Ghostscript 9.53.3: Annotation set to non-printing,
not permitted in PDF/A, annotation will not be present in output file
Apparently some fonts are missing, but I can't figure out which package is required here.
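To see at a glance which fonts Ghostscript gave up on, the log can be filtered down to the distinct missing names. This is only a sketch: `missing_fonts` is a hypothetical helper name, and it matches only the two message shapes that appear in the log above ("Can't find (or can't open) font file ..." and "Can't find CID font ..."). Other Ghostscript versions may word these messages differently.

```shell
# Print each distinct font name that Ghostscript reported as missing.
# Reads a Ghostscript log on stdin. CID fonts are prefixed with "CID:".
missing_fonts() {
    sed -n \
        -e "s/^Can't find (or can't open) font file \(.*\)\.\$/\1/p" \
        -e "s/^Can't find CID font \"\(.*\)\"\.\$/CID:\1/p" |
        grep -v '^/' |   # drop failed path lookups, keep bare font names
        sort -u
}
```

Assuming the Ghostscript messages reach the terminal as they do above, something like `ocrmypdf input.pdf output.pdf --skip-text 2>&1 | missing_fonts` should list the names. For the log in this thread that would be Arial-BoldMT and the CID font Arial.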
I think it can be solved with font substitution, without installing any additional packages. I had to substitute the Arial CID font with the already installed Liberation alternative. (Simply installing the Arial font wasn't enough, and it added a lot of unnecessary bloat to the image.) See the Ghostscript manual (doc/Use.htm#CIDFontSubstitution, as referenced in the log above).
Here is what I did: I have created a new Dockerfile based on the original:
FROM jonaswinkler/paperless-ng:latest
RUN echo "/Arial << /FileType /TrueType /Path (/usr/share/fonts/truetype/liberation/LiberationSans-Regular.ttf) /SubfontID 0 /CSI [(Identity) 6] >> ;" >> /usr/share/ghostscript/9.53.3/Resource/Init/cidfmap
After that running the same command as you:
root@4eef1a430e6c:/usr/src/paperless/src# ocrmypdf input.pdf output.pdf --skip-text
Scanning contents: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.28page/s]
This PDF has a fillable form. Chances are it is a pure digital document that does not need OCR.
Use the option --force-ocr to produce an image of the form and all filled form fields. The output PDF will be 'flattened' and will no longer be fillable.
1 skipping all processing on this page
OCR: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.0/1.0 [00:00<00:00, 9.27page/s]
Postprocessing...
PDF/A conversion: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.56page/s]
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)
Now I only have a problem with the encoding of some characters (e.g. Ő). I still don't know exactly how to fix it, but I think the /CSI [(Identity) 6] part of the inserted command is the relevant bit here.
OK, I experimented with it a bit further: for 100% correct encoding you have to install the Arial font. The 100% correct Dockerfile:
FROM jonaswinkler/paperless-ng:latest
RUN sed -i'.bak' 's/$/ contrib/' /etc/apt/sources.list
RUN apt-get update && apt-get install -y --no-install-recommends ttf-mscorefonts-installer
RUN echo "/Arial <</FileType /TrueType /Path (/usr/share/fonts/truetype/msttcorefonts/arial.ttf) /SubfontID 0 /CSI [(Identity) 0] >> ;" >> /usr/share/ghostscript/9.53.3/Resource/Init/cidfmap
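The `echo ... >> cidfmap` step appends a new line every time it runs, so repeating it outside a clean image build would leave duplicate entries. A minimal sketch of an idempotent variant follows. `add_cid_substitute` is a hypothetical helper name, and the entry format is copied from the Dockerfile above (including `/CSI [(Identity) 0]`). Treat it as an illustration, not as the way paperless-ng or Ghostscript expects this to be done.

```shell
# Append a CID font substitution entry to a Ghostscript cidfmap file,
# unless an entry for that font name is already there.
#   $1 = CID font name to substitute (e.g. Arial)
#   $2 = path to the TrueType file to use instead
#   $3 = path to the cidfmap file
add_cid_substitute() {
    name=$1 ttf=$2 cidfmap=$3
    # An existing entry starts with "/<name> " at the beginning of a line.
    if [ -f "$cidfmap" ] && grep -q "^/$name " "$cidfmap"; then
        return 0
    fi
    printf '/%s << /FileType /TrueType /Path (%s) /SubfontID 0 /CSI [(Identity) 0] >> ;\n' \
        "$name" "$ttf" >> "$cidfmap"
}
```

Inside the container it could be called as `add_cid_substitute Arial /usr/share/fonts/truetype/msttcorefonts/arial.ttf /usr/share/ghostscript/9.53.3/Resource/Init/cidfmap`. The 9.53.3 in that path is the Ghostscript version from the log above, so the path will probably need adjusting if the base image ships a different version.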
Describe the bug: I have a PDF document that becomes empty after I upload it. I don't see any problem with the original file, and there are no errors in the log. The PDF holds some sensitive data, so I can send it in a private message.
Expected behavior: The PDF doesn't become empty.