ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: Output file is okay but is not PDF/A #1372

Closed tcurdt closed 1 month ago

tcurdt commented 1 month ago

Describe the bug

I have read through https://github.com/ocrmypdf/OCRmyPDF/issues/490 but I still don't quite understand the message.

Since the target format is PDF/A (the default), why is the output not converted to PDF/A? What is preventing that?

Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
    1 skipping all processing on this page
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
Image optimization ratio: 1.00 savings: 0.0%
Total file size ratio: 0.87 savings: -15.2%
Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)
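For context on the last line of that output: as far as I understand, the check behind "No PDF/A metadata in XMP" looks at the file's XMP metadata packet for the PDF/A identification properties `pdfaid:part` and `pdfaid:conformance`. Below is a minimal sketch of such a check using only the Python standard library. The function name `pdfa_claim` and the sample packet are made up for illustration, and this is not OCRmyPDF's actual implementation.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# XML namespaces used in XMP packets. PDF/A identification lives in "pdfaid".
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

# Hypothetical sample packet, roughly what a PDF/A-2B file would carry.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/">
      <pdfaid:part>2</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def pdfa_claim(xmp_xml: str) -> Optional[str]:
    """Return e.g. '2B' if the XMP packet claims PDF/A conformance, else None."""
    root = ET.fromstring(xmp_xml)
    part_key = f"{{{PDFAID_NS}}}part"
    conf_key = f"{{{PDFAID_NS}}}conformance"
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP allows simple properties as attributes or as child elements.
        part = desc.get(part_key) or desc.findtext(part_key)
        conf = desc.get(conf_key) or desc.findtext(conf_key)
        if part:
            return part.strip() + (conf or "").strip()
    return None
```

If a function like this returns `None` for the output file's XMP packet, the file does not claim to be PDF/A, which would match the warning above. Note that such a claim is only a declaration; it does not validate conformance.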

It would be really nice to adjust the warning message to give a little more context.

The reason seems to show up when running with `-v1`:

Detected SMask which must be in DeviceGray, but we are not converting to DeviceGray, reverting to normal PDF output

But what is an "SMask" and why does it need to be "DeviceGray"? And what metadata does the "No PDF/A metadata in XMP" check expect to find?
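As general PDF background (not something confirmed in this thread): an SMask is a "soft mask", a grayscale image attached to another image that supplies its per-pixel transparency, and the PDF specification requires soft-mask images to use the DeviceGray color space. To see whether the input has such masks, something like the following sketch could help. The function name `smask_report` is hypothetical, and the pikepdf usage in the comments assumes pikepdf is installed. The demo data is made up so the block runs without any PDF.

```python
def smask_report(images):
    """Map each image name to the color space declared by its soft mask.

    `images` is any mapping of name -> dictionary-like object with .get(),
    for example `page.images` from pikepdf (an assumption of this sketch).
    Images without an /SMask entry are left out of the result.
    """
    report = {}
    for name, image in images.items():
        smask = image.get("/SMask")
        if smask is None:
            continue  # no soft mask, so no per-pixel transparency
        report[str(name)] = str(smask.get("/ColorSpace"))
    return report


# Made-up stand-in for one page's images: one with a soft mask, one without.
demo_images = {
    "/Im0": {"/SMask": {"/ColorSpace": "/DeviceGray"}},
    "/Im1": {},
}
demo_result = smask_report(demo_images)

# With pikepdf available, the same function could be applied per page:
#     import pikepdf
#     with pikepdf.open("in.pdf") as pdf:
#         for page in pdf.pages:
#             print(smask_report(page.images))
```

Any image listed in the report carries a soft mask, which is the kind of object the Ghostscript message refers to.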

Steps to reproduce

1. `ocrmypdf -s -l deu+eng in.pdf out.pdf`

Files

No response

How did you download and install the software?

Homebrew

OCRmyPDF version

16.4.3

Relevant log output

ocrmypdf 16.4.3                                                                                                                                  __main__.py:59
Running: ['tesseract', '--version']                                                                                                             __init__.py:133
Found tesseract 5.4.1                                                                                                                           __init__.py:343
Running: ['tesseract', '--version']                                                                                                             __init__.py:133
Running: ['tesseract', '--version']                                                                                                             __init__.py:133
Running: ['gs', '--version']                                                                                                                    __init__.py:133
Found gs 10.3.1                                                                                                                                 __init__.py:343
Running: ['gs', '--version']                                                                                                                    __init__.py:133
Running: ['tesseract', '--list-langs']                                                                                                          __init__.py:133
stdout/stderr = List of available languages in "/opt/homebrew/share/tessdata/" (163):                                                            __init__.py:73
afr                                                                                                                                                            
amh                                                                                                                                                            
ara                                                                                                                                                            
asm                                                                                                                                                            
aze                                                                                                                                                            
aze_cyrl                                                                                                                                                       
bel                                                                                                                                                            
ben                                                                                                                                                            
bod                                                                                                                                                            
bos                                                                                                                                                            
bre                                                                                                                                                            
bul                                                                                                                                                            
cat                                                                                                                                                            
ceb                                                                                                                                                            
ces                                                                                                                                                            
chi_sim                                                                                                                                                        
chi_sim_vert                                                                                                                                                   
chi_tra                                                                                                                                                        
chi_tra_vert                                                                                                                                                   
chr                                                                                                                                                            
cos                                                                                                                                                            
cym                                                                                                                                                            
dan                                                                                                                                                            
deu                                                                                                                                                            
div                                                                                                                                                            
dzo                                                                                                                                                            
ell                                                                                                                                                            
eng                                                                                                                                                            
enm                                                                                                                                                            
epo                                                                                                                                                            
equ                                                                                                                                                            
est                                                                                                                                                            
eus                                                                                                                                                            
fao                                                                                                                                                            
fas                                                                                                                                                            
fil                                                                                                                                                            
fin                                                                                                                                                            
fra                                                                                                                                                            
frk                                                                                                                                                            
frm                                                                                                                                                            
fry                                                                                                                                                            
gla                                                                                                                                                            
gle                                                                                                                                                            
glg                                                                                                                                                            
grc                                                                                                                                                            
guj                                                                                                                                                            
hat                                                                                                                                                            
heb                                                                                                                                                            
hin                                                                                                                                                            
hrv                                                                                                                                                            
hun                                                                                                                                                            
hye                                                                                                                                                            
iku                                                                                                                                                            
ind                                                                                                                                                            
isl                                                                                                                                                            
ita                                                                                                                                                            
ita_old                                                                                                                                                        
jav                                                                                                                                                            
jpn                                                                                                                                                            
jpn_vert                                                                                                                                                       
kan                                                                                                                                                            
kat                                                                                                                                                            
kat_old                                                                                                                                                        
kaz                                                                                                                                                            
khm                                                                                                                                                            
kir                                                                                                                                                            
kmr                                                                                                                                                            
kor                                                                                                                                                            
kor_vert                                                                                                                                                       
lao                                                                                                                                                            
lat                                                                                                                                                            
lav                                                                                                                                                            
lit                                                                                                                                                            
ltz                                                                                                                                                            
mal                                                                                                                                                            
mar                                                                                                                                                            
mkd                                                                                                                                                            
mlt                                                                                                                                                            
mon                                                                                                                                                            
mri                                                                                                                                                            
msa                                                                                                                                                            
mya                                                                                                                                                            
nep                                                                                                                                                            
nld                                                                                                                                                            
nor                                                                                                                                                            
oci                                                                                                                                                            
ori                                                                                                                                                            
osd                                                                                                                                                            
pan                                                                                                                                                            
pol                                                                                                                                                            
por                                                                                                                                                            
pus                                                                                                                                                            
que                                                                                                                                                            
ron                                                                                                                                                            
rus                                                                                                                                                            
san                                                                                                                                                            
script/Arabic                                                                                                                                                  
script/Armenian                                                                                                                                                
script/Bengali                                                                                                                                                 
script/Canadian_Aboriginal                                                                                                                                     
script/Cherokee                                                                                                                                                
script/Cyrillic                                                                                                                                                
script/Devanagari                                                                                                                                              
script/Ethiopic                                                                                                                                                
script/Fraktur                                                                                                                                                 
script/Georgian                                                                                                                                                
script/Greek                                                                                                                                                   
script/Gujarati                                                                                                                                                
script/Gurmukhi                                                                                                                                                
script/HanS                                                                                                                                                    
script/HanS_vert                                                                                                                                               
script/HanT                                                                                                                                                    
script/HanT_vert                                                                                                                                               
script/Hangul                                                                                                                                                  
script/Hangul_vert                                                                                                                                             
script/Hebrew                                                                                                                                                  
script/Japanese                                                                                                                                                
script/Japanese_vert                                                                                                                                           
script/Kannada                                                                                                                                                 
script/Khmer                                                                                                                                                   
script/Lao                                                                                                                                                     
script/Latin                                                                                                                                                   
script/Malayalam                                                                                                                                               
script/Myanmar                                                                                                                                                 
script/Oriya                                                                                                                                                   
script/Sinhala                                                                                                                                                 
script/Syriac                                                                                                                                                  
script/Tamil                                                                                                                                                   
script/Telugu                                                                                                                                                  
script/Thaana                                                                                                                                                  
script/Thai                                                                                                                                                    
script/Tibetan                                                                                                                                                 
script/Vietnamese                                                                                                                                              
sin                                                                                                                                                            
slk                                                                                                                                                            
slv                                                                                                                                                            
snd                                                                                                                                                            
snum                                                                                                                                                           
spa                                                                                                                                                            
spa_old                                                                                                                                                        
sqi                                                                                                                                                            
srp                                                                                                                                                            
srp_latn                                                                                                                                                       
sun                                                                                                                                                            
swa                                                                                                                                                            
swe                                                                                                                                                            
syr                                                                                                                                                            
tam                                                                                                                                                            
tat                                                                                                                                                            
tel                                                                                                                                                            
tgk                                                                                                                                                            
tha                                                                                                                                                            
tir                                                                                                                                                            
ton                                                                                                                                                            
tur                                                                                                                                                            
uig                                                                                                                                                            
ukr                                                                                                                                                            
urd                                                                                                                                                            
uzb                                                                                                                                                            
uzb_cyrl                                                                                                                                                       
vie                                                                                                                                                            
yid                                                                                                                                                            
yor                                                                                                                                                            

pikepdf mmap enabled                                                                                                                             helpers.py:328
os.symlink(2023-11-30 rechnung_104.pdf, /var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/origin)                            helpers.py:179
os.symlink(/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/origin,                                                         helpers.py:179
/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/origin.pdf)                                                                              
Gathering info with 1 thread workers                                                                                                                info.py:800
pikepdf mmap enabled                                                                                                                             helpers.py:328
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                                                                                                      tesseract_ocr.py:199
pikepdf mmap enabled                                                                                                                             helpers.py:328
    1 skipping all processing on this page                                                                                                     _pipeline.py:330
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                            _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                        _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                    ocr.py:144
os.symlink(/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/graft_layers.pdf,                                               helpers.py:179
/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/fix_docinfo.pdf)                                                                         
Running: ['gs', '--version']                                                                                                                    __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None',                  __init__.py:133
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true',                            
'-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr',                                                                        
'/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/pdfa.ps',                                                                               
'/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/fix_docinfo.pdf']                                                                       
GPL Ghostscript 10.03.1 (2024-05-02)                                                                                                            __init__.py:108
Copyright (C) 2024 Artifex Software, Inc.  All rights reserved.                                                                                 __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                      __init__.py:108
see the file COPYING for details.                                                                                                               __init__.py:108
Processing pages 1 through 1.                                                                                                                   __init__.py:108
Page 1                                                                                                                                          __init__.py:108
GPL Ghostscript 10.03.1:                                                                                                                        __init__.py:108
Detected SMask which must be in DeviceGray, but we are not converting to DeviceGray, reverting to normal PDF output                             __init__.py:108
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Running: ['tesseract', '--version']                                                                                                             __init__.py:133
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
xref 24: treating as an optimization candidate                                                                                                  optimize.py:282
xref 24: skipping image because it is an SMask                                                                                                  optimize.py:280
xref 25: treating as an optimization candidate                                                                                                  optimize.py:282
XrefExt(xref=25, ext='.png')                                                                                                                    optimize.py:347
Optimizable images: JPEGs: 0 PNGs: 1                                                                                                            optimize.py:352
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 24: treating as an optimization candidate                                                                                                  optimize.py:282
xref 24: skipping image because it is an SMask                                                                                                  optimize.py:280
xref 25: treating as an optimization candidate                                                                                                  optimize.py:282
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 24: treating as an optimization candidate                                                                                                  optimize.py:282
xref 24: skipping image because it is an SMask                                                                                                  optimize.py:280
xref 25: treating as an optimization candidate                                                                                                  optimize.py:282
Optimizable images: JBIG2 groups: 0                                                                                                             optimize.py:363
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/optimize.opt.pdf,                                               helpers.py:179
/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/optimize.pdf)                                                                            
Running: ['jbig2', '--version']                                                                                                                 __init__.py:133
Running: ['pngquant', '--version']                                                                                                              __init__.py:133
Image optimization ratio: 1.00 savings: 0.0%                                                                                                   _pipeline.py:989
Total file size ratio: 0.87 savings: -15.2%                                                                                                    _pipeline.py:992
/var/folders/2j/qlgsy9ys335cdkc03vd8byfr0000gn/T/ocrmypdf.io.u37txu66/optimize.pdf -> ocred/2023-11-30 rechnung_104.pdf                       _pipeline.py:1064
Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)                                                                      _common.py:443
jbarlow83 commented 1 month ago

The default behavior is to use Ghostscript to attempt PDF/A conversion. Sometimes, Ghostscript fails to produce a PDF/A and reverts to regular PDF instead. When OCRmyPDF notices this, it reports "seems to be No PDF/A metadata in XMP", that is, the file produced by Ghostscript does not have PDF/A metadata markers, even though we asked for them. Ghostscript is not always good at describing why it failed to produce PDF/A. Suffice it to say, the input PDF has some features that prevent PDF/A conversion, as far as Ghostscript is concerned.
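To make "PDF/A metadata markers" concrete: a PDF/A file declares itself in its XMP packet through the `pdfaid:part` and `pdfaid:conformance` properties of the PDF/A identification schema. The following is a minimal sketch of such a check using only the Python standard library. It is not OCRmyPDF's actual implementation; the function name `pdfa_status_from_xmp` and the two sample packets are made up for illustration, and extracting the XMP packet from a real PDF is left out.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"  # PDF/A identification schema


def pdfa_status_from_xmp(xmp_xml: str) -> str:
    """Return e.g. '2B' if the XMP packet declares PDF/A, else ''.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xmp_xml.strip())
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        values = {}
        for name in ("part", "conformance"):
            key = f"{{{PDFAID_NS}}}{name}"
            # XMP allows a property as an attribute or as a child element.
            values[name] = desc.get(key) or desc.findtext(key) or ""
        if values["part"].strip():
            return values["part"].strip() + values["conformance"].strip().upper()
    return ""


# Illustrative packet that claims PDF/A-2B.
SAMPLE_PDFA = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        pdfaid:part="2" pdfaid:conformance="B"/>
  </rdf:RDF>
</x:xmpmeta>
"""

# Illustrative packet with ordinary metadata but no PDF/A claim.
SAMPLE_PLAIN = """
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
        pdf:Producer="GPL Ghostscript 10.03.1"/>
  </rdf:RDF>
</x:xmpmeta>
"""

if __name__ == "__main__":
    print(repr(pdfa_status_from_xmp(SAMPLE_PDFA)))
    print(repr(pdfa_status_from_xmp(SAMPLE_PLAIN)))
```

An empty result corresponds to the "No PDF/A metadata in XMP" case: the file may be a perfectly valid PDF, it just does not claim PDF/A conformance.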

As usual, no file means I can't give any specifics, make any recommendations, or fix any bugs. It's a bit like complaining that your web browser failed to render a web page, but you can't provide me a URI or even a screenshot. I'm very tempted to implement a policy of closing such issues without comment. If you're not willing to share information that is essential to fixing an issue, why bother reporting it?

tcurdt commented 1 month ago

I can read the frustration in your reply. Sorry about that. I wish I could share the file, but it contains too much privacy-relevant information. So a debug log was the next best thing.

Given that some errors and improvements may be unrelated to input files, it would be a shame to accept only issues that include input files.

The way I read it, this really is a Ghostscript problem. I would have to ask them what `Detected SMask which must be in DeviceGray, but we are not converting to DeviceGray, reverting to normal PDF output` means. And there isn't really much OCRmyPDF can do about it anyway. Correct?

Maybe it would be a good idea to change the messaging a bit. Instead of

Output file is okay but is not PDF/A (seems to be No PDF/A metadata in XMP)

maybe something along the lines of

It's a valid PDF but Ghostscript failed to convert it into a PDF/A

and maybe even list the error outside of -v1.
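A rough sketch of what that could look like, assuming the Ghostscript output is available as a list of lines. This is not how OCRmyPDF does it; the function names are hypothetical, and the marker string is simply taken from the one message in the log above, so other Ghostscript versions or failure modes may word it differently.

```python
from typing import Iterable, Optional

# Assumption: Ghostscript announces the fallback with this phrase, as it does
# in the log above. Other failure modes may use different wording.
FALLBACK_MARKER = "reverting to normal PDF output"


def find_pdfa_fallback_reason(gs_log_lines: Iterable[str]) -> Optional[str]:
    """Return the first Ghostscript line that explains the fallback, if any."""
    for line in gs_log_lines:
        if FALLBACK_MARKER in line:
            return line.strip()
    return None


def describe_result(gs_log_lines: Iterable[str]) -> str:
    """Build a warning that includes Ghostscript's own explanation when found."""
    message = "It's a valid PDF but Ghostscript failed to convert it into a PDF/A"
    reason = find_pdfa_fallback_reason(gs_log_lines)
    if reason is None:
        return message
    return f"{message}: {reason}"


if __name__ == "__main__":
    log = [
        "Processing pages 1 through 1.",
        "Page 1",
        "Detected SMask which must be in DeviceGray, but we are not "
        "converting to DeviceGray, reverting to normal PDF output",
    ]
    print(describe_result(log))
```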

jbarlow83 commented 1 month ago

I will improve the error message for the next release.