ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

[Bug]: pdfminer.pdfexceptions.PDFTypeError: invalid length: 6 #1361

Closed user1823 closed 2 months ago

user1823 commented 2 months ago

Describe the bug

OCR failed to complete.

Steps to reproduce

1. Run ocrmypdf --output-type pdf --max-image-mpixels 1000 --tesseract-downsample-above 3508 --redo-ocr in.pdf out.pdf
2. See error.

Files

Let me know if you need the file (if the issue is not clear from the error message)

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.4.2

Relevant log output

Scanning contents     ━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━  22%  52/239 0:00:16
An exception occurred while executing the pipeline                _common.py:284
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 249, in cli_exception_handler
    return fn(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 174, in _run_pipeline
    pdfinfo = get_pdfinfo(
              ^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 186, in get_pdfinfo
    return PdfInfo(
           ^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 1133, in __init__
    self._pages = _pdf_pageinfo_concurrent(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 793, in _pdf_pageinfo_concurrent
    executor(
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
    self._execute(
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 144, in _execute
    result = future.result()
             ^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 742, in _pdf_pageinfo_sync
    return PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 857, in __init__
    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/pdfinfo/info.py", line 882, in _gather_pageinfo
    miner = get_page_analysis(infile, pageno, pscript5_mode)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/ocrmypdf/pdfinfo/layout.py", line 313, in get_page_analysis
    interp.process_page(page)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 384, in init_resources
    self.fontmap = self.rsrcmgr.get_font(objid, spec)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 234, in get_font
    font = self.get_font(None, subspec)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 225, in get_font
    font = PDFCIDFont(self, spec)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/pdffont.py", line 1084, in __init__
    CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/cmapdb.py", line 299, in run
    self.nextobject()
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/psparser.py", line 648, in nextobject
    self.do_keyword(pos, token)
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/cmapdb.py", line 459, in do_keyword
    self.cmap.add_cid2unichr(nunpack(cid), code)
                             ^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/pdfminer/utils.py", line 365, in nunpack
    raise PDFTypeError("invalid length: %d" % length)
pdfminer.pdfexceptions.PDFTypeError: invalid length: 6

jbarlow83 commented 2 months ago

Probably corrupt font, but will need test file.

user1823 commented 2 months ago

Test file: in.pdf

user1823 commented 2 months ago

If I rewrite this file using Ghostscript (with the command below) and then run ocrmypdf, the issue disappears.

gswin64.exe -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=gs.pdf in.pdf

But still, the quality of the OCR is very poor. OCRmyPDF barely changes any of the original (incorrect) text when using the --redo-ocr option.

I see this warning/advice in the terminal:

1 some text on this page cannot be mapped to characters: consider using --force-ocr instead

Now, if I use --force-ocr, the quality of the OCR is drastically better but the file size increases by 15%. So, is using --force-ocr the only way to OCR this file? Or is there any other hack available that I can use or you can add to OCRmyPDF?

Or is the issue (of --redo-ocr not being useful) caused by Ghostscript? Is the OCRed text accurate when the current (unreleased) version of OCRmyPDF is used on the original file?
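The two-step workaround described in this comment (rewrite with Ghostscript, then run OCRmyPDF on the result) could be scripted roughly as follows. This is only a sketch: the helper names are hypothetical, the flags are the ones quoted in this thread, and the `gs` executable name is an assumption (the comment above uses gswin64.exe on Windows).

```python
def ghostscript_rewrite_cmd(src: str, dst: str, gs_exe: str = "gs") -> list[str]:
    """Build the Ghostscript command that rewrites src into dst via pdfwrite."""
    return [
        gs_exe,
        "-sDEVICE=pdfwrite",
        "-dBATCH",
        "-dNOPAUSE",
        f"-sOutputFile={dst}",
        src,
    ]


def ocrmypdf_redo_cmd(src: str, dst: str) -> list[str]:
    """Build an ocrmypdf command using the --redo-ocr options from this report."""
    return ["ocrmypdf", "--output-type", "pdf", "--redo-ocr", src, dst]


# Print the commands instead of running them, so the sketch has no side effects.
for cmd in (
    ghostscript_rewrite_cmd("in.pdf", "gs.pdf"),
    ocrmypdf_redo_cmd("gs.pdf", "out.pdf"),
):
    print(" ".join(cmd))
```

To actually execute the steps, each list can be passed to `subprocess.run(cmd, check=True)`, which stops the script if either tool exits with an error.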

jbarlow83 commented 2 months ago

The issue was with pdfminer interpreting the Unicode mapping data. If Ghostscript rewrote it, it could have worked around the issue. Even a one-byte adjustment could have been a workaround.

--redo-ocr has some limitations - there's no standard way of encoding OCR or marking text as OCR, so it can't detect all cases.

--force-ocr is the best option for this file.
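To illustrate the kind of failure in the traceback ("invalid length: 6"), here is a minimal sketch. It assumes, based only on the error message, that the failing helper accepts a few fixed byte widths. The functions `unpack_fixed` and `unpack_any` are hypothetical stand-ins and are not pdfminer's code, and the six-byte value is made up.

```python
import struct

# Byte widths that struct can unpack directly as big-endian unsigned integers.
_FORMATS = {1: ">B", 2: ">H", 4: ">L", 8: ">Q"}


def unpack_fixed(data: bytes) -> int:
    """Hypothetical fixed-width unpacker: rejects widths it has no format for."""
    fmt = _FORMATS.get(len(data))
    if fmt is None:
        raise TypeError(f"invalid length: {len(data)}")
    return struct.unpack(fmt, data)[0]


def unpack_any(data: bytes) -> int:
    """Width-agnostic alternative: read the bytes as one big-endian integer."""
    return int.from_bytes(data, "big")


code = bytes.fromhex("d83dde00d83d")  # six bytes, invented for this example
try:
    unpack_fixed(code)
except TypeError as exc:
    print(exc)  # invalid length: 6
print(unpack_any(code))
```

This also shows why rewriting the file can sidestep the problem: if the rewritten mapping data happens to use only widths the parser supports, the fixed-width path no longer raises.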

jbarlow83 commented 2 months ago

Regarding this

1 some text on this page cannot be mapped to characters: consider using --force-ocr instead

That means the mapping to Unicode is incomplete. This can cause characters to appear correctly when selected, but they will copy-paste as gibberish. The behavior also varies between PDF viewers, since some apply heuristics to detect the text encoding. That's why it's best to throw out everything and force OCR for this file.
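A toy model may make the incomplete mapping concrete. It treats a font's Unicode mapping as a plain dictionary from glyph codes to characters. The codes, the dictionary, and the helper names are all invented for illustration and do not reflect how OCRmyPDF or any PDF viewer stores this data.

```python
# Hypothetical mapping from glyph codes to Unicode; code 4 has no entry.
to_unicode = {1: "O", 2: "C", 3: "R"}


def extract_text(codes: list[int], mapping: dict[int, str]) -> str:
    """Return the text a copy-paste would yield, using U+FFFD for unmapped codes."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


def unmapped_codes(codes: list[int], mapping: dict[int, str]) -> list[int]:
    """List the glyph codes on the page that cannot be mapped to characters."""
    return sorted({code for code in codes if code not in mapping})


page_codes = [1, 2, 3, 4, 4]  # the glyphs still render; only the text layer suffers
print(extract_text(page_codes, to_unicode))
print(unmapped_codes(page_codes, to_unicode))  # [4]
```

A non-empty list of unmapped codes corresponds to the situation the warning describes: the page looks fine, but part of its text cannot be recovered from the existing mapping.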

user1823 commented 2 months ago

this can cause characters to appear correctly when selected, but they will copy-paste as gibberish

Is it not possible for OCRmyPDF to correct the mapping of the characters based on the characters detected by OCR?

To clarify, my question is not whether OCRmyPDF is currently able to correct the mapping (which I assume it can't). My question is whether OCRmyPDF can be modified to be able to correct the mapping.

The main reasons for which I don't want to use --force-ocr include

If there is a way around, I would really like to avoid using --force-ocr.

jbarlow83 commented 2 months ago

Is it not possible for OCRmyPDF to correct the mapping of the characters based on the characters detected by OCR?

Possible but hard. That's pretty major surgery and the results from doing something like force-ocr are often better. Ghostscript recently added a mode that attempts to fix broken font mappings (whether the font is OCR-derived or of some other origin).

jbarlow83 commented 2 months ago

You can avoid lossy recompression using --output-type pdf and --optimize 1. The images do get rendered, but at a higher DPI than their source, so this is safe in almost all cases.
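As a rough sketch of this suggestion, the snippet below builds an ocrmypdf command that combines --force-ocr with the two options named above, and adds a small helper for checking how much the file size changed afterwards. Both function names are hypothetical. Only the flags come from this thread.

```python
def force_ocr_cmd(src: str, dst: str) -> list[str]:
    """Build an ocrmypdf command with --force-ocr, --output-type pdf, --optimize 1."""
    return [
        "ocrmypdf",
        "--force-ocr",
        "--output-type", "pdf",
        "--optimize", "1",
        src,
        dst,
    ]


def size_change_percent(before_bytes: int, after_bytes: int) -> float:
    """Percentage change in file size; positive means the output grew."""
    return (after_bytes - before_bytes) * 100.0 / before_bytes


print(" ".join(force_ocr_cmd("in.pdf", "out.pdf")))
print(size_change_percent(200, 230))  # 15.0
```

After a real run, the two sizes can be read with `os.path.getsize("in.pdf")` and `os.path.getsize("out.pdf")` to see whether the output still grows by something like the 15% reported earlier.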

user1823 commented 2 months ago

Possible but hard. That's pretty major surgery

I would really appreciate it if such a feature were eventually added to OCRmyPDF (since you said it's hard, I don't expect it anytime soon).

Ghostscript recently added a mode that attempts to fix broken font mappings

Can you please tell me how to activate that mode?

You can avoid lossy recompression using --output-type pdf and --optimize 1.

Isn't --optimize 1 the default?