pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Crash in PDFSimpleFont.__init__ (& monkey patch workaround) #510

Open eoinof opened 3 years ago

eoinof commented 3 years ago

Bug report

I'm seeing a crash in the latest release of pdfminer.six (20200726) with certain PDF files. Unfortunately for privacy reasons I can't share these.

The crash occurs because the 'encoding' variable in pdffont.PDFSimpleFont.__init__ is a list, as opposed to either a dict or a string.

This is the value of 'encoding' that triggers the crash:

encoding = [
/'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'space', /'exclam', /'universal', /'numbersign', /'existential', /'percent', /'ampersand', /'suchthat', /'parenleft', /'parenright', /'asteriskmath', /'plus', /'comma', /'minus', /'period', /'slash', /'zero', /'one', /'two', /'three', /'four', /'five', /'six', /'seven', /'eight', /'nine', /'colon', /'semicolon', /'less', /'equal', /'greater', /'question', /'congruent', /'Alpha', /'Beta', /'Chi', /'Delta', /'Epsilon', /'Phi', /'Gamma', /'Eta', /'Iota', /'theta1', /'Kappa', /'Lambda', /'Mu', /'Nu', /'Omicron', /'Pi', /'Theta', /'Rho', /'Sigma', /'Tau', /'Upsilon', /'sigma1', /'Omega', /'Xi', /'Psi', /'Zeta', /'bracketleft', /'therefore', /'bracketright', /'perpendicular', /'underscore', /'radicalex', /'alpha', /'beta', /'chi', /'delta', /'epsilon', /'phi', /'gamma', /'eta', /'iota', /'phi1', /'kappa', /'lambda', /'mu', /'nu', /'omicron', /'pi', /'theta', /'rho', /'sigma', /'tau', /'upsilon', /'omega1', /'omega', /'xi', /'psi', /'zeta', /'braceleft', /'bar', /'braceright', /'similar', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'Euro', /'Upsilon1', /'minute', /'lessequal', /'fraction', /'infinity', /'florin', /'club', /'diamond', /'heart', /'spade', /'arrowboth', /'arrowleft', /'arrowup', /'arrowright', /'arrowdown', /'degree', /'plusminus', 
/'second', /'greaterequal', /'multiply', /'proportional', /'partialdiff', /'bullet', /'divide', /'notequal', /'equivalence', /'approxequal', /'ellipsis', /'arrowvertex', /'arrowhorizex', /'carriagereturn', /'aleph', /'Ifraktur', /'Rfraktur', /'weierstrass', /'circlemultiply', /'circleplus', /'emptyset', /'intersection', /'union', /'propersuperset', /'reflexsuperset', /'notsubset', /'propersubset', /'reflexsubset', /'element', /'notelement', /'angle', /'gradient', /'registerserif', /'copyrightserif', /'trademarkserif', /'product', /'radical', /'dotmath', /'logicalnot', /'logicaland', /'logicalor', /'arrowdblboth', /'arrowdblleft', /'arrowdblup', /'arrowdblright', /'arrowdbldown', /'lozenge', /'angleleft', /'registersans', /'copyrightsans', /'trademarksans', /'summation', /'parenlefttp', /'parenleftex', /'parenleftbt', /'bracketlefttp', /'bracketleftex', /'bracketleftbt', /'bracelefttp', /'braceleftmid', /'braceleftbt', /'braceex', /'.notdef', /'angleright', /'integral', /'integraltp', /'integralex', /'integralbt', /'parenrighttp', /'parenrightex', /'parenrightbt', /'bracketrighttp', /'bracketrightex', /'bracketrightbt', /'bracerighttp', /'bracerightmid', /'bracerightbt'
]

Stacktrace:

  File "/***/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/***/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 906, in render_contents
    self.init_resources(resources)
  File "/***/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 354, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/***/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
    font = PDFTrueTypeFont(self, spec)
  File "/***/lib/python3.6/site-packages/pdfminer/pdffont.py", line 615, in __init__
    PDFSimpleFont.__init__(self, descriptor, widths, spec)
  File "/***/lib/python3.6/site-packages/pdfminer/pdffont.py", line 577, in __init__
    self.cid2unicode = EncodingDB.get_encoding(literal_name(encoding))
  File "/***/lib/python3.6/site-packages/pdfminer/encodingdb.py", line 99, in get_encoding
    cid2unicode = cls.encodings.get(name, cls.std2unicode)
TypeError: unhashable type: 'list'
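
The last frame of the traceback can be reproduced without pdfminer.six: dict.get has to hash its key before it can look it up or fall back to a default, and a list is not hashable. A minimal sketch, in which the `encodings` table, `default`, and `lookup` are illustrative stand-ins rather than pdfminer.six's real data or API:

```python
# Stand-in for the name -> mapping table consulted in get_encoding;
# the keys and values here are placeholders, not pdfminer.six data.
encodings = {"StandardEncoding": {}, "WinAnsiEncoding": {}}
default = {}


def lookup(name):
    """Look up an encoding by name, falling back to a default mapping."""
    return encodings.get(name, default)


# A string name is hashable, so the lookup (or the fallback) succeeds.
known = lookup("WinAnsiEncoding")
unknown = lookup("NoSuchEncoding")

# A list is unhashable, so dict.get raises before any fallback can apply.
try:
    lookup(["space", "exclam"])
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None
```

This is why the fallback argument passed to `get` does not help here: the exception is raised while hashing the key, before the default is ever considered.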

I understand that the root cause of the issue is an incorrectly generated encoding in the spec variable, but for our purposes simply ignoring the list value is a satisfactory, if inelegant, solution. I'll update this thread once we have some time to spend understanding the root cause.

I've included an example monkey patch in case anyone else needs to resolve the issue without much effort.

Example Monkey Patch

from pdfminer.pdffont import PDFSimpleFont

# Some PDF files have corrupted/invalid encodings, which we ignore, but they
# crash the current pdffont code, so we monkey patch it to keep things going.

original_init = PDFSimpleFont.__init__


# Monkey patch function
def simpleFontEncodingAsListIgnored__init__(self, descriptor, widths, spec):
    # Skip initialisation entirely when 'Encoding' is a list. Note that this
    # leaves the font object without the attributes __init__ normally sets.
    if 'Encoding' in spec and isinstance(spec['Encoding'], list):
        return
    original_init(self, descriptor, widths, spec)


# Replace __init__ with our patched version.
PDFSimpleFont.__init__ = simpleFontEncodingAsListIgnored__init__
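
A possible variant of this workaround, sketched under assumptions rather than tested against the affected files: instead of skipping initialisation, drop the list-valued 'Encoding' entry from a copy of spec and still call the original __init__, so the font object is fully set up. With the entry absent, the constructor in the release discussed here appears to fall back to its default encoding, which means text in the affected font may be decoded incorrectly. The helper names `without_list_encoding` and `patch_init` are hypothetical; `resolve1` is pdfminer.six's helper for following indirect references.

```python
def without_list_encoding(spec, resolve=lambda obj: obj):
    """Return spec, or a copy lacking 'Encoding' if that entry is a list.

    `resolve` maps an indirect reference to the object it points to; the
    identity default is only suitable for plain dicts (hypothetical helper).
    """
    if 'Encoding' in spec and isinstance(resolve(spec['Encoding']), list):
        # Copy rather than mutate, so the caller's spec is left intact.
        return {key: value for key, value in spec.items() if key != 'Encoding'}
    return spec


def patch_init(font_cls, resolve=lambda obj: obj):
    """Wrap font_cls.__init__ so list-valued encodings are dropped first."""
    original_init = font_cls.__init__

    def patched_init(self, descriptor, widths, spec):
        cleaned = without_list_encoding(spec, resolve)
        original_init(self, descriptor, widths, cleaned)

    font_cls.__init__ = patched_init
    return original_init  # returned so the patch can be undone


try:
    from pdfminer.pdffont import PDFSimpleFont
    from pdfminer.pdftypes import resolve1
except ImportError:
    pass  # pdfminer.six is not installed, so there is nothing to patch
else:
    patch_init(PDFSimpleFont, resolve1)
```

Resolving the entry before the isinstance check also covers the case where 'Encoding' is stored as an indirect reference to a list rather than as the list itself.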

pietermarsman commented 3 years ago

Hi @eoinof, thanks for sharing the error and a fix. The encoding variable does contain a list of character names. Could you read section 5.5.5 of the PDF Reference and see if you can figure out what type of font you have, and whether and how your PDF deviates from the reference?

If you don't have the time, can you close this issue? There is not much we can do without the actual PDF sample.
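
One way to collect the font information requested here, without sharing the file, is to list each font's Subtype together with the Python type of its Encoding entry. A rough sketch: `describe_fonts` and `describe_pdf_fonts` are hypothetical helper names, and the pdfminer.six pieces used (`PDFPage.get_pages`, `page.resources`, `resolve1`) are assumed to behave as in the release discussed in this thread.

```python
def describe_fonts(resources, resolve=lambda obj: obj):
    """Return (font_id, subtype, encoding_type) for each font in a page's resources.

    `resolve` follows indirect references; the identity default only
    works for plain dicts (hypothetical helper, not a pdfminer.six API).
    """
    rows = []
    fonts = resolve(resources.get('Font')) if resources else None
    if not isinstance(fonts, dict):
        return rows
    for font_id, ref in fonts.items():
        spec = resolve(ref)
        if not isinstance(spec, dict):
            continue
        subtype = resolve(spec.get('Subtype'))
        encoding = resolve(spec.get('Encoding'))
        # A name or a dict is expected here; a list is the case in this issue.
        rows.append((font_id, str(subtype), type(encoding).__name__))
    return rows


def describe_pdf_fonts(path):
    """Print the font summary for every page of the PDF at `path`."""
    # Imported lazily so describe_fonts stays usable without pdfminer.six.
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdftypes import resolve1

    with open(path, 'rb') as fp:
        for number, page in enumerate(PDFPage.get_pages(fp), start=1):
            for row in describe_fonts(page.resources, resolve1):
                print(number, *row)
```

Calling `describe_pdf_fonts` on an affected file should show which font subtypes carry a list-valued Encoding, which can be reported here even when the PDF itself cannot be shared.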

eoinof commented 3 years ago

Hi @pietermarsman

Yes, I'm just completing a Python 3 update on a codebase, but I plan to figure out the cause once that is done.

Eoin