pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

TypeError: unhashable type: 'list' when processing a PDF file #1039

Open jerryphe88 opened 1 week ago

jerryphe88 commented 1 week ago

TypeError: unhashable type: 'list' when processing a particular PDF file:

Sorry, I cannot provide the PDF file here because it is an internal document.

I debugged it live, and the call flow is shown below (other objids seem fine):

line: 384 in pdfminer/pdfinterp.py
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
stack value:
k = 'Font'
fontid = 'F220'
objid = 24
resources = {'Font': {'F151': , 'F158': , 'F165': , 'F220': , 'F222': , 'F225': , 'F229': , 'F274': , 'F296': , 'F298': , 'F318': , 'F321': , 'F363': , 'F366': , 'F373': , 'F377': , 'F378': , 'F381': , 'F97': }, 'ProcSet': [/'PDF', /'ImageB', /'ImageC', /'Text'], 'Type': /'Resources', 'XObject': {'I100': , 'I104': , 'I108': , 'I112': , 'I116': , 'I12': , 'I120': , 'I124': , 'I128': , 'I132': , 'I136': , 'I140': , 'I144': , 'I148': , 'I152': , 'I156': , 'I16': , 'I160': , 'I164': , ...}}
spec = {'BaseFont': /'3_of_9_Barcode', 'Encoding': [/'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', ...], 'FirstChar': 30, 'FontDescriptor': , 'LastChar': 255, 'Subtype': /'TrueType', 'Type': /'Font', 'Widths': [750, 750, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, ...]}

==> line: 219 in pdfminer/pdfinterp.py font = PDFTrueTypeFont(self, spec)

==> line: 992: pdfminer/pdffont.py __init__(rsrcmgr, spec)

==> line: 956: pdfminer/pdffont.py
PDFSimpleFont.__init__(self, descriptor: Mapping[str, Any], widths: FontWidthDict, spec: Mapping[str, Any])
stack value:
descriptor = {'Ascent': 750, 'CapHeight': 0, 'Descent': -12, 'Flags': 42, 'FontBBox': [0, -7, 2197, 750], 'FontFile2': , 'FontName': /'3_of_9_Barcode', 'ItalicAngle': 0, 'StemV': 0, 'Type': /'FontDescriptor'}
widths = {30: 750, 31: 750, 32: 580, 33: 580, 34: 580, 35: 580, 36: 580, 37: 580, 38: 580, 39: 580, 40: 580, 41: 580, 42: 580, 43: 580, 44: 580, 45: 580, 46: 580, 47: 580, 48: 580, 49: 580, 50: 580, 51: 580, 52: 580, 53: 580, 54: 580, 55: 580, 56: 580, 57: 580, 58: 580, 59: 580, 60: 580, 61: 580, 62: 580, 63: 580, 64: 580, 65: 580, 66: 580, 67: 580, 68: 580, 69: 580, 70: 580, 71: 580, 72: 580, 73: 580, 74: 580, 75: 580, 76: 580, 77: 580, 78: 580, 79: 580, 80: 580, 81: 580, 82: 580, 83: 580, 84: 580, 85: 580, 86: 580, 87: 580, 88: 580, ...}
spec = {'BaseFont': /'3_of_9_Barcode', 'Encoding': [/'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', ...], 'FirstChar': 30, 'FontDescriptor': , 'LastChar': 255, 'Subtype': /'TrueType', 'Type': /'Font', 'Widths': [750, 750, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, 580, ...]}

==> line: 965: pdfminer/pdffont.py stack value: encoding = [/'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'.notdef', /'space', /'exclam', /'universal', /'numbersign', /'existential', /'percent', /'ampersand', /'suchthat', /'parenleft', /'parenright', /'asteriskmath', /'plus', /'comma', /'minus', /'period', /'slash', /'zero', /'one', /'two', /'three', /'four', /'five', /'six', /'seven', /'eight', /'nine', /'colon', ...] the code failed on self.cid2unicode = EncodingDB.get_encoding(literal_name(encoding))

The stack trace is:

  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/high_level.py", line 211, in extract_pages
    interpreter.process_page(page)
  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 1014, in render_contents
    self.init_resources(resources)
  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 384, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 219, in get_font
    font = PDFTrueTypeFont(self, spec)
  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/pdffont.py", line 1010, in __init__
    data = self.fontfile.get_data()[:length1]
  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/pdffont.py", line 969, in __init__
    self.unicode_map = FileUnicodeMap()
  File "/opt/anaconda3/envs/lc-work/lib/python3.9/site-packages/pdfminer/encodingdb.py", line 113, in get_encoding
    if diff:
TypeError: unhashable type: 'list'
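
For context, this TypeError is what Python raises whenever a list is used as a dictionary key. Below is a minimal sketch, independent of pdfminer, that assumes the encoding lookup ends up as a dict lookup keyed by the encoding name. The table and function names (`ENCODINGS`, `lookup_encoding`, `try_lookup`) are hypothetical stand-ins, not pdfminer's actual code:

```python
# Hypothetical stand-in for a name -> encoding table; not pdfminer's real data.
ENCODINGS = {"StandardEncoding": {}, "WinAnsiEncoding": {}}


def lookup_encoding(name):
    """Look up an encoding by name, falling back to StandardEncoding."""
    # dict.get() hashes its key first, so a list key fails before any lookup.
    return ENCODINGS.get(name, ENCODINGS["StandardEncoding"])


def try_lookup(name):
    """Return the TypeError message raised by the lookup, or None on success."""
    try:
        lookup_encoding(name)
    except TypeError as exc:
        return str(exc)
    return None


print(try_lookup("WinAnsiEncoding"))       # a string key is hashable
print(try_lookup([".notdef", ".notdef"]))  # a list key is not
```

If that assumption holds, the failure would come from the 'Encoding' value being a list rather than from anything specific to font 'F220' itself.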

dhdaines commented 13 hours ago

Hmm. According to the PDF spec:

A Type 1 font’s built-in encoding shall be defined by an Encoding array that is part of the font program, not to be confused with the Encoding entry in the PDF font dictionary.

Either pdfminer has gotten the PDF font dictionary and the font program confused, or whatever piece of software created the PDF did that, because an Encoding entry in the font dictionary can only be a name or a dictionary, whereas a Type 1 font's Encoding array looks exactly like what you've got in the log (it's full of ".notdef"). Since the log you've provided is just reporting what's in the file itself, I'm inclined to think that it's the PDF software's fault (especially since it claims that this is a TrueType font!).

But of course pdfminer should be robust to these sorts of shenanigans. What software created the PDF?
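
One way to be robust here could look like the sketch below. It is only an illustration under stated assumptions: `classify_encoding` is a hypothetical helper, not part of pdfminer.six, and PDF names are modelled as plain `str` rather than pdfminer's literal objects. It accepts the two forms the spec allows for the font dictionary's Encoding entry (a name, or a dictionary with optional BaseEncoding and Differences) and ignores anything else so the caller can fall back to a default:

```python
from typing import Any, List, Optional, Tuple


def classify_encoding(encoding: Any) -> Tuple[Optional[str], List[Any]]:
    """Split an Encoding value into (base encoding name, differences list).

    Hypothetical helper for illustration; names are modelled as str.
    """
    # Case 1: a plain name such as "WinAnsiEncoding".
    if isinstance(encoding, str):
        return encoding, []

    # Case 2: an encoding dictionary with optional BaseEncoding / Differences.
    if isinstance(encoding, dict):
        base = encoding.get("BaseEncoding")
        diff = encoding.get("Differences", [])
        base_name = base if isinstance(base, str) else None
        differences = list(diff) if isinstance(diff, (list, tuple)) else []
        return base_name, differences

    # Case 3: anything else (e.g. the bare array from this report) is outside
    # the spec, so signal "no usable encoding" instead of raising.
    return None, []


print(classify_encoding("WinAnsiEncoding"))
print(classify_encoding({"BaseEncoding": "MacRomanEncoding",
                         "Differences": [32, "space"]}))
print(classify_encoding([".notdef", ".notdef", "space"]))
```

Whether a bare array should instead be interpreted as a Differences list is a separate design decision that this sketch does not take.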