Open caolf opened 1 year ago
@jsvine looking forward to your help! Thanks
Hi @caolf
Just thought I'd add some info:
This seems to have come up before #316
Although it seems in this case, the exception is coming from the underlying pdfminer library:
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdftypes.py:396:
https://github.com/pdfminer/pdfminer.six/issues/495 seems to be the same bug.
Thanks for reporting, @caolf. Could you please share the PDF that has the issue too?
@samkit-jain I'm sorry, this is an internal document and cannot be made public!
@caolf Okay, see if you can redact the sensitive information so that the file can be attached here. Without it, it will be a bit difficult to properly debug and fix this (if it is a pdfplumber issue).
Hi @samkit-jain
There is an example PDF from https://github.com/pdfminer/pdfminer.six/issues/495#issuecomment-1594322455 which raises the same exception if you're interested:
https://github.com/pdfminer/pdfminer.six/files/11768084/pdfminer_testpart.pdf
I don't really know anything about PDF internals, but the issue seems to be that a PDFObjRef object is ending up in DecodeParms when it shouldn't?
https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/pdfparser.py#L88
dic={'Length': 4065, 'Length1': 8964, 'Filter': /'FlateDecode', 'DecodeParms': <PDFObjRef:49>}
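To illustrate why an unresolved reference in DecodeParms would produce this error, here is a minimal sketch. It does not use pdfminer itself: the PDFObjRef class and resolve1 helper below are simplified stand-ins written for this example (the names mirror pdfminer's, but the constructor and behaviour are assumptions), and the contents of the referenced object are made up.

```python
class PDFObjRef:
    """Hypothetical stand-in for an indirect object reference (not pdfminer's class)."""

    def __init__(self, objid, target):
        self.objid = objid
        self._target = target  # the object this reference points at

    def resolve(self):
        return self._target


def resolve1(obj):
    """Follow references until a concrete object is reached (simplified stand-in)."""
    while isinstance(obj, PDFObjRef):
        obj = obj.resolve()
    return obj


# Stream attributes shaped like the dict above; the referenced value is invented.
attrs = {
    "Length": 4065,
    "Length1": 8964,
    "Filter": "FlateDecode",
    "DecodeParms": PDFObjRef(49, {"Predictor": 1}),
}

params = attrs["DecodeParms"]

# A membership test on the unresolved reference fails, because the stand-in
# defines neither __contains__ nor __iter__.
try:
    "Predictor" in params
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Resolving the reference first yields a plain dict, so the same test works.
resolved = resolve1(params)
has_predictor = "Predictor" in resolved

print(error_message)
print(has_predictor)
```

If this is what is happening, resolving the DecodeParms value before inspecting it would avoid the exception, but whether that is the right fix for pdfminer is an open question here.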
Thanks for the PDF, @cmdlineluser. I'll see if there's something that we can do.
Describe the bug
TypeError: argument of type 'PDFObjRef' is not iterable is raised when calling extract_tables(table_settings=table_settings) on page 3, while pages 1 and 2 work fine.
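Since only one page fails, a small helper like the following could be used to find out which pages raise the error while still collecting results from the others. This is only a sketch: extract_per_page and fake_extract are hypothetical names made up for this example, and the pages here are plain strings. With pdfplumber, the extract callable would presumably be something like lambda page: page.extract_tables(table_settings=table_settings).

```python
def extract_per_page(pages, extract):
    """Run extract(page) on each page, recording TypeErrors instead of stopping.

    Returns (results, failures), both keyed by 1-based page number.
    """
    results = {}
    failures = {}
    for number, page in enumerate(pages, start=1):
        try:
            results[number] = extract(page)
        except TypeError as exc:
            failures[number] = str(exc)
    return results, failures


def fake_extract(page):
    """Hypothetical extractor that fails on the third page, as in the report."""
    if page == "page-3":
        raise TypeError("argument of type 'PDFObjRef' is not iterable")
    return [[page]]


results, failures = extract_per_page(["page-1", "page-2", "page-3"], fake_extract)
print(sorted(results))
print(failures)
```

This does not fix the underlying problem. It only narrows down which pages are affected.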
Code to reproduce the problem
PDF file
Please attach any PDFs necessary to reproduce the problem.
If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.
Screenshots
pdfplumberlib.py:293:
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:300: in extract_tables
    tables = self.find_tables(tset)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:294: in find_tables
    return TableFinder(self, tset).tables
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/table.py:570: in __init__
    self.edges = self.get_edges()
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/table.py:600: in get_edges
    words = self.page.extract_words(**(settings.text_settings or {}))
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:356: in extract_words
    return utils.extract_words(self.chars, **kwargs)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/container.py:50: in chars
    return self.objects.get("char", [])
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:215: in objects
    self._objects: Dict[str, T_obj_list] = self.parse_objects()
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:275: in parse_objects
    for obj in self.iter_layout_objects(self.layout._objs):
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfplumber/page.py:161: in layout
    interpreter.process_page(self.page_obj)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:997: in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:1014: in render_contents
    self.init_resources(resources)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:384: in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:234: in get_font
    font = self.get_font(None, subspec)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdfinterp.py:225: in get_font
    font = PDFCIDFont(self, spec)
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdffont.py:1072: in __init__
    ttf = TrueTypeFont(self.basefont, BytesIO(self.fontfile.get_data()))
/Users/caolf/Library/Caches/pypoetry/virtualenvs/python-demo-8KG8_SfQ-py3.11/lib/python3.11/site-packages/pdfminer/pdftypes.py:396: in get_data
    self.decode()
self = <PDFStream(119): raw=64251, {'Length': 64251, 'Filter': /'FlateDecode', 'DecodeParms':, 'Length1': 214528}>
Environment
looking forward to your help! Thanks