jcushman / pdfquery

A fast and friendly PDF scraping library.
MIT License

'PDFObjRef' object does not support indexing #58

Open travis-st opened 7 years ago

travis-st commented 7 years ago

import pdfquery
import sys

pdf = pdfquery.PDFQuery(sys.argv[1])
pdf.load()

Traceback (most recent call last):
  File "bin/parse_pdf.py", line 6, in <module>
    pdf.load()
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 487, in get_tree
    for n, page in pages:
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 608, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 603, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 647, in _add_annots
    annot = self._set_hwxy_attrs(annot)
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 665, in _set_hwxy_attrs
    attr['x0'] = bbox[0]
TypeError: 'PDFObjRef' object does not support indexing
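
The last frame suggests that the annotation's bounding box arrived as an indirect reference (a `PDFObjRef`) rather than a plain list, so `bbox[0]` fails. Below is a minimal sketch of the general workaround: resolve the reference before indexing. It assumes the object exposes a `resolve()` method, as pdfminer's `PDFObjRef` does. `FakeObjRef`, `resolve_all`, and `resolve_bbox` are hypothetical names used only for illustration and are not part of pdfquery. The sketch has not been run against the PDF from this report.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        # Return the object this reference points to.
        return self._target


def resolve_all(obj):
    """Follow indirect references until a non-reference object is reached."""
    while hasattr(obj, "resolve"):
        obj = obj.resolve()
    return obj


def resolve_bbox(bbox):
    """Resolve the box itself, then each coordinate, into a plain list."""
    return [resolve_all(value) for value in resolve_all(bbox)]


# The whole rectangle is stored behind a reference, as in the traceback.
rect = FakeObjRef([36.0, 700.0, 200.0, 720.0])
print(resolve_bbox(rect))  # [36.0, 700.0, 200.0, 720.0]
```

With real pdfminer objects, `pdfminer.pdftypes.resolve1` plays a similar role for a single level of indirection.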

jcushman commented 7 years ago

Hi! I can't really debug this without the PDF that's causing a problem for you -- can you share it?

travis-st commented 7 years ago

I would love to, but it's proprietary and confidential. Sorry :(

travis-st commented 7 years ago

FYI, experienced a different problem this time:

>>> pdf = pdfquery.PDFQuery("input/2015/12-Dec/17-Dec/17-12.pdf")
>>> pdf.load()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 486, in get_tree
    pages = enumerate(self.get_layouts())
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 608, in get_layouts
    return (self.get_layout(page) for page in self._cached_pages())
  File "/usr/local/lib/python2.7/site-packages/pdfquery/pdfquery.py", line 636, in _cached_pages
    self._pages += list(self._pages_iter)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfpage.py", line 100, in create_pages
    yield klass(document, objid, tree)
  File "/usr/local/lib/python2.7/site-packages/pdfminer/pdfpage.py", line 53, in __init__
    self.mediabox = resolve1(self.attrs['MediaBox'])
KeyError: 'MediaBox'
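
The `KeyError` indicates that the page dictionary pdfminer received had no `MediaBox` entry. In the PDF format, `MediaBox` is an inheritable attribute: it may be defined on an ancestor node of the page tree instead of on the page itself. The sketch below illustrates that lookup with plain dicts standing in for PDF dictionaries. `find_inherited` and the sample data are hypothetical, and whether this PDF omits `MediaBox` entirely or defines it somewhere unexpected cannot be determined from the traceback alone.

```python
def find_inherited(node, key, default=None):
    """Return node[key], searching ancestors through 'Parent' if needed."""
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))  # guard against cyclic Parent links
        if key in node:
            return node[key]
        node = node.get("Parent")
    return default


# A page that inherits its MediaBox from the parent Pages node.
root = {"Type": "Pages", "MediaBox": [0, 0, 612, 792]}
page = {"Type": "Page", "Parent": root}

# A page with no MediaBox anywhere in its ancestry.
orphan = {"Type": "Page"}

print(find_inherited(page, "MediaBox"))    # [0, 0, 612, 792]
print(find_inherited(orphan, "MediaBox"))  # None
```

Returning a default instead of raising lets the caller decide whether to skip the page or substitute a page size.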

acmisiti commented 6 years ago

Any update on the "'PDFObjRef' object does not support indexing" issue?

NickHeiner commented 6 years ago

I'm seeing the same issue and, unfortunately, can't share the PDF either.

kravchenkog commented 5 years ago

I have a similar problem.

>>> pdf.load()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/anaconda3/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 487, in get_tree
    for n, page in pages:
  File "/anaconda3/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 608, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/anaconda3/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 601, in get_layout
    self.interpreter.process_page(page)
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 852, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 864, in render_contents
    self.execute(list_value(streams))
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 888, in execute
    func(*args)
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 772, in do_TJ
    self.device.render_string(self.textstate, seq, self.ncs, self.graphicstate.copy())
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 87, in render_string
    scaling, charspace, wordspace, rise, dxscale, ncs, graphicstate)
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/pdfdevice.py", line 105, in render_string_horizontal
    ncs, graphicstate)
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/converter.py", line 121, in render_char
    textwidth = font.char_width(cid)
  File "/anaconda3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 525, in char_width
    return self.widths[cid] * self.hscale
TypeError: unsupported operand type(s) for *: 'PDFObjRef' and 'float'

NickB23 commented 5 years ago

Here too. Same problem:


  File "pdfqueryparser.py", line 4, in <module>
    pdf.load()
  File "/usr/local/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/usr/local/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 487, in get_tree
    for n, page in pages:
  File "/usr/local/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 608, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/usr/local/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 603, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "/usr/local/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 647, in _add_annots
    annot = self._set_hwxy_attrs(annot)
  File "/usr/local/lib/python3.7/site-packages/pdfquery/pdfquery.py", line 665, in _set_hwxy_attrs
    attr['x0'] = bbox[0]
TypeError: 'PDFObjRef' object does not support indexing