jcushman / pdfquery

A fast and friendly PDF scraping library.
MIT License

TypeError: 'PDFObjRef' object is not subscriptable #92

Open sgpinkus opened 5 months ago

sgpinkus commented 5 months ago

I get "TypeError: 'PDFObjRef' object is not subscriptable" when loading some PDFs downloaded from the Internet.

Test script:

import requests
import pdfquery
res = requests.get('https://www.aph.gov.au/Senators_and_Members/Members/Register/-/media/03_Senators_and_Members/32_Members/Register/47P/AB/Albanese_47P.pdf?la=en&hash=E76C6FAA27171CFB2A95FC26EA0A1E45084F69C1')
with open('test.pdf', 'wb') as f: f.write(res.content)
pdf = pdfquery.PDFQuery('test.pdf')
pdf.load()

Gives:

Traceback (most recent call last):
  File "/tmp/test2.py", line 6, in <module>
    pdf.load()
  File "/tmp/venv/lib/python3.9/site-packages/pdfquery/pdfquery.py", line 385, in load
    self.tree = self.get_tree(*_flatten(page_numbers))
  File "/tmp/venv/lib/python3.9/site-packages/pdfquery/pdfquery.py", line 487, in get_tree
    for n, page in pages:
  File "/tmp/venv/lib/python3.9/site-packages/pdfquery/pdfquery.py", line 608, in <genexpr>
    return (self.get_layout(page) for page in self._cached_pages())
  File "/tmp/venv/lib/python3.9/site-packages/pdfquery/pdfquery.py", line 603, in get_layout
    layout = self._add_annots(layout, page.annots)
  File "/tmp/venv/lib/python3.9/site-packages/pdfquery/pdfquery.py", line 647, in _add_annots
    annot = self._set_hwxy_attrs(annot)
  File "/tmp/venv/lib/python3.9/site-packages/pdfquery/pdfquery.py", line 665, in _set_hwxy_attrs
    attr['x0'] = bbox[0]
TypeError: 'PDFObjRef' object is not subscriptable

The same test.pdf opens without problems in multiple PDF viewers installed on my system.