py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Can't getData() from /Contents List #72

Closed ottothecow closed 2 years ago

ottothecow commented 10 years ago

I'm trying to dig deep into some PDFs by calling getData directly on part of a page (I am then parsing that data to find coordinates for a bit of text).

This worked for me in the past with essentially:

page = PdfFileReader(inpdf).getPage(0)
text = page.getContents().getData()   #<-- or page["/Contents"].getData()

but with my new PDFs, I am getting an error like this: "AttributeError: 'ArrayObject' object has no attribute 'getData'"

Digging in, it looks like my old PDF was structured like this (print page) with a single IndirectObject in the contents.

{'/Contents': IndirectObject(14, 0),
 '/MediaBox': [0, 0, 662.40000, 792],
 '/Parent': IndirectObject(1, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(10, 0),
                          '/F4': IndirectObject(7, 0),
                          '/F5': IndirectObject(4, 0)},
                '/ProcSet': IndirectObject(13, 0),
                '/XObject': {}},
 '/Type': '/Page'}

Then page.getContents() returns:

{'/Filter': '/FlateDecode'}

while my new PDF is structured like this with a list of IndirectObjects in the contents:

{'/Contents': [IndirectObject(11, 0),
               IndirectObject(12, 0),
               IndirectObject(13, 0),
               IndirectObject(14, 0),
               IndirectObject(15, 0),
               IndirectObject(16, 0),
               IndirectObject(17, 0),
               IndirectObject(18, 0)],
 '/CropBox': [0, 0, 612, 792],
 '/MediaBox': [0, 0, 612, 792],
 '/Parent': IndirectObject(5, 0),
 '/Resources': {'/Font': {'/F3': IndirectObject(24, 0),
                          '/F4': IndirectObject(26, 0),
                          '/F6': IndirectObject(29, 0),
                          '/F7': IndirectObject(30, 0)},
                '/ProcSet': IndirectObject(31, 0),
                '/XObject': {}},
 '/Rotate': 0,
 '/Type': '/Page'}

then page.getContents() returns:

[IndirectObject(11, 0),
 IndirectObject(12, 0),
 IndirectObject(13, 0),
 IndirectObject(14, 0),
 IndirectObject(15, 0),
 IndirectObject(16, 0),
 IndirectObject(17, 0),
 IndirectObject(18, 0)]

How do I get at the underlying data of /Contents? Going after the pieces of the list with page.getContents()[0] just returns the name of the object, and I can't use getData() on that. I can't tell whether this is a bug (caused by having a list as the contents) or whether I am missing some feature.

mstamy2 commented 10 years ago

Hello, it's certainly possible that getContents() should be revised if /Contents for a page points to more than one indirect object. Would it be possible to provide a PDF that exhibits this behavior? I understand if any instances you have are confidential; if that's the case, then I can work without it.

ottothecow commented 10 years ago

Unfortunately, I can't share these PDFs. I'm not sure what created them either.

I will see if I can't find or create a PDF that exhibits the same behavior. The structure of the PDF seems to be multiple separate tables per page, and these separate tables are getting their own indirect object. This is partly why I am trying to use getData--I need to figure out which page contains a certain table (they can vary in length and thus pagination is not constant) and then figure out where on the page it is located.

vdavez commented 9 years ago

I actually had this same error on this document: http://www.supremecourt.gov/opinions/14pdf/14-7955_aplc.pdf Haven't spelunked to find out what's going on, but thought I'd share.

wgwei commented 8 years ago

In your new PDF file, the

{'/Contents': IndirectObject(14, 0), ...}

has changed to:

{'/Contents': [IndirectObject(11, 0), IndirectObject(12, 0), ...], ...}

so you will have to change the call to:

page = PdfReader(inpdf).pages[0]
text = page.getContents()[n].getObject().getData()  # n is the index of the entry in the /Contents list; getObject() resolves the IndirectObject to its stream
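
To read every stream rather than a single index, the two /Contents shapes discussed in this thread can be handled in one helper. The following is only a sketch under assumptions: `read_contents_data`, `FakeStream` and `FakeIndirect` are hypothetical names introduced here, and the stand-in classes merely imitate the legacy `getObject()` / `getData()` methods so the example runs without a PDF. Joining the pieces with a newline is also an assumption, based on the PDF rule that an array of content streams is treated as one concatenated stream.

```python
def read_contents_data(contents):
    """Return the combined data of a /Contents entry (hypothetical helper).

    `contents` may be a single stream, a list of references, or a
    reference to either; only getObject()/getData() are relied upon.
    """
    # Resolve the entry itself if it is an indirect reference.
    obj = contents.getObject() if hasattr(contents, "getObject") else contents
    # A single stream is treated as a one-element list.
    parts = obj if isinstance(obj, list) else [obj]
    chunks = []
    for part in parts:
        # Each list element is usually an indirect reference: resolve it first.
        stream = part.getObject() if hasattr(part, "getObject") else part
        chunks.append(stream.getData())
    # Assumption: a newline keeps tokens from adjacent streams apart.
    return b"\n".join(chunks)


# Stand-ins for demonstration only; these are not library classes.
class FakeStream:
    def __init__(self, data):
        self._data = data

    def getObject(self):
        return self

    def getData(self):
        return self._data


class FakeIndirect:
    def __init__(self, target):
        self._target = target

    def getObject(self):
        return self._target


single = FakeStream(b"BT (old) Tj ET")
several = [FakeIndirect(FakeStream(b"BT")), FakeIndirect(FakeStream(b"ET"))]
single_data = read_contents_data(single)
several_data = read_contents_data(several)
```

With a real page from the legacy API, the call would presumably be `read_contents_data(page["/Contents"])`, but that has not been checked against the PDFs in this issue.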

puneetsinha commented 7 years ago

I am facing the same issue... seeking help

~\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\PyPDF2\pdf.py in extractText(self)
   2655         """
   2656         text = u_("")
-> 2657         content = self["/Contents"].getObject()
   2658         if not isinstance(content, ContentStream):
   2659             content = ContentStream(content, self.pdf)

~\AppData\Local\Continuum\Anaconda3\envs\tensorflow\lib\site-packages\PyPDF2\generic.py in __getitem__(self, key)
    516 
    517     def __getitem__(self, key):
--> 518         return dict.__getitem__(self, key).getObject()
    519 
    520     ##

KeyError: '/Contents'
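
The KeyError above points to a different situation from the list case: the page dictionary has no /Contents entry at all, which the PDF format allows (such a page is simply empty). A minimal sketch for locating such pages before extracting text is given below; `pages_without_contents` is a hypothetical helper, and plain dicts stand in for page objects, on the assumption that real page objects support the same `in` test.

```python
def pages_without_contents(pages):
    """Return the indices of pages whose dictionary lacks '/Contents'.

    Hypothetical helper: `pages` is any iterable of dict-like page objects.
    """
    return [index for index, page in enumerate(pages) if "/Contents" not in page]


# Plain dicts used as stand-ins for page objects (illustration only).
sample_pages = [
    {"/Type": "/Page", "/Contents": "stream placeholder"},
    {"/Type": "/Page"},  # no /Contents: an empty page
    {"/Type": "/Page", "/Contents": ["ref placeholder"]},
]
empty_indices = pages_without_contents(sample_pages)
```

Pages reported by such a check could then be skipped, or treated as having empty text, instead of being passed to extractText().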