py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

PyPDF2 warning in PyPDF2 1.19 #36

Closed SharmileeS closed 10 years ago

SharmileeS commented 11 years ago

PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will not be corrected. [pdf.py:1130]

mstamy2 commented 10 years ago

Hello. This warning means that the first section of the xref table does not begin with object zero, which may indicate an error made when the PDF was written. If strict=False, PyPDF2 will try to correct the object ID numbers; if strict=True, they will not be corrected. Usually this is not fatal, and Adobe can read both the PDF and the PyPDF2 output. Were you having an issue with this warning, or did you just want to know what it meant?
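
To make "does not begin with object zero" concrete: in a classic cross-reference table, the `xref` keyword is followed by a subsection header of two integers, the first object number and the entry count. The following is a minimal standard-library sketch of inspecting that header. The helper names `first_xref_subsection` and `is_zero_indexed` are hypothetical, the sample bytes are made up, and this is not how PyPDF2 itself parses the table.

```python
import re

# "xref" keyword followed by "<first object number> <entry count>".
# The lookbehind skips the "xref" that is part of the "startxref" keyword.
_XREF_HEADER = re.compile(rb"(?<![a-z])xref\s+(\d+)\s+(\d+)")


def first_xref_subsection(data: bytes):
    """Return (first_object_number, entry_count) for the first classic
    xref table found in the raw bytes, or None if there is none
    (for example, when the file uses cross-reference streams)."""
    match = _XREF_HEADER.search(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_zero_indexed(data: bytes):
    """True/False if a table was found, None if no table was found."""
    header = first_xref_subsection(data)
    return None if header is None else header[0] == 0


# Made-up fragment of a PDF, just enough to show the table layout.
SAMPLE = (
    b"%PDF-1.4\n"
    b"xref\n"
    b"0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000058 00000 n \n"
    b"trailer\n<< /Size 3 >>\n"
    b"startxref\n9\n%%EOF\n"
)

print(first_xref_subsection(SAMPLE), is_zero_indexed(SAMPLE))
```

On a real file you would pass `open(path, "rb").read()`; a header whose first number is not 0 is the situation the warning describes.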

SharmileeS commented 10 years ago

I just wanted to know what it meant and whether anything can be done to avoid this warning. Thanks a lot.

nagordon commented 10 years ago

I am receiving the same error using the PdfFileMerger() tool. It appears to be only some pdfs. Any thoughts?

mstamy2 commented 10 years ago

Some improvements were recently added to the algorithm for reading Xrefs (a day or so ago), but if you're still getting this warning, it's generally not an issue. Your output PDFs should still appear as expected - if they don't, then that is an issue.

Usually if non zero-indexed Xrefs actually pose a problem, you will get an actual exception. If all you get is a warning, you likely have nothing to worry about.
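
If the warning is harmless for your files and you only want it out of the console, one option is to filter it. This sketch assumes the warning is emitted through Python's standard `warnings` module. `PdfReadWarning` and `read_pdf_stub` below are stand-ins defined only so the example runs without PyPDF2 installed; in real code you would use the library's own warning class and wrap your actual reader call.

```python
import warnings


class PdfReadWarning(UserWarning):
    """Stand-in for the library's warning class (assumption for this sketch)."""


def read_pdf_stub():
    # Hypothetical placeholder for the call that opens the PDF
    # and triggers the warning.
    warnings.warn(
        "Xref table not zero-indexed. ID numbers for objects will be corrected.",
        PdfReadWarning,
    )
    return "reader"


def read_quietly():
    # The filter only applies inside this block; the previous
    # warning configuration is restored on exit.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="Xref table not zero-indexed",  # matched at the start of the text
            category=PdfReadWarning,
        )
        return read_pdf_stub()
```

Filtering by message keeps other warnings of the same category visible, which matters because some of them point at real problems.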

nagordon commented 10 years ago

The update you referred to in pdf.py worked! Thanks for a handy tool!

shanavas786 commented 7 years ago

Disabling strict mode did not help me.

Repairing the document with Ghostscript did the job.
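
The exact command used above is not given. A common way to have Ghostscript rewrite a PDF (and with it the xref table) is to pass the file through the `pdfwrite` device. The sketch below wraps that invocation; `ghostscript_repair_command` and `repair_pdf` are hypothetical helper names, and the binary name `gs` is an assumption (on Windows it is typically `gswin64c`).

```python
import subprocess


def ghostscript_repair_command(src, dst, gs_binary="gs"):
    """Build the argument list for rewriting `src` into `dst`."""
    # -o sets the output file and runs non-interactively;
    # -sDEVICE=pdfwrite makes Ghostscript emit a fresh PDF.
    return [gs_binary, "-o", dst, "-sDEVICE=pdfwrite", src]


def repair_pdf(src, dst, gs_binary="gs"):
    """Run Ghostscript; raises CalledProcessError if it fails."""
    subprocess.run(ghostscript_repair_command(src, dst, gs_binary), check=True)
```

The rewritten file can then be opened with PyPDF2 as usual. Note that rewriting may change or drop some features of the original document, so keep the original file.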

LoginWithGithub commented 6 years ago

You said "Usually this is not fatal", but it seems that something is wrong...

C:\Users\Helloworld\Desktop\pdf>ipython
Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:57:36) [MSC v.1900 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import PyPDF2

In [2]: pdf_fileobj = open("afile.pdf", "rb")

In [3]: pdf_reader = PyPDF2.PdfFileReader(pdf_fileobj)
PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]

In [4]: pdf_reader.numPages
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
<ipython-input-4-29a8f66e8824> in <module>()
----> 1 pdf_reader.numPages

c:\program files\python36\lib\site-packages\PyPDF2\pdf.py in <lambda>(self)
   1156             return len(self.flattenedPages)
   1157
-> 1158     numPages = property(lambda self: self.getNumPages(), None, None)
   1159     """
   1160     Read-only property that accesses the

c:\program files\python36\lib\site-packages\PyPDF2\pdf.py in getNumPages(self)
   1153         else:
   1154             if self.flattenedPages == None:
-> 1155                 self._flatten()
   1156             return len(self.flattenedPages)
   1157

c:\program files\python36\lib\site-packages\PyPDF2\pdf.py in _flatten(self, pages, inherit, indirectRef)
   1503         if pages == None:
   1504             self.flattenedPages = []
-> 1505             catalog = self.trailer["/Root"].getObject()
   1506             pages = catalog["/Pages"].getObject()
   1507

c:\program files\python36\lib\site-packages\PyPDF2\generic.py in __getitem__(self, key)
    514
    515     def __getitem__(self, key):
--> 516         return dict.__getitem__(self, key).getObject()
    517
    518     ##

c:\program files\python36\lib\site-packages\PyPDF2\generic.py in getObject(self)
    176
    177     def getObject(self):
--> 178         return self.pdf.getObject(self).getObject()
    179
    180     def __repr__(self):

c:\program files\python36\lib\site-packages\PyPDF2\pdf.py in getObject(self, indirectReference)
   1602                 if self.strict:
   1603                     raise utils.PdfReadError("Expected object ID (%d %d) does not match actual (%d %d); xref table not zero-indexed." \
-> 1604                                      % (indirectReference.idnum, indirectReference.generation, idnum, generation))
   1605                 else: pass # xref table is corrected in non-strict mode
   1606             elif idnum != indirectReference.idnum:

PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

In [5]: pdf_reader.getNumPages()
Out[5]: 0

Actually, "afile.pdf" has 2 pages. I don't know why PdfReadError is raised.

afile.pdf
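
The traceback above shows the error being raised inside the `if self.strict:` branch, with the expected ID (8) one higher than the ID actually found (7). That pattern is what you would see if every key in the table were shifted by the table's starting index. The sketch below is a simplified model of that situation, not PyPDF2's code; all names and the offsets are made up for illustration.

```python
class IdMismatch(Exception):
    """Stand-in for PdfReadError so the sketch has no dependencies."""


def shift_xref_ids(xref, first_id):
    """Re-key an {object_id: byte_offset} table that starts at
    `first_id` so that it starts at 0 instead."""
    return {obj_id - first_id: offset for obj_id, offset in xref.items()}


def resolve(xref, headers, wanted_id, strict):
    """Look up `wanted_id` in the table and compare it with the ID
    written in the object header at that offset.
    `headers` maps byte offset -> object ID found there."""
    actual_id = headers[xref[wanted_id]]
    if actual_id != wanted_id and strict:
        raise IdMismatch(
            f"Expected object ID {wanted_id} does not match actual {actual_id}"
        )
    return actual_id


# Object headers as written in the (imaginary) file.
headers = {700: 7, 800: 8}
# Table as read from a section that starts at 1 instead of 0.
raw_xref = {8: 700, 9: 800}
# Same table after shifting the keys back by the starting index.
fixed_xref = shift_xref_ids(raw_xref, first_id=1)
```

With `raw_xref`, `resolve(raw_xref, headers, 8, strict=True)` raises because offset 700 holds object 7; with `fixed_xref`, the same lookup lands on object 8.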

bbg221 commented 4 years ago

I ran into this issue and solved it with pdfin = PdfFileReader(open('yqsq.pdf', 'rb'), strict=False). I also found that the issue appears when the PDF file is renamed manually.

prakher1992 commented 4 years ago

I have set strict=False, as in pdfin = PdfFileReader(open('yqsq.pdf', 'rb'), strict=False), but I am still not getting the file content.

JON994 commented 4 years ago

PdfReadWarning: Superfluous whitespace found in object header b'12' b'0' [pdf.py:1665] Has anyone solved this error?

Florakirie commented 2 years ago

I solved the same issue with PdfFileReader(open('yqsq.pdf', 'rb'), strict=False) and by removing the space from my PDF file name.

pubpub-zz commented 2 years ago

@Florakirie, prefer PdfReader over PdfFileReader, which is deprecated. PdfReader uses strict=False by default.

MartinThoma commented 2 years ago

I recommend using PdfReader("yqsq.pdf"). It's simpler to read, and you don't leave dangling open file handles.
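
Putting the last two suggestions together, a small page-counting helper could look like the sketch below. It assumes the `pypdf` package is installed for the default path. The function name `page_count` and the `reader_cls` parameter are hypothetical; the parameter exists only so the helper can be exercised with a stand-in class when `pypdf` is not available.

```python
def page_count(path, reader_cls=None):
    """Return the number of pages in the PDF at `path`."""
    if reader_cls is None:
        # Imported lazily so this module loads even without pypdf.
        from pypdf import PdfReader as reader_cls
    # Passing a path (not an open file object) lets the reader manage
    # the file itself; strict=False is spelled out here for clarity.
    reader = reader_cls(path, strict=False)
    return len(reader.pages)
```

Typical use is `page_count("yqsq.pdf")`. If the count is still wrong with a non-strict reader, the file itself probably needs repairing first.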