mfenniak / pyPdf

Pure-Python PDF Library; this repository is no longer maintained, please see https://github.com/knowah/PyPDF2/ instead.

pyPdf.utils.PdfReadError: multiple definitions in dictionary #13

Open willfill opened 13 years ago

willfill commented 13 years ago

I have some code:

import pyPdf

def getPDFContent():
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(pathToPdf, 'rb'))
    # Iterate pages
    print pdf.documentInfo
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + " \n"
    # Collapse whitespace
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content

f = open(pathToTxt, 'w+')
f.write(getPDFContent())
f.close()

where pathToPdf and pathToTxt are absolute paths to the files. But I get this error:

Traceback (most recent call last):
  File "C:/Users/will/Desktop/coding/mytest.py", line 21, in <module>
    print pdf.getPage(14)
  File "C:\Python\lib\site-packages\pyPdf\pdf.py", line 450, in getPage
    self._flatten()
  File "C:\Python\lib\site-packages\pyPdf\pdf.py", line 607, in _flatten
    self._flatten(page.getObject(), inherit, **addt)
  File "C:\Python\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "C:\Python\lib\site-packages\pyPdf\pdf.py", line 649, in getObject
    retval = readObject(self.stream, self)
  File "C:\Python\lib\site-packages\pyPdf\generic.py", line 67, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "C:\Python\lib\site-packages\pyPdf\generic.py", line 531, in readFromStream
    value = readObject(stream, pdf)
  File "C:\Python\lib\site-packages\pyPdf\generic.py", line 67, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "C:\Python\lib\site-packages\pyPdf\generic.py", line 531, in readFromStream
    value = readObject(stream, pdf)
  File "C:\Python\lib\site-packages\pyPdf\generic.py", line 67, in readObject
    return DictionaryObject.readFromStream(stream, pdf)
  File "C:\Python\lib\site-packages\pyPdf\generic.py", line 534, in readFromStream
    raise utils.PdfReadError, "multiple definitions in dictionary"
pyPdf.utils.PdfReadError: multiple definitions in dictionary

sblzk commented 12 years ago

https://bugs.launchpad.net/pypdf/+bug/242755

mlavin commented 12 years ago

It isn't clear from the PDF spec whether duplicate keys should be allowed: see http://pdf.editme.com/pdfua-docinfodictionary and http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf (Section 10.2.1). The terminology (dictionary, key/value) seems to imply unique keys. What is clear is that some programs create documents with duplicate keys, and this issue makes those documents unreadable by PyPDF.
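
To make the trade-off concrete, here is a minimal standalone sketch of the two policies a parser could take when it meets a repeated key: fail outright (what the traceback above shows) or tolerate the duplicate and warn. It does not use or patch pyPdf. The names `build_dictionary`, `DuplicateKeyError` and the `strict` flag are hypothetical, and keeping the first value rather than the last is an assumption, since nothing in this thread settles which one should win.

```python
import warnings


class DuplicateKeyError(ValueError):
    """Hypothetical error type for this sketch; not part of pyPdf."""


def build_dictionary(pairs, strict=True):
    """Build a dict from (key, value) pairs in the order they were parsed.

    strict=True  -> raise on a repeated key (the behaviour reported here).
    strict=False -> keep the first value, emit a warning, and carry on.
    """
    result = {}
    for key, value in pairs:
        if key in result:
            if strict:
                raise DuplicateKeyError(
                    "multiple definitions in dictionary: %r" % (key,))
            # Assumption: the first definition wins; the repeat is dropped.
            warnings.warn("ignoring repeated key %r" % (key,))
            continue
        result[key] = value
    return result


# Made-up key/value pairs with one repeated key, for illustration only.
pairs = [("/Type", "/Page"), ("/Rotate", 0), ("/Rotate", 90)]

# Lenient mode: collect the warning instead of printing it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lenient = build_dictionary(pairs, strict=False)
```

With `strict=True` the same input raises `DuplicateKeyError`, which mirrors the failure in this report; the lenient path trades strictness for being able to read such files at all.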