maxpmaxp / pdfreader

Python API for PDF documents
MIT License

[SimplePDFViewer] TypeError: 'IndirectReference' object cannot be interpreted as an integer #34

Closed FedericoNembrini closed 4 years ago

FedericoNembrini commented 4 years ago

I'm trying to read the content of a PDF (link, attached_file), with this simple code:

import os
import pdfreader
from pdfreader import SimplePDFViewer

if os.path.isfile('./Test.pdf'):
    fd = open('./Test.pdf', 'rb')
    viewer = SimplePDFViewer(fd)

Error:

Traceback (most recent call last):
  File "main.py", line 7, in <module>
    viewer = SimplePDFViewer(fd)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/viewer/simple.py", line 62, in __init__
    super(TextOperatorsMixin, self).__init__(*args, **kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/viewer/pdfviewer.py", line 183, in __init__
    super(PDFViewer, self).__init__(None, Resources(), self.graphics_state_stack_class())
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/viewer/pdfviewer.py", line 39, in __init__
    self.on_document_load()
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/viewer/pdfviewer.py", line 247, in on_document_load
    self.navigate(self.current_page_number)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/viewer/pdfviewer.py", line 207, in navigate
    self.after_navigate(n)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/viewer/simple.py", line 182, in after_navigate
    super(SimplePDFViewer, self).after_navigate(n)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/viewer/pdfviewer.py", line 258, in after_navigate
    if isinstance(self.current_page.Contents, StreamBasedObject):
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/types/objects.py", line 85, in __getattr__
    return self.get(item)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/types/objects.py", line 101, in get
    val = self[item]
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/types/objects.py", line 91, in __getitem__
    obj = self.doc.build(obj, lazy=True)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/document.py", line 59, in build
    obj = self.obj_by_ref(obj)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/document.py", line 135, in obj_by_ref
    obj = self.locate_object(objref.num, objref.gen)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/document.py", line 110, in locate_object
    self.parser.indirect_object()
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/parsers/document.py", line 400, in indirect_object
    obj = super(RegistryPDFParser, self).indirect_object()
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/parsers/document.py", line 56, in indirect_object
    val = self.object()
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/parsers/base.py", line 629, in object
    val = method()
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/parsers/base.py", line 334, in dictionary_or_stream_or_hexstring
    val = self._stream(val)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/parsers/base.py", line 432, in _stream
    data = self.read(length)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/parsers/base.py", line 34, in read
    return self.buffer.read(n)
  File "/home/user/.local/lib/python3.6/site-packages/pdfreader/buffer.py", line 166, in read
    return b''.join([self.next() for _ in range(n)])
TypeError: 'IndirectReference' object cannot be interpreted as an integer

maxpmaxp commented 4 years ago

@FedericoNembrini Thank you for submitting the issue.

Stream object 5 (generation 0) has the indirect reference 6 0 R as the value of its Length key, where the parser expects an integer. See the PDF 1.7 specification, sec. 7.3.8.2.

5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
...

My Adobe Acrobat reads it correctly though.
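
To illustrate the problem, here is a minimal sketch of resolving a Length value that may be either a direct integer or an indirect reference. It is not pdfreader's implementation: the function name `resolve_stream_length` and the sample bytes are hypothetical, and the lookup naively scans the raw bytes for the referenced object, whereas a real parser would go through the cross-reference table.

```python
import re


def resolve_stream_length(pdf_bytes: bytes, length_token: bytes) -> int:
    """Return a stream length, following one indirect reference if needed.

    Hypothetical helper for illustration only.
    """
    token = length_token.strip()
    ref = re.fullmatch(rb"(\d+)\s+(\d+)\s+R", token)
    if ref is None:
        # Direct value, e.g. b"42".
        return int(token)

    num, gen = ref.groups()
    # Naive lookup: find "<num> <gen> obj <integer> endobj" in the raw bytes.
    pattern = rb"(?<!\d)" + num + rb"\s+" + gen + rb"\s+obj\s+(\d+)\s+endobj"
    obj = re.search(pattern, pdf_bytes)
    if obj is None:
        raise ValueError("referenced Length object not found")
    return int(obj.group(1))


# Made-up document fragment shaped like the object quoted above.
sample = (
    b"5 0 obj\n<</Length 6 0 R/Filter /FlateDecode>>\nstream\n"
    b"0123456789\nendstream\nendobj\n"
    b"6 0 obj\n10\nendobj\n"
)

print(resolve_stream_length(sample, b"6 0 R"))
print(resolve_stream_length(sample, b"42"))
```

The point is only that the value behind Length has to be dereferenced before it is passed to anything that needs an integer, such as `range(n)` in the traceback.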

This definitely should be fixed asap. Will keep you in the loop.

maxpmaxp commented 4 years ago

@FedericoNembrini Here is a good discussion of the topic:

https://stackoverflow.com/questions/50325459/how-to-parse-a-binary-pdf-stream-of-unknown-length/50334477#50334477
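
When the length cannot be determined up front, one common fallback is to scan forward for the `endstream` keyword. The sketch below assumes that heuristic. It is not taken from the linked answer or from pdfreader, and the function name `read_stream_by_scanning` is hypothetical. The heuristic is fragile: binary stream data may itself contain the bytes `endstream`, and stripping the trailing end-of-line marker can remove a real data byte.

```python
def read_stream_by_scanning(pdf_bytes: bytes, start: int = 0) -> bytes:
    """Return stream data by scanning for 'endstream' (heuristic fallback).

    Hypothetical helper for illustration only.
    """
    keyword = b"stream"
    data_start = pdf_bytes.index(keyword, start) + len(keyword)

    # The 'stream' keyword is followed by CRLF or a single LF.
    if pdf_bytes[data_start:data_start + 2] == b"\r\n":
        data_start += 2
    elif pdf_bytes[data_start:data_start + 1] == b"\n":
        data_start += 1

    data_end = pdf_bytes.index(b"endstream", data_start)
    data = pdf_bytes[data_start:data_end]

    # Drop the end-of-line marker that usually precedes 'endstream'.
    if data.endswith(b"\r\n"):
        data = data[:-2]
    elif data.endswith((b"\n", b"\r")):
        data = data[:-1]
    return data


# Made-up fragment; the payload is plain text to keep the example readable.
fragment = b"5 0 obj\n<</Length 6 0 R>>\nstream\nHello, PDF\nendstream\nendobj\n"
print(read_stream_by_scanning(fragment))
```

Resolving the Length reference is preferable whenever the referenced object can be located; scanning is best kept as a last resort for damaged or unusual files.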

maxpmaxp commented 4 years ago

@FedericoNembrini Fixed. Check out the master branch or wait for the upcoming v0.1.4.

FedericoNembrini commented 4 years ago

Version 0.1.4 works without problems.

Thank you!