maxpmaxp / pdfreader

Python API for PDF documents
MIT License

Creating SimplePDFViewer fails #56

Closed. dwadler closed this issue 3 years ago

dwadler commented 3 years ago

I'm getting the following error when trying to process a 2.6 MB PDF document of tax assessment data. I saw similar but different errors when using PDFDocument with page_one = next(doc.pages()).

F:\riverby>python
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> fd = open('assessment.pdf','rb')
>>> import pdfreader
>>> from pdfreader import SimplePDFViewer
>>> viewer = SimplePDFViewer(fd)
ERROR:root:Skipping broken stream
Traceback (most recent call last):
  File "d:\python37\lib\site-packages\pdfreader\filters\flate.py", line 20, in decode
    data = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
ERROR:root:!!!Failed to locate 6486 0: assuming null
Traceback (most recent call last):
  File "d:\python37\lib\site-packages\pdfreader\parsers\document.py", line 458, in locateobject = self.next_brute_force_object()
  File "d:\python37\lib\site-packages\pdfreader\parsers\document.py", line 488, in next_brute_force_object
    obj = self.body_element() # can be either indirect object, startxref or trailer
  File "d:\python37\lib\site-packages\pdfreader\parsers\document.py", line 261, in body_element
    obj = self.indirect_object()
  File "d:\python37\lib\site-packages\pdfreader\parsers\document.py", line 522, in indirect_object
    self.on_parsed_indirect_object(obj)
  File "d:\python37\lib\site-packages\pdfreader\parsers\document.py", line 406, in on_parsed_indirect_object
    self.registry.register(obj)
  File "d:\python37\lib\site-packages\pdfreader\registry.py", line 31, in register
    self.register_object_stream(obj.val)
  File "d:\python37\lib\site-packages\pdfreader\registry.py", line 43, in register_object_stream
    for obj in parser.objects(objstm["First"], objstm["N"]):
  File "d:\python37\lib\site-packages\pdfreader\parsers\objstm.py", line 11, in objects
    integers.append(self.non_negative_int())
  File "d:\python37\lib\site-packages\pdfreader\parsers\base.py", line 268, in non_negative_int
    n = self.numeric()
  File "d:\python37\lib\site-packages\pdfreader\parsers\base.py", line 250, in numeric
    self.on_parser_error("Invalid numeric token")
  File "d:\python37\lib\site-packages\pdfreader\parsers\base.py", line 48, in on_parser_error
    raise self.exception_class(message)
pdfreader.exceptions.ParserException: Invalid numeric token
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "d:\python37\lib\site-packages\pdfreader\viewer\simple.py", line 74, in __init__
    super(TextOperatorsMixin, self).__init__(*args, **kwargs)
  File "d:\python37\lib\site-packages\pdfreader\viewer\pdfviewer.py", line 183, in __init__
    super(PDFViewer, self).__init__(None, Resources(), self.graphics_state_stack_class())
  File "d:\python37\lib\site-packages\pdfreader\viewer\pdfviewer.py", line 39, in __init__
    self.on_document_load()
  File "d:\python37\lib\site-packages\pdfreader\viewer\pdfviewer.py", line 247, in on_document_load
    self.navigate(self.current_page_number)
  File "d:\python37\lib\site-packages\pdfreader\viewer\pdfviewer.py", line 202, in navigate
    self._pages[n] = next(islice(self.doc.pages(), n - 1, n))
  File "d:\python37\lib\site-packages\pdfreader\document.py", line 99, in pages
    return self.root.Pages.pages()
AttributeError: 'NoneType' object has no attribute 'pages'

assessment.pdf

maxpmaxp commented 3 years ago

@dwadler Confirmed. It fails to parse object streams in the file. Not sure what the issue is. Will keep you updated.

maxpmaxp commented 3 years ago

@dwadler The document is encrypted.

Encryption is currently not supported by pdfreader.
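
For anyone hitting the same traceback, here is a rough, standard-library-only sketch of why an encrypted file can surface as a zlib "incorrect header check", plus a quick heuristic for spotting the /Encrypt key in a file's raw bytes. The helper names flate_error and has_encrypt_key are made up for this illustration and are not part of pdfreader. The XOR scrambling is only a stand-in for ciphertext; real PDF encryption uses RC4 or AES.

```python
import re
import zlib


def flate_error(data: bytes):
    """Return the zlib error message for data, or None if it inflates cleanly."""
    try:
        zlib.decompress(data)
    except zlib.error as exc:
        return str(exc)
    return None


def has_encrypt_key(raw_pdf: bytes) -> bool:
    """Heuristic only: True if the raw bytes contain an /Encrypt dictionary key.

    The negative lookahead avoids matching longer names such as /EncryptMetadata.
    """
    return re.search(rb"/Encrypt(?![A-Za-z])", raw_pdf) is not None


# A well-formed Flate stream, as a PDF writer would produce it.
plain = zlib.compress(b"1 0 obj << /Type /Catalog >> endobj")

# Stand-in for an encrypted stream (assumption: a fixed-byte XOR, not a real cipher).
scrambled = bytes(b ^ 0x5A for b in plain)

print(flate_error(plain))      # the intact stream inflates without an error
print(flate_error(scrambled))  # the scrambled stream no longer has a valid zlib header

# Checking a real file would look like this (path is an example):
# with open("assessment.pdf", "rb") as fd:
#     print(has_encrypt_key(fd.read()))
```

The byte scan can give false positives (for example, the text "/Encrypt" inside an uncompressed content stream), so treat it as a hint and not as proof that a file is encrypted.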

maxpmaxp commented 3 years ago

Duplicates #33

maxpmaxp commented 3 years ago

@dwadler Just added support for encrypted and password-protected files. Please try v0.1.6.

https://pdfreader.readthedocs.io/en/latest/tutorial.html#encrypted-and-password-protected-pdf-files
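
A minimal sketch of what opening such a file might look like with v0.1.6 or later, assuming the password keyword argument described in the linked tutorial. The wrapper open_viewer and the example password are hypothetical and only here for illustration; check the tutorial for the authoritative usage.

```python
def open_viewer(path, password=None):
    """Open a PDF with pdfreader's SimplePDFViewer, optionally with a password.

    Assumption: SimplePDFViewer accepts a `password` keyword (pdfreader >= 0.1.6),
    as described in the tutorial linked above.
    """
    # Imported lazily so this sketch can be loaded without pdfreader installed.
    from pdfreader import SimplePDFViewer

    # The file object stays open because the viewer reads from it on demand.
    fd = open(path, "rb")
    if password is None:
        return SimplePDFViewer(fd)
    return SimplePDFViewer(fd, password=password)


# Example usage (file name taken from this issue, password is made up):
# viewer = open_viewer("assessment.pdf")            # encrypted, no user password
# viewer = open_viewer("assessment.pdf", "qwerty")  # password-protected
# viewer.render()
# print(viewer.canvas.strings[:10])
```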