Closed nahidsaikat closed 4 months ago
@nahidsaikat Can you attach the PDF file and the full traceback please?
Also ensure that FileSystemStorage saves a binary file.
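One quick way to check that the saved upload really is a binary PDF is to look for the PDF header in its first bytes. This is only a minimal sketch: the helper name `looks_like_pdf` is made up for illustration, and the 1024-byte window is an assumption to tolerate files with a few leading bytes before the header.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` contains the PDF header near its start."""
    # Open in binary mode: a file saved through a text-mode handle may have
    # had its bytes re-encoded or its line endings rewritten.
    with open(path, "rb") as f:
        header = f.read(1024)
    # Every PDF begins with a "%PDF-" marker followed by the version number.
    return b"%PDF-" in header
```

If this returns False for the file written by the storage backend but True for the original, the file was probably altered while being saved.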
@nahidsaikat any updates?
Hi @maxpmaxp,
I have the same error as OP. Here is the PDF that I tried to read: fdo-fundingapplication-demandedefinancement.pdf.
Here is the full error that I got. This only happened for this PDF. Before this error, there were also dozens of "Binary data" warnings:
WARNING root:simple.py:57 Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>
WARNING root:simple.py:57 Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>
ERROR root:flate.py:23 Skipping broken stream
Traceback (most recent call last):
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\filters\flate.py", line 20, in decode
data = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
ERROR root:document.py:503 !!!Failed to locate 58 0: assuming null
Traceback (most recent call last):
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 500, in locate_object
_ = self.next_brute_force_object()
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 530, in next_brute_force_object
obj = self.body_element() # can be either indirect object, startxref or trailer
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 279, in body_element
obj = self.indirect_object()
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 565, in indirect_object
self.on_parsed_indirect_object(obj)
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 441, in on_parsed_indirect_object
self.registry.register(obj)
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\registry.py", line 31, in register
self.register_object_stream(obj.val)
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\registry.py", line 43, in register_object_stream
for obj in parser.objects(objstm["First"], objstm["N"]):
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\objstm.py", line 11, in objects
integers.append(self.non_negative_int())
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 268, in non_negative_int
n = self.numeric()
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 250, in numeric
self.on_parser_error("Invalid numeric token")
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 48, in on_parser_error
raise self.exception_class(message)
pdfreader.exceptions.ParserException: Invalid numeric token
ERROR root:document.py:503 !!!Failed to locate 35 0: assuming null
Traceback (most recent call last):
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 500, in locate_object
_ = self.next_brute_force_object()
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 530, in next_brute_force_object
obj = self.body_element() # can be either indirect object, startxref or trailer
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 285, in body_element
self.on_parser_error("Indirect object, startxref or trailer expected")
File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 48, in on_parser_error
raise self.exception_class(message)
pdfreader.exceptions.ParserException: Indirect object, startxref or trailer expected
Here is my code:
response = requests.get(url) # PDF url
with open(pdf_path, "wb+") as file:
    file.write(response.content)
    # read from the beginning
    file.seek(0)
    # see https://pdfreader.readthedocs.io/en/latest/tutorial.html for tutorial
    pdf = SimplePDFViewer(file)
    pdf.render()
    content = pdf.canvas.strings # I just want the PDF text for further processing
I tested the code on both a Windows 10 and a Linux distro (AWS Lambda Linux). I'm using Python 3.9.
Let me know if you'd like more details.
EDIT:
Hey Thomas, Thank you for your feedback. I’ll check on that and let you know, -M
@nahidsaikat @Thomas-Boi The issue is fixed. Some encrypted files were affected when the Encrypt entry in the trailer is an indirect reference that is missing from the xref table. The patch is on master and will be part of the upcoming v0.1.12.
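For anyone who wants to check whether a file falls into this category, here is a rough sketch that scans the raw bytes for an Encrypt entry written as an indirect reference. It is a heuristic only, not how pdfreader does it: the function name `find_encrypt_reference` is hypothetical, and a plain regex search can be fooled by unusual files.

```python
import re

# Matches an indirect reference such as "/Encrypt 12 0 R".
ENCRYPT_REF = re.compile(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R")


def find_encrypt_reference(pdf_bytes):
    """Return (object_number, generation) of the first indirect /Encrypt
    reference found in the raw bytes, or None if there is none."""
    match = ENCRYPT_REF.search(pdf_bytes)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Usage would be something like `find_encrypt_reference(open(path, "rb").read())`. A non-None result only says the file is encrypted via an indirect reference; it does not tell you whether that object is listed in the xref table.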
Hey Thomas!
I’ve fixed the issue and the patch is on master. It’s going to be a part of upcoming v0.1.12.
You can read and navigate your PDF. However, it might be complicated, as this specific document uses the AcroForm extension. So you probably need to navigate through the document structure rather than just calling viewer.render().
Best, Max
Thank you very much for the fix 👍
I'm getting the same issue, both pdfreader.PDFDocument(fd) and pdfreader.SimplePDFViewer(fd) are raising pdfreader.exceptions.ParserException: Invalid numeric token. I'm running pdfreader v0.1.12
I can't share the PDF here unfortunately, it's a bill from PG&E and their "sample bill" isn't running into the same issue.
Full trace:
Traceback (most recent call last):
File "/snap/pycharm-community/302/plugins/python-ce/helpers/pydev/pydevd.py", line 1496, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "/snap/pycharm-community/302/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/home/velby/software/metrictime/pdfextract.py", line 4, in <module>
v = pdfreader.SimplePDFViewer(open('/tmp/pge.pdf','rb'))
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/viewer/simple.py", line 194, in __init__
super(SimplePDFViewer, self).__init__(*args, **kwargs)
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/viewer/simple.py", line 74, in __init__
super(TextOperatorsMixin, self).__init__(*args, **kwargs)
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/viewer/pdfviewer.py", line 219, in __init__
self.doc = PDFDocument(fobj, password=password)
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/document.py", line 31, in __init__
self.parser = RegistryPDFParser(fobj, self.registry)
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 419, in __init__
self.trailer = self.pdf_trailer()
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 196, in pdf_trailer
obj = self.indirect_object()
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 618, in indirect_object
obj = super(RegistryPDFParser, self).indirect_object()
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 62, in indirect_object
num = self.non_negative_int()
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/base.py", line 271, in non_negative_int
n = self.numeric()
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/base.py", line 253, in numeric
self.on_parser_error("Invalid numeric token")
File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/base.py", line 51, in on_parser_error
raise self.exception_class(message)
pdfreader.exceptions.ParserException: Invalid numeric token
python-BaseException
@Velby Is it an option to strip all PII and share the sample?
Hello @maxpmaxp
We are getting the same warning:
WARNING:root:Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>
The pdfreader version is 0.1.12:
pip3 show pdfreader
Name: pdfreader
Version: 0.1.12
Summary: Pythonic API for parsing PDF files
Home-page: http://github.com/maxpmaxp/pdfreader
Author: Maksym Polshcha
Author-email: maxp@sterch.net
License: MIT Licence
The sample PDF (public): bol_16_05_2023.pdf
How to reproduce?
from pdfreader import SimplePDFViewer

def get_text_from_pdf(filename):
    with open(filename, 'rb') as f:
        viewer = SimplePDFViewer(f)
        final_text = ""
        # Loop through each page in the PDF file
        for canvas in viewer:
            # Extract the text from the page
            page_text = ''.join(canvas.strings)
            # Print the text
            # print(page_text)
            final_text += page_text

get_text_from_pdf('bol_16_05_2023.pdf')
3. Run the Code with python3.
Huge number of WARNING received:
`WARNING:root:Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>`
@JeanCarlosChavarriaHughes I got the same issue, do you have any solutions?
Hello @shaojun . No, I found no solution. I moved to another library, pdfminer
@JeanCarlosChavarriaHughes @shaojun I have changed the log level to debug for this message to reduce the noise.
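For anyone stuck on a release that still emits this at WARNING level, a possible workaround is a standard `logging` filter on the root logger, since the log lines above show the message coming from `root`. This is a sketch using only the Python standard library; the class name `DropBinaryDataWarning` is made up, and the prefix it matches is copied from the warning text quoted in this thread.

```python
import logging


class DropBinaryDataWarning(logging.Filter):
    """Suppress the repeated 'Binary data. Using default encoding' records."""

    PREFIX = "Binary data. Using default encoding"

    def filter(self, record):
        # Returning False drops the record; everything else passes through.
        return not record.getMessage().startswith(self.PREFIX)


# Logger-level filters only see records logged directly on that logger,
# which is the case here because the message is logged on the root logger.
logging.getLogger().addFilter(DropBinaryDataWarning())
```

Other warnings and errors, such as "Skipping broken stream", are left untouched by this filter.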
I am using pdfreader in one of my Django applications. Normally pdfreader works fine. But I need to process and extract text from a PDF file that is uploaded to the server. When I tried to use pdfreader with the uploaded PDF file inside the handler method, I got this error. My request handler function is as below.
And the definition of LinkedInPDFParser is as below.
I am working on Ubuntu 20.04, Python version 3.8.5, Django version 3.1.5.
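Since the handler code is not shown, here is only a generic sketch of one way to hand an uploaded file to a parser as raw bytes without going through storage first. It assumes the upload object exposes a `chunks()` method that yields bytes, as Django's uploaded file objects do; the helper name `uploaded_file_to_buffer` is hypothetical.

```python
import io


def uploaded_file_to_buffer(uploaded_file):
    """Copy an uploaded file into an in-memory binary buffer.

    `uploaded_file` is assumed to provide chunks(), yielding bytes objects.
    """
    buffer = io.BytesIO()
    for chunk in uploaded_file.chunks():
        buffer.write(chunk)
    # Rewind so that the consumer starts reading at the first byte.
    buffer.seek(0)
    return buffer
```

The returned buffer is a seekable binary file object, so it could be passed wherever the examples in this thread pass an open file, for example in place of the `fd` given to SimplePDFViewer.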