maxpmaxp / pdfreader

Python API for PDF documents
MIT License

pdfreader.exceptions.ParserException: Invalid numeric token #70

Closed nahidsaikat closed 4 months ago

nahidsaikat commented 3 years ago

I am using pdfreader in one of my Django applications, and it normally works fine. However, I need to extract text from a PDF file that is uploaded to the server. When I use pdfreader on the uploaded PDF inside the request handler, I get this error.

My request handler function is as below.

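def put(self, request, *args, **kwargs):
    file_obj = request.data['file']
    fs = FileSystemStorage()
    file = fs.save(file_obj.name, file_obj)
    file_path = fs.path(file)
    skills = {}  # default, so the response below never uses an unbound name
    try:
        skills = LinkedInPDFParser(file_path).parse()
    except Exception:
        # Log the traceback instead of silently swallowing it
        # (requires `import logging` at module level)
        logging.exception("Failed to parse %s", file_path)
    finally:
        fs.delete(file)

    return Response(skills, status=status.HTTP_200_OK)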

And the definition of LinkedInPDFParser is as below.

class LinkedInPDFParser:
    """
    A custom PDF parser class that will parse a PDF file exported from Linkedin.
    """
    def __init__(self, file_path):
        self.file_path = file_path

    def parse(self):
        """
        The main method that will parse the PDF file and extract data from it.
        """
        with open(self.file_path, "rb") as fd:
            viewer = SimplePDFViewer(fd)
            viewer.render()
            ts_index = -1
            for idx, val in enumerate(viewer.canvas.strings):
                if val == "Top Skills":
                    ts_index = idx
                    break
            skills = {
                'skill_1': viewer.canvas.strings[ts_index + 1],
                'skill_2': viewer.canvas.strings[ts_index + 2],
                'skill_3': viewer.canvas.strings[ts_index + 3],
            }
        return skills
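
One thing to watch in parse() above: if "Top Skills" never appears in the strings, ts_index stays at -1 and the lookups silently read indices 0 to 2 instead. A minimal sketch of a guard follows; the helper name items_after_marker is hypothetical and is not part of pdfreader.

```python
def items_after_marker(strings, marker, count):
    """Return up to `count` items that follow `marker` in `strings`.

    Returns an empty list when the marker is absent, instead of
    falling back to the start of the list.
    """
    try:
        start = strings.index(marker) + 1
    except ValueError:
        # Marker not found: nothing to extract
        return []
    # Slicing never raises, so a short tail simply yields fewer items
    return strings[start:start + count]
```

With this, the three skills could be read as items_after_marker(viewer.canvas.strings, "Top Skills", 3), and the caller can check the length of the result before building the dictionary.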

I am working on Ubuntu 20.04, Python 3.8.5, Django 3.1.5.

maxpmaxp commented 3 years ago

@nahidsaikat Can you attach the PDF file and the full traceback please?

Also ensure that FileSystemStorage saves a binary file.
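
A quick way to check the second point without involving pdfreader: a PDF normally starts with the bytes %PDF-. The sketch below opens the saved file in binary mode and looks for that marker near the start. The helper name looks_like_pdf and the 1024-byte window are assumptions chosen for illustration; they are not part of pdfreader or Django.

```python
def looks_like_pdf(path, window=1024):
    """Return True if the file at `path` has a %PDF- marker near its start."""
    # Binary mode matters: text mode could alter or fail to decode the bytes
    with open(path, "rb") as fd:
        head = fd.read(window)
    return b"%PDF-" in head
```

If this returns False for the path that FileSystemStorage reports, the upload was probably altered or truncated before it reached the parser.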

maxpmaxp commented 3 years ago

@nahidsaikat any updates?

Thomas-Boi commented 2 years ago

Hi @maxpmaxp,

I have the same error as OP. Here is the PDF that I tried to read: fdo-fundingapplication-demandedefinancement.pdf.

Here is the full error that I got. This only happened for this PDF. Before this error, there were also dozens of "Binary data" warnings:

WARNING  root:simple.py:57 Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>
WARNING  root:simple.py:57 Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>
ERROR    root:flate.py:23 Skipping broken stream
Traceback (most recent call last):
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\filters\flate.py", line 20, in decode
    data = zlib.decompress(data)
zlib.error: Error -3 while decompressing data: incorrect header check
ERROR    root:document.py:503 !!!Failed to locate 58 0: assuming null
Traceback (most recent call last):
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 500, in locate_object
    _ = self.next_brute_force_object()
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 530, in next_brute_force_object
    obj = self.body_element() # can be either indirect object, startxref or trailer
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 279, in body_element
    obj = self.indirect_object()
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 565, in indirect_object
    self.on_parsed_indirect_object(obj)
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 441, in on_parsed_indirect_object
    self.registry.register(obj)
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\registry.py", line 31, in register
    self.register_object_stream(obj.val)
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\registry.py", line 43, in register_object_stream
    for obj in parser.objects(objstm["First"], objstm["N"]):
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\objstm.py", line 11, in objects
    integers.append(self.non_negative_int())
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 268, in non_negative_int
    n = self.numeric()
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 250, in numeric
    self.on_parser_error("Invalid numeric token")
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 48, in on_parser_error
    raise self.exception_class(message)
pdfreader.exceptions.ParserException: Invalid numeric token
ERROR    root:document.py:503 !!!Failed to locate 35 0: assuming null
Traceback (most recent call last):
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 500, in locate_object
    _ = self.next_brute_force_object()
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 530, in next_brute_force_object
    obj = self.body_element() # can be either indirect object, startxref or trailer
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\document.py", line 285, in body_element
    self.on_parser_error("Indirect object, startxref or trailer expected")
  File "D:\work\pocketedio\webscraper-hash-lambda\venv\lib\site-packages\pdfreader\parsers\base.py", line 48, in on_parser_error
    raise self.exception_class(message)
pdfreader.exceptions.ParserException: Indirect object, startxref or trailer expected

Here is my code:

    response = requests.get(url)  # PDF url
    with open(pdf_path, "wb+") as file:
        file.write(response.content)

        # read from the beginning
        file.seek(0)
        # see https://pdfreader.readthedocs.io/en/latest/tutorial.html for tutorial
        pdf = SimplePDFViewer(file)
        pdf.render()
        content = pdf.canvas.strings  # I just want the PDF text for further processing

I tested the code on both a Windows 10 and a Linux distro (AWS Lambda Linux). I'm using Python 3.9.

Let me know if you'd like more details.

maxpmaxp commented 2 years ago

Hey Thomas, Thank you for your feedback. I’ll check on that and let you know, -M

maxpmaxp commented 2 years ago

@nahidsaikat @Thomas-Boi The issue is fixed. Some encrypted files were affected when the Encrypt entry in the trailer is an indirect reference that is missing from the xref table. The patch is on master and is going to be part of the upcoming v0.1.12.
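
For context on the "incorrect header check" line in the traceback above: that is the error zlib reports when it is handed bytes that are not a zlib stream, which is what one would expect if an encrypted stream reaches the Flate decoder without being decrypted first. The sketch below only illustrates that general behaviour with the standard library. The XOR scrambling is a stand-in for real PDF encryption, and try_inflate is a hypothetical helper, not pdfreader code.

```python
import zlib


def try_inflate(data):
    """Return (True, decoded_bytes) on success or (False, error_text) on failure."""
    try:
        return True, zlib.decompress(data)
    except zlib.error as exc:
        return False, str(exc)


payload = b"BT /F1 12 Tf (Hello) Tj ET"
compressed = zlib.compress(payload)

# Stand-in for an encrypted stream: every byte is changed, so the
# two-byte zlib header no longer passes its checksum
scrambled = bytes(b ^ 0x5A for b in compressed)

print(try_inflate(compressed))
print(try_inflate(scrambled))
```

The first call succeeds and the second fails with a zlib data error, mirroring the "Skipping broken stream" message in the log.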

maxpmaxp commented 2 years ago

Hey Thomas! I’ve fixed the issue and the patch is on master. It’s going to be part of the upcoming v0.1.12. You can now read and navigate your PDF. However, it might be complicated because this specific document uses the AcroForm extension, so you will probably need to navigate through the document structure rather than just calling viewer.render().

Best, Max

Thomas-Boi commented 2 years ago

Thank you very much for the fix 👍

Velby commented 1 year ago

I'm getting the same issue: both pdfreader.PDFDocument(fd) and pdfreader.SimplePDFViewer(fd) raise pdfreader.exceptions.ParserException: Invalid numeric token. I'm running pdfreader v0.1.12.

I can't share the PDF here unfortunately, it's a bill from PG&E and their "sample bill" isn't running into the same issue.

Full trace:

Traceback (most recent call last):
  File "/snap/pycharm-community/302/plugins/python-ce/helpers/pydev/pydevd.py", line 1496, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/snap/pycharm-community/302/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/velby/software/metrictime/pdfextract.py", line 4, in <module>
    v = pdfreader.SimplePDFViewer(open('/tmp/pge.pdf','rb'))
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/viewer/simple.py", line 194, in __init__
    super(SimplePDFViewer, self).__init__(*args, **kwargs)
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/viewer/simple.py", line 74, in __init__
    super(TextOperatorsMixin, self).__init__(*args, **kwargs)
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/viewer/pdfviewer.py", line 219, in __init__
    self.doc = PDFDocument(fobj, password=password)
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/document.py", line 31, in __init__
    self.parser = RegistryPDFParser(fobj, self.registry)
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 419, in __init__
    self.trailer = self.pdf_trailer()
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 196, in pdf_trailer
    obj = self.indirect_object()
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 618, in indirect_object
    obj = super(RegistryPDFParser, self).indirect_object()
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/document.py", line 62, in indirect_object
    num = self.non_negative_int()
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/base.py", line 271, in non_negative_int
    n = self.numeric()
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/base.py", line 253, in numeric
    self.on_parser_error("Invalid numeric token")
  File "/home/velby/.local/lib/python3.8/site-packages/pdfreader/parsers/base.py", line 51, in on_parser_error
    raise self.exception_class(message)
pdfreader.exceptions.ParserException: Invalid numeric token
python-BaseException

maxpmaxp commented 1 year ago

@Velby Is it an option to strip all PII and share the sample?

JeanCarlosChavarriaHughes commented 1 year ago

Hello @maxpmaxp

We are getting the same warning: WARNING:root:Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>

The pdfreader version is 0.1.12:

pip3 show pdfreader 
Name: pdfreader
Version: 0.1.12
Summary: Pythonic API for parsing PDF files
Home-page: http://github.com/maxpmaxp/pdfreader
Author: Maksym Polshcha
Author-email: maxp@sterch.net
License: MIT Licence

The sample PDF (public): bol_16_05_2023.pdf

How to reproduce?

  1. Save the PDF file with the given filename.
  2. Copy and paste the following Python 3 code (main.py):

    from pdfreader import SimplePDFViewer

    def get_text_from_pdf(filename):
        with open(filename, 'rb') as f:
            # Create a SimplePDFViewer object
            viewer = SimplePDFViewer(f)
            final_text = ""

            # Loop through each page in the PDF file
            for canvas in viewer:
                # Extract the text from the page
                page_text = ''.join(canvas.strings)

                # Print the text
                # print(page_text)
                final_text += page_text

        # Return the accumulated text so the caller can use it
        return final_text

    get_text_from_pdf('bol_16_05_2023.pdf')

  3. Run the code with python3.

A huge number of warnings are received:
`WARNING:root:Binary data. Using default encoding. Possibly arg of unsupported operator: <class 'bytes'>`

shaojun commented 10 months ago

@JeanCarlosChavarriaHughes I got the same issue, do you have any solutions?

JeanCarlosChavarriaHughes commented 10 months ago

Hello @shaojun. No, I found no solution. I moved to another library, pdfminer.

maxpmaxp commented 4 months ago

@JeanCarlosChavarriaHughes @shaojun I have changed the log level to debug for this message to reduce the noise.
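
For anyone on a release that still emits this message at WARNING level, one possible workaround is a standard logging filter. This is a sketch under the assumption that the message is logged directly on the root logger, as the WARNING:root: prefix in the output above suggests. The class name DropBinaryDataWarning is hypothetical and the prefix string is copied from that output.

```python
import logging


class DropBinaryDataWarning(logging.Filter):
    """Drop the noisy 'Binary data' records and let everything else through."""

    PREFIX = "Binary data. Using default encoding."

    def filter(self, record):
        # Returning False discards the record
        return not record.getMessage().startswith(self.PREFIX)


# Logger-level filters apply only to records logged on that logger itself,
# which is why this relies on the root-logger assumption stated above
logging.getLogger().addFilter(DropBinaryDataWarning())
```

Installing the filter once, before creating the viewer, should be enough; other warnings and errors are unaffected.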