jorisschellekens / borb

borb is a library for reading, creating and manipulating PDF files in python.
https://borbpdf.com/

Borb: Assertion Error // SimpleFindReplace() in canvas_stream_processor.py #166

Closed pat-mw closed 1 year ago

pat-mw commented 1 year ago

Hey there!

I've been bashing my head against Stack Overflow trying to figure out the best way to load a PDF template and replace {{fields}} with specified values for a contract generation template.

I came across borb, which seemed like a simple option compared to many other solutions (many of which I've tried without success).

ISSUE

Unfortunately, I'm getting an error in the SimpleFindReplace() function. The traceback suggests that the issue is an assertion in canvas_stream_processor.py.

BACKGROUND

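import typing

import anvil.media
import anvil.server
from borb.pdf import Document, PDF
from borb.toolkit.text.simple_find_replace import SimpleFindReplace


@anvil.server.callable()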
def contracts_insert_fields_to_pdf(pdf_row, data_fields: list):
    pdf_media = pdf_row['contract_template']
    pdf_name = pdf_row['name']

    f_path_orig = f"/tmp/input_contract_{pdf_name.replace(' ', '_')}.pdf"
    print(F"fpath: {f_path_orig}")
    with open(f_path_orig, 'wb') as f:
        f.write(pdf_media.get_bytes())

    # attempt to read a PDF
    doc: typing.Optional[Document] = None
    with open(f_path_orig, "rb") as handle:
        doc = PDF.loads(handle)

    # check whether we actually read a PDF
    assert doc is not None

    # find/replace
    for field in data_fields:
        field_key = field[0]
        field_value = field[1]
        doc = SimpleFindReplace.sub(field_key, field_value, doc)

    # store
    f_path = f"/tmp/output_contract_{pdf_name.replace(' ', '_')}.pdf"

    with open(f_path, "wb") as handle:
        PDF.dumps(handle, doc)

    media_object = anvil.media.from_file(f_path, "application/pdf")
    return media_object

STACK TRACE

AssertionError: [unexpected error]
at /home/anvil/.env/lib/python3.10/site-packages/borb/pdf/canvas/canvas_stream_processor.py:279
called from /home/anvil/.env/lib/python3.10/site-packages/borb/toolkit/text/regular_expression_text_extraction.py:367
called from /home/anvil/.env/lib/python3.10/site-packages/borb/toolkit/text/simple_find_replace.py:60
called from ContractsModule, line 72

CLOSING

Any thoughts would be greatly appreciated! I've attached the document I'm using above (there could be some issue with the formatting or compression that Google Docs uses when exporting PDFs).

I've found solutions which read and manipulate .docx files using Python rather than PDFs. However, I would like to avoid this if possible, because I have already built the front-end around PDFs, along with functions that search through the PDF and extract the {{fields}} to get user input. I would have to rebuild all of that from scratch if I switched file format.

Thanks in advance! Pat

pat-mw commented 1 year ago

After closer inspection, it seems that the error only happens for one of the test {{fields}}: {{project_date}}

MODIFIED LOOP (with print)

    # find/replace
    for field in data_fields:
        field_key = field[0]
        field_value = field[1]
        print(F"replacing {field_key} with {field_value}")
        doc = SimpleFindReplace.sub(field_key, field_value, doc)

CONSOLE OUTPUT

fpath: /tmp/input_contract_Project_Proposal.pdf
replacing {{project_name}} with borby
replacing {{project_date}} with borb
AssertionError: [unexpected error]
    at /home/anvil/.env/lib/python3.10/site-packages/borb/pdf/canvas/canvas_stream_processor.py:279
    called from /home/anvil/.env/lib/python3.10/site-packages/borb/toolkit/text/regular_expression_text_extraction.py:367
    called from /home/anvil/.env/lib/python3.10/site-packages/borb/toolkit/text/simple_find_replace.py:60
    called from ContractsModule, line 72

Any ideas why this may be happening with some but not all strings?

I'm starting to lose my mind a little xD

pat-mw commented 1 year ago

Just for an extra bit of context. The file is attached in the first post on this thread but here is a screenshot. There are only two fields in this document {{project_name}} - {{project_date}}

[image: screenshot of the document]

pat-mw commented 1 year ago

Second update:

Even though an Exception wasn't raised for the {{project_name}} field, it seems like the replacement didn't work.

The returned file, when downloaded, is the same as the input file.

output file: output_contract_Project_Proposal (7).pdf

[image: screenshot of the output file]

jorisschellekens commented 1 year ago

The error you are getting indicates something is wrong with the PDF being processed.

A PDF contains so-called content streams. These are compressed pieces of code that contain instructions on how to render content.

You might get something like:

The error you are seeing is thrown when borb tries to process a content stream. It means borb has encountered an operator (for instance "set the active color to") but there were not enough operands on the stack.
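To make the idea of "not enough operands on the stack" concrete, here is a toy sketch. It is not borb's implementation: the operator table is a tiny hand-picked subset, the function name `run` is made up for illustration, and zlib stands in for the Flate compression that PDF content streams commonly use.

```python
import zlib

# Operand counts for a few PDF operators (small illustrative subset):
# "r g b rg" sets the fill color, "x y w h re" adds a rectangle, "f" fills the path.
OPERAND_COUNT = {"rg": 3, "re": 4, "f": 0}


def run(content_stream: bytes) -> list:
    """Replay a decompressed content stream and return (operator, operands) pairs."""
    stack, executed = [], []
    for token in content_stream.decode("latin-1").split():
        if token in OPERAND_COUNT:
            n = OPERAND_COUNT[token]
            # This is the kind of check that fails on a malformed stream.
            assert len(stack) >= n, f"{token!r} needs {n} operands, found {len(stack)}"
            split_at = len(stack) - n
            executed.append((token, stack[split_at:]))
            del stack[split_at:]
        else:
            # Everything that is not a known operator is treated as a number here.
            stack.append(float(token))
    return executed


# A content stream is stored compressed; decompress it before interpreting it.
compressed = zlib.compress(b"1 0 0 rg 10 10 100 50 re f")
print(run(zlib.decompress(compressed)))
```

Feeding the same function a stream such as `b"0 0 rg"` (one operand short) trips the assertion, which mirrors the kind of failure reported in the traceback.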

This might have something to do with the way you are persisting the files from/to a temporary buffer. Maybe you don't have all the bytes yet?

I would try to reduce your example to its bare minimum. Just download the PDF to your local machine, and run a simple snippet of code that loads the PDF from your drive and does SimpleFindReplace.

If that works, you'll know the error is in the I/O part.
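A reduced test along those lines might look like the sketch below. It only uses the calls already shown in this thread (`PDF.loads`, `SimpleFindReplace.sub`, `PDF.dumps`). The import paths are assumptions based on borb's README and on the module path in the traceback, and they may differ between borb versions. The function name and the file names in the usage note are placeholders.

```python
def find_replace_local(in_path: str, out_path: str, replacements: dict) -> None:
    """Load a PDF from disk, apply each replacement, and write the result."""
    # Imported inside the function so the sketch can be pasted into any module;
    # these import paths are assumed and may vary with the installed borb version.
    from borb.pdf import PDF
    from borb.toolkit.text.simple_find_replace import SimpleFindReplace

    # Read the PDF straight from the local drive (no temporary buffer involved).
    with open(in_path, "rb") as handle:
        doc = PDF.loads(handle)
    assert doc is not None, "borb could not read the PDF"

    # Apply the replacements one at a time, printing progress so a failing
    # field is easy to spot.
    for pattern, value in replacements.items():
        print(f"replacing {pattern} with {value}")
        doc = SimpleFindReplace.sub(pattern, value, doc)

    # Store the modified document.
    with open(out_path, "wb") as handle:
        PDF.dumps(handle, doc)
```

It could be called as `find_replace_local("input.pdf", "output.pdf", {"{{project_name}}": "borby"})`. If this fails in the same way on a locally downloaded file, the I/O layer is unlikely to be the cause.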

pat-mw commented 1 year ago

Unfortunately, I don't have a local environment set up, and I'm not able to create one on my current setup.

Would you possibly be able to test the simple example using the pdf file I provided?

Would be hugely useful, thank you!

pat-mw commented 1 year ago

I know this doesn't rule out the I/O being the problem, but in my PDF parsing function (where I extract the fields), the fitz library loads the PDF and parses all of the text without any issues.

import re

import anvil.server
import fitz  # PyMuPDF


@anvil.server.callable()
def contracts_extract_fields_from_pdf(pdf_row):
    pdf_media = pdf_row['contract_template']
    pdf_name = pdf_row['name']
    f_path_orig = f"/tmp/input_contract_{pdf_name.replace(' ', '_')}.pdf"
    print(F"fpath: {f_path_orig}")
    with open(f_path_orig, 'wb') as f:
        f.write(pdf_media.get_bytes())

    # attempt to read a PDF
    with open(f_path_orig, "rb") as handle:
        doc = fitz.open(handle)

    all_text = ""
    number_of_pages = len(doc)
    for i, page in enumerate(doc):
        print(f"Parsing page {i+1}/{number_of_pages} of pdf {pdf_name}")
        text = page.get_text()
        all_text += text

    # non-greedy, so two fields on the same line are matched separately
    regex = r"\{\{.*?\}\}"
    matches = re.finditer(regex, all_text)
    unique_matches = []
    for match in matches:
        match_string = match.group()
        if match_string not in unique_matches:
            unique_matches.append(match_string)

    return unique_matches

pat-mw commented 1 year ago

I have also tried downloading the temp file and doing a 1-1 comparison with the input file using git, and there are no modifications - so as far as I can see, all the bytes are indeed present.
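For anyone who wants to repeat that check without git, a byte-for-byte comparison can be sketched with a hash of each file. The helper names below are made up for illustration, and the paths passed in would be whatever the temp file and the original template are called.

```python
import hashlib


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a: str, path_b: str) -> bool:
    """True when both files have identical content."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If `same_bytes` returns True for the uploaded template and the file written to /tmp, no bytes were lost while persisting the file.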

jorisschellekens commented 1 year ago

It seems to be the PDF, unless I am completely wrong in interpreting the PDF spec.

This is what (part of) the content stream of your PDF looks like. Keep in mind that the operators are postfix operators, so rather than seeing 1+2, you would see 1 2 +.

BT
/F4 24 Tf
1 0 0 -1 0 30.608002 Tm
0 -24.431999 Td
[<0036>] TJ
10.535965 0 Td
[<0053>] TJ
10.7279663 0 Td
[<0048>] TJ
10.2719727 0 Td
[<0046>] TJ
8.3999786 0 Td
[<004c>] TJ
/Span BDC
5.3999786 0 Td
[<02cb>] TJ
EMC

And this is a snippet from the PDF spec:

[image: excerpt from the PDF spec describing the BDC operator]

As you can see, the BDC operator ought to have 2 arguments. In this PDF, it only has one. That's the cause of the parsing problem.
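The operand count can be checked mechanically. The sketch below is a rough illustration, not how borb parses streams: the tokenizer is deliberately naive (no nested arrays or dictionaries, no literal strings), the operator table covers only the operators that appear in the snippet above, and `check_operands` is a made-up helper name.

```python
import re

# How many operands each operator takes (subset of the PDF operator table).
EXPECTED_OPERANDS = {
    "BT": 0, "ET": 0, "Tf": 2, "Tm": 6, "Td": 2, "TJ": 1,
    "BMC": 1, "BDC": 2, "EMC": 0,
}

# Arrays and dictionaries are kept as single tokens; everything else is
# split on whitespace.
TOKEN = re.compile(r"\[.*?\]|<<.*?>>|<[0-9A-Fa-f]*>|/\S+|\S+", re.S)


def check_operands(stream: str) -> list:
    """Return (operator, expected, found) for every operator with a wrong operand count."""
    problems, operands = [], []
    for token in TOKEN.findall(stream):
        if token in EXPECTED_OPERANDS:
            expected = EXPECTED_OPERANDS[token]
            if len(operands) != expected:
                problems.append((token, expected, len(operands)))
            operands = []
        else:
            operands.append(token)
    return problems


# Tail of the content stream quoted above.
SNIPPET = """
BT
/F4 24 Tf
1 0 0 -1 0 30.608002 Tm
0 -24.431999 Td
[<004c>] TJ
/Span BDC
5.3999786 0 Td
[<02cb>] TJ
EMC
"""

print(check_operands(SNIPPET))
```

On this snippet the only operator flagged is BDC, which is preceded by one operand (/Span) where two are expected: a tag and a property list.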

pat-mw commented 1 year ago

Okay,

I appreciate your time, but this is way over my head. It must be something internal to Google Docs that I'm missing; I've looked for export settings but can't find a way to configure this.

I've gone ahead and implemented the DOCX parsing instead, which ended up being a much easier solution.

Thanks