Closed (pat-mw closed 1 year ago)
After closer inspection, it seems that the error only happens for one of the test {{fields}}: {{project_date}}
MODIFIED LOOP (with print)
# find/replace
for field in data_fields:
    field_key = field[0]
    field_value = field[1]
    print(F"replacing {field_key} with {field_value}")
    doc = SimpleFindReplace.sub(field_key, field_value, doc)
CONSOLE OUTPUT
fpath: /tmp/input_contract_Project_Proposal.pdf
replacing {{project_name}} with borby
replacing {{project_date}} with borb
AssertionError: [unexpected error]
at /home/anvil/.env/lib/python3.10/site-packages/borb/pdf/canvas/canvas_stream_processor.py:279
called from /home/anvil/.env/lib/python3.10/site-packages/borb/toolkit/text/regular_expression_text_extraction.py:367
called from /home/anvil/.env/lib/python3.10/site-packages/borb/toolkit/text/simple_find_replace.py:60
called from ContractsModule, line 72
Any ideas why this may be happening with some but not all strings?
I'm starting to lose my mind a little xD
Just for an extra bit of context. The file is attached in the first post on this thread but here is a screenshot.
There are only two fields in this document: {{project_name}} and {{project_date}}
Second update:
Even though an Exception wasn't raised for the {{project_name}} field, it seems like the replacement didn't work.
The returned file, when downloaded, is the same as the input file
output file: output_contract_Project_Proposal (7).pdf
The error you are getting indicates something is wrong with the PDF being processed.
A PDF contains so-called content streams. These are compressed pieces of code that contain the instructions for rendering content.
You might get something like:
The error you are seeing is thrown when borb tries to process a content stream. It means borb has encountered an operator (for instance "set the active color to") but there were not enough operands on the stack.
This might have something to do with the way you are persisting the files from/to a temporary buffer. Maybe you don't have all the bytes yet?
I would try to reduce your example to its bare minimum. Just download the PDF to your local machine, and run a simple snippet of code that loads the PDF from your drive and does SimpleFindReplace.
If that works, you'll know the error is in the io part.
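A minimal local test along those lines might look like the sketch below. The function name and file paths are hypothetical, and the import paths are assumptions taken from the module paths in the traceback above, so they may differ between borb versions.

```python
from pathlib import Path


def replace_field_locally(in_path: str, out_path: str, key: str, value: str) -> None:
    """Load a PDF from disk, replace one field, and write the result back."""
    if not Path(in_path).is_file():
        raise FileNotFoundError(in_path)

    # Imported lazily so the file check above runs even where borb is missing.
    # Assumption: these import paths match the borb version in the traceback.
    from borb.pdf import PDF
    from borb.toolkit.text.simple_find_replace import SimpleFindReplace

    # Read straight from the local drive: no temporary buffers involved.
    with open(in_path, "rb") as handle:
        doc = PDF.loads(handle)

    doc = SimpleFindReplace.sub(key, value, doc)

    with open(out_path, "wb") as handle:
        PDF.dumps(handle, doc)
```

Calling `replace_field_locally("Project_Proposal.pdf", "output.pdf", "{{project_date}}", "borb")` on a locally downloaded copy should show whether the same AssertionError appears without any of the io code.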
I unfortunately don't have a local environment set up nor am I able to on the current setup I am using.
Would you possibly be able to test the simple example using the pdf file I provided?
Would be hugely useful, thank you!
I know that this doesn't necessarily rule out the io being the problem, but in my pdf parsing function (where I am extracting the fields) the pdf can be loaded by the fitz library without any issues, and it correctly parses all of the text.
@anvil.server.callable()
def contracts_extract_fields_from_pdf(pdf_row):
    pdf_media = pdf_row['contract_template']
    pdf_name = pdf_row['name']
    f_path_orig = f"/tmp/input_contract_{pdf_name.replace(' ', '_')}.pdf"
    print(F"fpath: {f_path_orig}")
    with open(f_path_orig, 'wb') as f:
        f.write(pdf_media.get_bytes())
    # attempt to read a PDF
    with open(f_path_orig, "rb") as handle:
        doc = fitz.open(handle)
        all_text = ""
        number_of_pages = len(doc)
        for i, page in enumerate(doc):
            print(f"Parsing page {i+1}/{number_of_pages} of pdf {pdf_name}")
            text = page.get_text()
            all_text += text
    regex = r"({{).*(}})"
    matches = re.finditer(regex, all_text)
    unique_matches = []
    for match in matches:
        match_string = match.group()
        if match_string not in unique_matches:
            unique_matches.append(match_string)
    return unique_matches
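One side note on the extraction step: the pattern `({{).*(}})` is greedy, so two fields on the same line of text come back as a single match. A stricter variant could look like the sketch below. It assumes field names consist only of letters, digits, and underscores, and the helper name `extract_fields` is hypothetical.

```python
import re

# Matches one {{field}} at a time; \w+ cannot run past the closing braces.
# Assumption: field names contain only letters, digits, and underscores.
FIELD_PATTERN = re.compile(r"\{\{\s*\w+\s*\}\}")


def extract_fields(text: str) -> list[str]:
    """Return unique {{field}} placeholders in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(FIELD_PATTERN.findall(text)))


sample = "{{project_name}} - {{project_date}}\nSigned: {{project_name}}"
print(extract_fields(sample))
```

This is unrelated to the AssertionError itself; it only affects which field names are extracted.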
I have also tried downloading the temp file and doing a 1-1 comparison with the input file using git, and there are no modifications - so as far as I can see, all the bytes are indeed present.
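The same byte-for-byte check can be done in Python rather than with git. This is a minimal sketch using only the standard library; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_bytes(path_a: str, path_b: str) -> bool:
    """True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If `same_bytes` returns True for the uploaded file and the temp file, the io step has not dropped or altered any bytes.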
It seems to be the PDF. Unless I am completely wrong in interpreting the PDF spec.
This is what (part of) the content stream of your PDF looks like:
Keep in mind the operators are postfix operators.
So rather than seeing 1+2, you would see 1 2 +.
BT
/F4 24 Tf
1 0 0 -1 0 30.608002 Tm
0 -24.431999 Td
[<0036>] TJ
10.535965 0 Td
[<0053>] TJ
10.7279663 0 Td
[<0048>] TJ
10.2719727 0 Td
[<0046>] TJ
8.3999786 0 Td
[<004c>] TJ
/Span BDC
5.3999786 0 Td
[<02cb>] TJ
EMC
And this is a snippet from the PDF spec:
As you can see, the BDC operator ought to have 2 arguments. In this PDF, it only has one. That's the cause of the parsing problem.
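To make the operand counting concrete, here is a rough sketch that walks a shortened copy of the excerpt above and reports any operator that has fewer operands than expected. It is not how borb itself is implemented: the tokenizer is deliberately simplified, the helper names are hypothetical, and the operand counts cover only the operators that appear in the excerpt.

```python
import re

# Operand counts for the operators in the excerpt (per the PDF operator tables).
OPERAND_COUNTS = {"BT": 0, "Tf": 2, "Tm": 6, "Td": 2, "TJ": 1, "BDC": 2, "EMC": 0}

# Simplified tokenizer: a bracketed array counts as one operand.
TOKEN = re.compile(r"\[.*?\]|\S+")


def find_missing_operands(stream: str) -> list[tuple[str, int, int]]:
    """Return (operator, expected, found) for operators with too few operands."""
    problems = []
    stack = []
    for token in TOKEN.findall(stream):
        if token in OPERAND_COUNTS:
            expected = OPERAND_COUNTS[token]
            if len(stack) < expected:
                problems.append((token, expected, len(stack)))
            stack.clear()  # operands are consumed by the operator
        else:
            stack.append(token)
    return problems


excerpt = """
BT
/F4 24 Tf
1 0 0 -1 0 30.608002 Tm
0 -24.431999 Td
[<0036>] TJ
/Span BDC
5.3999786 0 Td
[<02cb>] TJ
EMC
"""
print(find_missing_operands(excerpt))
```

On this excerpt the only operator flagged is BDC, which finds one operand (/Span) where two are expected.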
Okay,
Appreciate your time but this is way over my head, must be something internal with google docs which I'm missing - I've looked for any export settings but can't find a way to configure this.
I've gone ahead and implemented the DOCX parsing instead which ended up being a much easier solution.
Thanks
Hey there!
I've been bashing my head around Stack Overflow trying to figure out the best way to load a pdf template and replace {{fields}} with specified values for a contract generation template.
I came across borb which seemed like a simple option compared to many other solutions (Many of which I've tried and failed)
ISSUE
Unfortunately, I'm getting an error in the SimpleFindReplace() function. The traceback suggests that the issue is an assertion in canvas_stream_processor.py
BACKGROUND
I'm using a test pdf generated from google docs: Project proposal (2).pdf
I'm using a Python3.10 environment within the Anvil framework (https://anvil.works) - which uses a linux environment
My code is adapted from one of the threads I saw regarding borb:
---> the first part simply serialises the pdf object to a temporary file (so that it can be read by borb using a file path)
---> the final part rebuilds the 'anvil media object', which is a live pdf that is returned to the front-end to be downloaded
---> the data_fields object is a list of tuples in the format ("{{field_to_replace}}", "replacement")
---> The error arises in the line doc = SimpleFindReplace.sub(field_key, field_value, doc)
---> (Stack trace is below the code)
STACK TRACE
CLOSING
Any thoughts would be greatly appreciated! I've attached the document I'm using above (could be some issue with the formatting / compression that google docs uses when exporting pdfs)
I've found solutions which read and manipulate .docx files using Python rather than PDFs. However, I would like to avoid this if possible because I have already built the front-end using pdf's, and functions that allow me to search through the pdf and extract the {{fields}} to get user input. I would have to rebuild all of that from scratch if I switch file format.
Thanks in advance! Pat