Open rdhyee opened 2 days ago
You are incorrectly accessing widgets after end-of-life of the owning page. The cause for the crash is that the code does not properly detect and prevent this logic error.
I understand your intention, but you must modify your approach. Do not store the widget object itself in any way. Store its properties (like name, value, xref, etc.) and its owning page number (also here: not the page object!) if this is required. Before updating, first load the page, then the desired field (via its xref), then change widget properties and update.
@JorjMcKie Thank you so much for responding to my issue and for sketching a proper approach. I see that you are working on a fix that would have caused my code to throw an error rather than to segfault. I'll now read through the docs to translate your hints into code.
Using the feedback from @JorjMcKie, here's what I came up with, with help from some code-writing AI:
import pymupdf as pmp
from collections import defaultdict
from typing import Dict, List, Any, Optional


def get_widgets_info(doc: pmp.Document) -> Dict[str, List[Dict[str, Any]]]:
    """
    Extracts and returns a dictionary of widget information indexed by their names.

    Args:
        doc: PyMuPDF document object

    Returns:
        Dictionary mapping field names to lists of widget information dictionaries
    """
    widgets_by_name = defaultdict(list)
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        for widget in page.widgets():
            widgets_by_name[widget.field_name].append({
                "page_num": page_num,
                "xref": widget.xref,
                "field_type": widget.field_type,
                "field_value": widget.field_value,
                "rect": widget.rect
            })
    return widgets_by_name


def update_widget_value(doc: pmp.Document, page_num: int, xref: int, new_value: str) -> bool:
    """
    Safely updates a widget's value by reloading the page and widget.

    Args:
        doc: PyMuPDF document object
        page_num: Page number containing the widget
        xref: Cross-reference number of the widget
        new_value: New value to set for the widget

    Returns:
        True if widget was successfully updated, False otherwise
    """
    try:
        page = doc.load_page(page_num)
        for widget in page.widgets():
            if widget.xref == xref:
                widget.field_value = new_value
                widget.update()
                return True
        return False
    except Exception as e:
        print(f"Error updating widget: {e}")
        return False


def main():
    """Main function to process the PDF form"""
    try:
        # Open document and get widgets info
        doc = pmp.open("simple_form.pdf")
        widgets_info = get_widgets_info(doc)

        # Print widget information
        for name, widgets in widgets_info.items():
            print(f"Widget Name: {name}")
            for widget_info in widgets:
                print(f"  Page: {widget_info['page_num'] + 1}, "
                      f"Type: {widget_info['field_type']}, "
                      f"Value: {widget_info['field_value']}, "
                      f"Rect: {widget_info['rect']}")

        # Update field value safely
        if "Text1" in widgets_info and widgets_info["Text1"]:
            widget_info = widgets_info["Text1"][0]
            success = update_widget_value(
                doc,
                widget_info["page_num"],
                widget_info["xref"],
                "1234567890"
            )
            if success:
                print("Widget updated successfully")
                doc.save("simple_form_filled.pdf", garbage=4, deflate=True)
            else:
                print("Failed to update widget")
    except Exception as e:
        print(f"Error processing PDF: {e}")
    finally:
        if 'doc' in locals():
            doc.close()


if __name__ == "__main__":
    main()
Fast reaction!
Still one suggestion: simply load the widget directly: widget = page.load_widget(xref). No need to iterate ...
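A minimal sketch of that suggestion, assuming doc is an open pymupdf.Document and that page_num and xref were recorded earlier (the function name set_field_value is made up for illustration). How load_widget reacts to an xref that is not a widget on the page is not stated in this thread, so the falsy check below is an assumption, and an exception may be raised instead:

```python
from typing import Any


def set_field_value(doc: Any, page_num: int, xref: int, new_value: str) -> bool:
    """Reload page and widget from stored identifiers, then update the value.

    `doc` is expected to be an open pymupdf.Document. Only the page number
    and the xref are kept between calls, never the widget object itself.
    """
    page = doc.load_page(page_num)    # fresh page object that owns the widget
    widget = page.load_widget(xref)   # direct lookup by xref, no iteration
    if not widget:                    # assumption: a bad xref might give a falsy result
        return False
    widget.field_value = new_value
    widget.update()                   # called while `page` is still referenced
    return True
```

This would replace the loop over page.widgets() in update_widget_value; the local name page keeps the owning page alive until widget.update() has returned.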
@JorjMcKie Fast reaction because I'm so excited that you responded to my cry for help so quickly -- and you got me unstuck! I ran into the segfault almost two weeks ago and only just got around to posting the issue yesterday. I'm so happy to be able to use PyMuPDF (along with PyPDF).
Thanks also for telling me that I can load the widget directly.
My only question is: what do you still need pypdf for (🤷♂️😉)?
BTW thanks for the report: it pointed us to an open problem!
@JorjMcKie I started with PyMuPDF because I had read that it was the modern, fast library. After I ran into the segfault, I turned to PyPDF with the hope of eventually returning to PyMuPDF. So here I am.
One issue I couldn't get working with PyPDF is renaming widgets tied to the same field name into different names. I still haven't been able to successfully delete widgets using PyPDF. I'm hoping that I'll be able to use PyMuPDF to solve this problem.
You can delete widgets with PyMuPDF. Renaming is non-trivial, because field names can belong to a hierarchy like "name1.name2.name3". It is easy to imagine the problems you run into when you want to rename "name2": all its lower-level kids must be adjusted, and uniqueness throughout the full document must be guaranteed in addition ...
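To illustrate why renaming inside a hierarchy is awkward, here is a small sketch that works purely on strings of fully qualified names. It does not touch a PDF or use any PyMuPDF call; rename_component is a hypothetical helper, and real PDFs store each partial name on its own field object rather than as one dotted string:

```python
from typing import List


def rename_component(names: List[str], old_prefix: str, new_prefix: str) -> List[str]:
    """Rename a partial name such as "name1.name2" in every field name below it.

    Raises ValueError if the rename would make two previously distinct
    fully qualified names identical.
    """
    renamed = []
    for name in names:
        # The node itself and all of its descendants carry the old prefix.
        if name == old_prefix or name.startswith(old_prefix + "."):
            renamed.append(new_prefix + name[len(old_prefix):])
        else:
            renamed.append(name)
    # Uniqueness check: the number of distinct names must not shrink.
    if len(set(renamed)) != len(set(names)):
        raise ValueError("rename would make two field names collide")
    return renamed


fields = ["name1.name2.name3", "name1.name2.other", "name1.sibling"]
print(rename_component(fields, "name1.name2", "name1.renamed"))
```

Every descendant of "name2" changes its full name, and the check has to cover all names in the document, which are the two complications mentioned above.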
@JorjMcKie I'm a newbie when it comes to programmatically manipulating PDF files. One unpleasant surprise for me has been how fragile Adobe Acrobat Pro has been for editing form elements. I've been changing names and adding widgets and suddenly, the resulting file is corrupted and I lose all my edits. How can Adobe Acrobat, software that should be the closest to the canonical software for working with PDFs, be so junky? I started out trying to use the JS programmatic interface in Acrobat to manipulate the PDF but have abandoned that approach. Happy to be digging into PyMuPDF now.
In PyMuPDF git, we now have a fix for the underlying SEGV. If an annotation is unbound from its parent page (for example if the pymupdf.Page object is deleted), and then one attempts an operation on the annotation that requires the page, we now raise a Python exception.
Unfortunately the fix requires a new release of MuPDF. So depending on MuPDF release timescales, it might not be in the next release of PyMuPDF.
Description of the bug
When attempting to update a PDF form field value using widget.update(), the application crashes with a segmentation fault. The crash occurs specifically in the PDF annotation rectangle handling code.
How to reproduce the bug
simple_form.pdf
Run the following code:
output of program:
Current Behavior
The program crashes with a segmentation fault when calling field.update(). The crash occurs in the PDF annotation rectangle handling code.
Crash Details
Stack trace from fault.log:
The crash trace indicates the following call chain:
widget.update()
_save_widget()
JM_set_widget_properties()
pdf_set_annot_rect()
Additional Context
Rect(172.80099487304688, 117.16400146484375, 322.8009948730469, 139.16400146484375)
See detailed crash from Console.app:
PyMuPDF version
1.24.13
Operating system
MacOS
Python version
3.12