pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

"Yes" for all checkboxes does not work for all PDF rendering engines. #4055

Open sarahkittyy opened 22 hours ago

sarahkittyy commented 22 hours ago

Description of the bug

Ran into this tricky issue. It's not true that /Yes is a valid ON state for every checkbox widget in a PDF, contrary to what is implemented and specified in the docs.

Rather, 99% of PDF rendering engines automatically take "Yes" as an acceptable On state, even if it is not specified in the /AP/N dict of the widget.

Try iterating over all widgets in a PDF with checkboxes, such as any IRS tax form, and setting them to "Yes". It will render correctly in browsers, but when you render it in SodaPDF's online editor, the checkmarks will not appear. SodaPDF is an example of a renderer that does not assume "Yes" is a valid default ON state.

For my form specifically, a valid yes state is actually "/V /5". I tried setting widget.field_value = '5' and also widget.field_value = '/5' but there is some internal code that is automatically changing this value to OFF, and not respecting my input. widget.on_state() does not work since it just provides Yes and not '5' like I need.
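To make the lookup I am after concrete, here is a minimal sketch of picking the ON state from a list of appearance-state names. The helper name pick_on_state is hypothetical (it is not part of PyMuPDF), and it assumes the ON state is simply the first name that is not "Off".

```python
def pick_on_state(normal_states):
    """Return the first appearance-state name that is not 'Off', or None.

    `normal_states` is assumed to be a list of names taken from the
    widget's /AP /N dictionary, e.g. ["5", "Off"]. A leading "/" is
    tolerated and stripped.
    """
    for name in normal_states or []:
        bare = name.lstrip("/")
        if bare != "Off":
            return bare
    return None


# Hypothetical state lists, in the shape I see for my form.
print(pick_on_state(["5", "Off"]))
print(pick_on_state(["Off"]))
```

With PyMuPDF, the input would presumably come from widget.button_states()["normal"], which can be None, hence the `or []` guard.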

Then I tried manually setting the XREF like pdf.xref_set_key(widget.xref, 'V', '5'). But after a widget.update(), this is changed. And without calling widget.update(), this xref change is not reflected in the resulting saved document.
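One detail about the raw-xref route: xref_set_key takes its value as a string in PDF syntax, so '5' is written as the number 5, while the name object needs to be '/5'. Below is a rough sketch of setting both /V and /AS as names. The helper names as_pdf_name and set_checkbox_state_raw are hypothetical, and the sketch assumes both keys live directly on the widget's xref (fields with a parent may keep /V elsewhere). I have not confirmed that this alone fixes rendering in every viewer.

```python
def as_pdf_name(state):
    """Format a state such as '5' or '/5' as a PDF name string ('/5').

    Assumes the name contains no characters that need #xx escaping.
    """
    return "/" + state.lstrip("/")


def set_checkbox_state_raw(doc, xref, state):
    """Write `state` as a PDF name into /V and /AS of the object at `xref`.

    `doc` is expected to offer xref_set_key(xref, key, value), as a
    PyMuPDF Document does. Returns the name string that was written.
    """
    name = as_pdf_name(state)
    doc.xref_set_key(xref, "V", name)   # field value
    doc.xref_set_key(xref, "AS", name)  # currently selected appearance state
    return name
```

Usage would be something like set_checkbox_state_raw(pdf, widget.xref, '5'), without calling widget.update() afterwards, since update() is what overwrites the value in my case.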

In other pdf libraries like pdfrw, the change to /V is respected and checking the checkbox works as expected.

This is clearly a bug, and the widget update() method needs to respect the value present in widget.field_value. In the meantime, do you have any recommendations for updating a widget in the full document without calling widget.update()? For example, if I use pdf.xref_set_key to change the widget's /V value manually, how can I make that change persist until I call pdf.save()?

Thank you.

How to reproduce the bug

import pymupdf

doc = pymupdf.open('f7004.pdf')  # IRS tax form 7004; any PDF with checkboxes will do
for page in doc:
  for widget in page.widgets():
    if widget.field_type == pymupdf.PDF_WIDGET_TYPE_CHECKBOX:
      print(doc.xref_get_key(widget.xref, 'V'))  # OFF state: /Off
      widget.field_value = widget.button_states().get('down')[0]  # for me, this is '5'; should be the ON state
      widget.update()
      print(doc.xref_get_key(widget.xref, 'V'))  # still /Off

PyMuPDF version

1.24.13

Operating system

Linux

Python version

3.12

JorjMcKie commented 7 hours ago

Without a reproducing file we cannot deal with this post. In this case we need an example where a "non-/Yes" value is provided which we do not handle as "ON". Supporting non-/Yes as ON on input is an enhancement request, not a bug: the PDF spec clearly recommends using /Yes as the ON value (see the attached image).