py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

`update_page_form_field_values` fails on pdf with same field on multiple pages. #2234

Closed oeble closed 5 months ago

oeble commented 11 months ago

I have this PDF file with some fields duplicated on multiple pages. When I try to fill any of those fields (for example, "n et p") using update_page_form_field_values, it fails with KeyError: '/AP'.

My wild guess is that it is because update_page_form_field_values takes one page to update while the same field is duplicated multiple times over the whole document.

Side note: pdftk handles this well, but I'm looking for a native Python solution.

Environment

$ python -m platform
Windows-10-10.0.22631-SP0

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.16.2, crypt_provider=('cryptography', '41.0.4'), PIL=none

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("634_empty.pdf")
writer = PdfWriter()

# Fill the PDF
writer.append(reader)
fields = reader.get_fields()

page_1 = {
    "n et p": "test",
}

writer.update_page_form_field_values(writer.pages[1], page_1)

with open("test_output.pdf", "wb") as output_stream:
    writer.write(output_stream)

I'm sharing the pdf file that causes the issue, but I'm not the author, so I don't think it can be included in tests.

Traceback

This is the complete (redacted) Traceback I see:

Traceback (most recent call last):
  File "C:\..\dap_form.py", line 83, in validate_data
    fill_dap_pdf(v, "test_output.pdf")
  File "C:\..\dap_generate.py", line 45, in fill_dap_pdf
    writer.update_page_form_field_values(writer.pages[1], page_1)
  File "C:\..\venv\Lib\site-packages\pypdf\_writer.py", line 1072, in update_page_form_field_values
    value if value in k[AA.AP]["/N"] else "/Off"
                      ~^^^^^^^
  File "C:\..\venv\Lib\site-packages\pypdf\generic\_data_structures.py", line 320, in __getitem__
    return dict.__getitem__(self, key).get_object()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: '/AP'

pubpub-zz commented 11 months ago

Your form uses a special field that is synchronized across multiple pages. Thanks for the example. However, it will take some time to find a fix.
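To make the "one field shown on several pages" structure concrete, here is a simplified model built from plain Python dictionaries rather than pypdf objects. The dictionary layout and the helper name `widgets_for_field` are illustrative assumptions, not pypdf's internal representation: a single field carries the name (/T) and value (/V), and each page holds a widget annotation that points back to it through /Parent.

```python
# Simplified, hypothetical model: plain dicts stand in for PDF objects.
# One field ("n et p") owns the name (/T) and the value (/V); each widget
# annotation only points back to that field through /Parent.
field = {"/T": "n et p", "/V": "", "/Kids": []}

widget_page_1 = {"/Subtype": "/Widget", "/Parent": field}
widget_page_2 = {"/Subtype": "/Widget", "/Parent": field}
field["/Kids"] = [widget_page_1, widget_page_2]

pages = [
    {"/Annots": [widget_page_1]},
    {"/Annots": [widget_page_2]},
]


def widgets_for_field(pages, name):
    """Return (page_index, widget) pairs whose field name matches `name`."""
    matches = []
    for index, page in enumerate(pages):
        for annot in page.get("/Annots", []):
            # The name may sit on the widget itself or on its parent field.
            owner = annot if "/T" in annot else annot.get("/Parent", {})
            if owner.get("/T") == name:
                matches.append((index, annot))
    return matches


# Setting the value once on the shared field is visible from every widget.
field["/V"] = "test"
for index, widget in widgets_for_field(pages, "n et p"):
    print(index, widget["/Parent"]["/V"])
```

In this model the value lives in one place only, which is why updating a single page's annotations is not enough to describe the whole field.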

pcraciunoiu commented 9 months ago

I just ran into a similar issue on a single page PDF where the same field is repeated.

This try/except fixes it, but it's probably not the right way to go about it.

https://github.com/py-pdf/pypdf/pull/2333
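The content of that pull request is not reproduced here. Purely as an illustration of the general idea, the failing expression from the traceback (`value if value in k[AA.AP]["/N"] else "/Off"`) could be guarded as follows. `appearance_state` is a hypothetical helper that works on plain dictionaries; it is not a pypdf function and not necessarily what the pull request does.

```python
def appearance_state(annotation, value):
    """Pick the appearance state for a button-like widget.

    Hypothetical helper on plain dicts: it mirrors the expression from the
    traceback, but tolerates a missing /AP entry instead of raising.
    """
    try:
        normal_states = annotation["/AP"]["/N"]
    except KeyError:
        # No appearance dictionary on this object: nothing to select.
        return None
    return value if value in normal_states else "/Off"


with_ap = {"/AP": {"/N": {"/Yes": object(), "/Off": object()}}}
without_ap = {"/T": "n et p"}

print(appearance_state(with_ap, "/Yes"))     # /Yes
print(appearance_state(with_ap, "/Maybe"))   # /Off
print(appearance_state(without_ap, "/Yes"))  # None
```

Catching the KeyError only hides the symptom, which is consistent with the doubt expressed above that this is the right fix.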

antonio-cinnamon commented 5 months ago

A hacky way to make it work is to go into Adobe Acrobat and rename the fields to include a suffix.

For example, mine wasn't working when the field was called USER_NAME, but when I changed it to USER_NAME_1 it started working. I wasn't even using the field on multiple pages, yet I was having the same issue.

pubpub-zz commented 5 months ago

As said, having several annotations/widgets referring to the same field is normal: it allows the filled-in data to be shown on multiple pages. The solution you are proposing amounts to creating new fields with new names.

The PDF here has not been built properly: it contains duplicated fields with the same name that are not attached properly. The extra forms should first have been prepared by adding a grouping field, as described in the documentation: https://pypdf.readthedocs.io/en/stable/user/merging-pdfs.html#merging-forms

@antonio-cinnamon / @oeble, give it a try.

oeble commented 5 months ago

Thx. I'll look into this.

ReedGraff commented 5 months ago

test1.pdf test2.pdf

Can you elaborate on how to use the grouping field, or provide an example? I have played around with reader.add_form_topname("form1"), but I have not yet been able to solve this issue with it and need to learn more about how it works.

Below is my specific usage thus far:

from pypdf import PdfReader, PdfWriter

myFiles = {
    "test1": {
        "name": "Test1 Form",
        "path": "test1.pdf",
        "usage": {
            "fields": {
                "First Name": "Reed",
                "Middle Name": "R",
                "MM": "04",
                "DD": "21",
                "YY": "24",
                "Initial": "RRG",
                # "I DO NOT Agree": null,
                # "Last Name": null
            },
        }
    },
    "test2": {
        "name": "Test2 Form",
        "path": "test2.pdf",
        "usage": {
            "fields": {
                "p2 First Name": "Joe",
                "p2 Middle Name": "S",
                "p2 MM": "03",
                "p2 DD": "31",
                "p2 YY": "24",
                "Initial": "JSS",
                # "p2 I DO NOT Agree": "null",
                "p2 Last Name": "Smith",
                "p3 First Name": "John",
                "p3 Middle Name": "R",
                "p3 MM": "01",
                "p3 DD": "25",
                "p3 YY": "21"
            },
        }
    }
}

pdfOut = "merged.pdf"
merger = PdfWriter()

for file in myFiles:
    reader = PdfReader(myFiles[file]["path"])
    reader.add_form_topname(file)
    writer = PdfWriter()
    writer.append(reader)

    # Update form fields for each page in the current PDF
    for page in range(len(reader.pages)):
        writer.update_page_form_field_values(
            writer.pages[page],
            myFiles[file]["usage"]["fields"]
        )

    # Append the pages directly to the final_writer
    for page in writer.pages:
        merger.add_page(page)

# Write the merged PDF to the output file
with open(pdfOut, "wb") as f:
    merger.write(f)

Here, I iterate over a dictionary of documents, fill in each one, and then merge them all. I get the following error, presumably because I am unfamiliar with how to use the aforementioned add_form_topname function:

Traceback (most recent call last):
  File "C:\Users\range\CodingProjects\RGBZ\Aeri4l\AllofPermitFly\PermitFlyHelper\functions\functions\standalone.py", line 54, in <module>
    writer.update_page_form_field_values(
  File "C:\Users\range\CodingProjects\RGBZ\Aeri4l\AllofPermitFly\PermitFlyHelper\functions\functions\venv\lib\site-packages\pypdf\_writer.py", line 955, in update_page_form_field_values
    value if value in k[AA.AP]["/N"] else "/Off"
  File "C:\Users\range\CodingProjects\RGBZ\Aeri4l\AllofPermitFly\PermitFlyHelper\functions\functions\venv\lib\site-packages\pypdf\generic\_data_structures.py", line 319, in __getitem__ 
    return dict.__getitem__(self, key).get_object()
KeyError: '/AP'

pubpub-zz commented 5 months ago

@ReedGraff Can you complete your code so that it is fully self-contained (merger is never declared)? Also remember that add_page() cannot be used when working with forms: you have to copy not only the pages but also the /AcroForm section. To do that, you need to use append() (possibly with the pages parameter to select a subset of pages). Can you also provide your output result?
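A minimal sketch of that advice, assuming `merger` is a pypdf PdfWriter and each reader is a PdfReader. `merge_forms` is a hypothetical helper name; only `add_form_topname()` and `append()` are taken from this thread and the linked documentation.

```python
def merge_forms(merger, named_readers):
    """Append each reader to `merger`, grouping its fields under a top name.

    Hypothetical helper, not part of pypdf. `merger` is expected to behave
    like a PdfWriter and each value of `named_readers` like a PdfReader.
    """
    for top_name, reader in named_readers.items():
        # Put this document's fields under their own grouping field so that
        # identical field names from different forms do not collide.
        reader.add_form_topname(top_name)
        # append() is used instead of add_page() so that the form data is
        # carried over together with the pages.
        merger.append(reader)
    return merger


# Usage with pypdf (file names are placeholders):
#   from pypdf import PdfReader, PdfWriter
#   readers = {"test1": PdfReader("test1.pdf"), "test2": PdfReader("test2.pdf")}
#   merge_forms(PdfWriter(), readers).write("merged.pdf")
```

To copy only some pages, append() can be given the pages parameter mentioned above instead of falling back to add_page().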

ReedGraff commented 5 months ago

I have updated the previous message, Happy Easter!

ReedGraff commented 5 months ago

This is possible with another library (pdfrw), which is my workaround at the moment:

from pdfrw import PdfReader, PdfDict, PdfName, PdfObject, PdfWriter

ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_VAL_KEY = '/V'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'

# ....

        pdfOut = "/tmp/merged.pdf"

        fields = ()
        writer = PdfWriter()
        for file in my_instance._uniformRequirements:
            pages = PdfReader("storage/" + my_instance._uniformRequirements[file]["path"]).pages

            for page in pages:
                annotations = page["/Annots"]
                for annotation in annotations:
                    if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
                        if annotation[ANNOT_FIELD_KEY]:
                            key = annotation[ANNOT_FIELD_KEY][1:-1]
                            # annotation.update(PdfDict(T='CHANGED ' + key))

                            if key in my_instance._uniformRequirements[file]["usage"]["fields"]:
                                if key in fields:
                                    key = "_" + key
                                    annotation.update(PdfDict(T=key))

                                annotation.update(PdfDict(V=my_instance._uniformRequirements[file]["usage"]["fields"][key.lstrip('_')]))
                                annotation.update(PdfDict(AP=''))
                                # print(key)
                                fields += (key,)

            writer.addpages(pages)
        writer.write(pdfOut)

pubpub-zz commented 5 months ago

I've prepared a PR to fix this issue. I've also reviewed/improved the test code:

from pypdf import PdfReader, PdfWriter

myFiles = {
    "test1": {
        "name": "Test1 Form",
        "path": "test1.pdf",
        "usage": {
            "fields": {
                "First Name": "Reed",
                "Middle Name": "R",
                "MM": "04",
                "DD": "21",
                "YY": "24",
                "Initial": "RRG",
                # "I DO NOT Agree": null,
                # "Last Name": null
            },
        }
    },
    "test2": {
        "name": "Test2 Form",
        "path": "test2-1.pdf",
        "usage": {
            "fields": {
                "p2 First Name": "Joe",
                "p2 Middle Name": "S",
                "p2 MM": "03",
                "p2 DD": "31",
                "p2 YY": "24",
                "Initial": "JSS",
                # "p2 I DO NOT Agree": "null",
                "p2 Last Name": "Smith",
                "p3 First Name": "John",
                "p3 Middle Name": "R",
                "p3 MM": "01",
                "p3 DD": "25",
                "p3 YY": "21"
            },
        }
    }
}

pdfOut = "merged2.pdf"
merger = PdfWriter()

for file in myFiles:
    print(file)
    reader = PdfReader(myFiles[file]["path"])
    reader.add_form_topname(file)
    writer = PdfWriter(clone_from=reader)

    # Update form fields for each page in the current PDF
    for page in writer.pages:
        print("page", page.page_number)
        writer.update_page_form_field_values(
            page,
            myFiles[file]["usage"]["fields"],
            auto_regenerate=False,
        )

    merger.append(writer)

# Write the merged PDF to the output file
merger.write(pdfOut)

pcraciunoiu commented 5 months ago

Thanks for working on this!

Just following up to see if I understand, does #2570 make it so that a field with the same name is filled in across all places, or does it only fill in the first value?

pubpub-zz commented 5 months ago

It should update every annotation (widget) that displays the field. The only caveat is to make sure that all pages are included. For this purpose, I recommend waiting for #2571, which will make it easier to update all pages (using page=None).
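As a sketch of the two approaches, assuming `writer` is a pypdf PdfWriter: the first helper relies on the page=None behaviour described for #2571 and therefore needs a pypdf release that includes it; the second loops over the pages explicitly, as in the test code earlier in this thread. Both helper names are hypothetical.

```python
def fill_all_pages(writer, values):
    """Fill `values` into matching fields on every page in one call.

    Assumes a pypdf version in which update_page_form_field_values()
    accepts None as the page, as described for #2571.
    """
    writer.update_page_form_field_values(None, values, auto_regenerate=False)
    return writer


def fill_page_by_page(writer, values):
    """Same intent, visiting each page explicitly."""
    for page in writer.pages:
        writer.update_page_form_field_values(
            page, values, auto_regenerate=False
        )
    return writer


# Usage with pypdf (file names are placeholders):
#   from pypdf import PdfReader, PdfWriter
#   writer = PdfWriter(clone_from=PdfReader("634_empty.pdf"))
#   fill_all_pages(writer, {"n et p": "test"}).write("test_output.pdf")
```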