py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

XFA fields not updated when using update_page_form_field_values() #2824

Open pubpub-zz opened 2 weeks ago

pubpub-zz commented 2 weeks ago

Environment

Python 3.10, pypdf 4.3.1+dev (as of September 1st)

Code + PDF

See #2780. When modifying a PDF form that also contains an XFA form, the fields in the XFA dataset are not updated.

ljbergmann commented 2 weeks ago

For my use case I found a solution by "just" parsing the xfa:dataset XML, setting the values, and saving the XML string back. The question is: is that a valid approach for every XFA form? If it is, I'll gladly write a PR that enhances the update_page_form_field_values method or adds a separate method to accomplish this. But I'm not sure whether my approach is more than a shortcut.
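A minimal sketch of the "parse, set, serialize" idea described above, using only the standard library. It is not pypdf's implementation. It assumes the datasets packet has already been extracted from the PDF as a string, and that each field maps to a leaf element whose local name equals the field name. The helper name `set_xfa_dataset_values` is hypothetical. Forms that reuse a field name in several subforms would need a path-based lookup instead.

```python
import xml.etree.ElementTree as ET

# Namespace used by the xfa:datasets packet.
XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/"


def set_xfa_dataset_values(datasets_xml: str, values: dict[str, str]) -> str:
    """Return datasets_xml with leaf elements named in `values` updated.

    Hypothetical helper: matches on the element's local name only, so it
    cannot distinguish fields that share a name in different subforms.
    """
    # Keep the conventional "xfa" prefix when serializing again.
    ET.register_namespace("xfa", XFA_DATA_NS)
    root = ET.fromstring(datasets_xml)
    for element in root.iter():
        # Drop a leading "{namespace}" part, if any.
        local_name = element.tag.rsplit("}", 1)[-1]
        # Only touch leaf elements; containers such as subforms are skipped.
        if local_name in values and len(element) == 0:
            element.text = values[local_name]
    return ET.tostring(root, encoding="unicode")


# Sample packet shaped like the test form discussed in this thread.
sample = (
    '<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">'
    "<xfa:data><form1><topform><F1/></topform></form1></xfa:data>"
    "</xfa:datasets>"
)
updated = set_xfa_dataset_values(sample, {"F1": "hello"})
print(updated)
```

Writing the resulting string back into the PDF's XFA stream is left out here, since that step depends on how the packet is stored in the file.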

pubpub-zz commented 2 weeks ago

Working only on the XFA will not allow standard tools to extract data from the field information. My idea is simply to extend the existing update_form_fields to also update the XFA dataset if it exists.

ljbergmann commented 2 weeks ago

I found something very interesting while implementing the proposed extension of update_form_fields.

The XFA "keys" of the fields are different from the names pypdf uses for the AcroForm. To verify this, I created pypdf_field_name_test.pdf. As you can clearly see in the screenshot, the field is called F1.

If you check the key provided by pypdf, you can see that it is 'F1[0]'. You can check this with the code below.

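from pypdf import PdfReader

# Open the test form and collect its text fields as {name: value}.
reader = PdfReader("pypdf_field_name_test.pdf")
fields = reader.get_form_text_fields()

# The key is reported as 'F1[0]', not 'F1'.
print(fields)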

{'F1[0]': None}

If you look at the XFA template / dataset XML, the field is named F1.

<template xmlns="http://www.xfa.org/schema/xfa-template/3.3/"><?formServer defaultPDFRenderFormat acrobat10.0dynamic?>
    <subform name="form1" layout="tb" locale="de_DE" restoreState="auto">
        <pageSet>
            <pageArea name="Page1" id="Page1">
                <contentArea x="0.25in" y="0.25in" w="197.3mm" h="284.3mm"/>
                <medium stock="a4" short="210mm" long="297mm"/><?templateDesigner expand 1?>
            </pageArea><?templateDesigner expand 1?>
        </pageSet>
        <subform w="197.3mm" h="284.3mm" name="topform">
            <field name="F1" y="12.7mm" x="41.275mm" w="130.175mm" h="9mm">
                <ui>
                    <textEdit>
                        <border>
                            <edge stroke="lowered"/>
                        </border>
                        <margin/>
                    </textEdit>
                </ui>
                <font typeface="Arial"/>
                <para vAlign="middle"/>
                <caption>
                    <para vAlign="middle"/>
                    <value>
                        <text>This is test of pypdf field names</text>
                    </value>
                </caption>
            </field><?templateDesigner expand 1?>
        </subform>
        <proto/>
        <desc>
            <text name="version">11.0.9.20240701.1.52.2</text>
        </desc><?templateDesigner expand 1?><?renderCache.subset "Arial" 0 0 ISO-8859-1 4 72 18 0003002900370044004700480049004B004C004F005000510052005300560057005B005C FTadefhilmnopstxy?>
    </subform><?templateDesigner DefaultPreviewDynamic 1?><?templateDesigner DefaultRunAt client?><?templateDesigner FormTargetVersion 33?><?templateDesigner DefaultCaptionFontSettings face:Arial;size:10;weight:normal;style:normal?><?templateDesigner DefaultValueFontSettings face:Arial;size:10;weight:normal;style:normal?><?templateDesigner DefaultLanguage JavaScript?><?acrobat JavaScript strictScoping?><?templateDesigner Rulers horizontal:1, vertical:1, guidelines:1, crosshairs:0?><?templateDesigner Zoom 190?><?templateDesigner WidowOrphanControl 0?><?templateDesigner SaveTaggedPDF 1?><?templateDesigner SavePDFWithEmbeddedFonts 1?><?templateDesigner Grid show:1, snap:1, units:0, color:ff8080, origin:(0,0), interval:(125000,125000), objsnap:0, guidesnap:0, pagecentersnap:0?>
</template>

I suspect that the naming of the fields with [0] was a deliberate choice made in the implementation.

The questions that arise now: shouldn't the names in the XFA and the AcroForm be identical? And if not, would removing the [0] in order to update the XFA be a valid approach?

In my opinion the field names should be consistent, and therefore the AcroForm names should not contain [0].
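To make the "remove the [0]" idea concrete, here is a small sketch of the name mapping. The helper names are hypothetical and not part of pypdf. It assumes the suffix is always a bracketed integer at the end of each dot-separated name part. Stripping every index also maps F1[0] and F1[1] to the same name, so this only works for forms without repeated fields.

```python
import re

# Matches a trailing occurrence index such as "[0]" or "[12]".
_INDEX_SUFFIX = re.compile(r"\[\d+\]$")


def strip_index(name: str) -> str:
    """Remove one trailing "[n]" from a single name part (hypothetical helper)."""
    return _INDEX_SUFFIX.sub("", name)


def strip_all_indexes(qualified_name: str) -> str:
    """Remove the trailing "[n]" from every dot-separated part of a name."""
    return ".".join(strip_index(part) for part in qualified_name.split("."))


print(strip_index("F1[0]"))
print(strip_all_indexes("form1[0].topform[0].F1[0]"))
```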

Best regards, Leon

pubpub-zz commented 2 weeks ago

Some information is provided in https://pdfa.org/norm-refs/XFA-3_3.pdf; see "Field names" on page 72 and following.