Open pubpub-zz opened 2 weeks ago
For my use case I found a solution by "just" parsing the xfa:dataset XML, setting the values, and saving the XML string back. The question is: is that a valid approach for every XFA form? If it is, I'll gladly write a PR that enhances the update_page_form_field_values method or adds a separate method for this. But I'm not sure whether my approach is more than a shortcut.
Working only on the XFA will not allow standard tools to extract data from the field information. My idea is just to extend the existing update_form_fields to also update the XFA dataset if it exists.
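For reference, the dataset-editing approach from the first comment could look roughly like the sketch below. It uses only the standard library; the function name set_dataset_values and the sample XML are made up for illustration, and it assumes every field name is unique within the dataset. Reading the dataset out of the PDF and writing it back are not shown.

```python
import xml.etree.ElementTree as ET

# Namespace of the XFA data packet; registering the prefix keeps "xfa:"
# in the serialized output instead of an auto-generated "ns0:".
XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/"
ET.register_namespace("xfa", XFA_DATA_NS)


def set_dataset_values(datasets_xml: str, values: dict[str, str]) -> str:
    """Return a copy of the dataset XML with matching leaf elements updated.

    Hypothetical helper: matches on the local element name only, so it is
    not safe for forms where the same name occurs in several subforms.
    """
    root = ET.fromstring(datasets_xml)
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop "{namespace}" if present
        if local_name in values and len(elem) == 0:  # only touch leaf elements
            elem.text = values[local_name]
    return ET.tostring(root, encoding="unicode")


# Illustrative dataset, not taken from the attached PDF.
sample = (
    '<xfa:datasets xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/">'
    "<xfa:data><form1><F1/></form1></xfa:data></xfa:datasets>"
)
updated = set_dataset_values(sample, {"F1": "hello"})
print(updated)
```

Matching on the local name alone is the "shortcut" part: it is fine for a flat form like the test file, but a general solution would probably need to resolve the full path of each field.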
I identified something very interesting during the implementation of the proposed extension of update_form_fields.
The XFA "keys" of fields are different from the names used by pypdf in the AcroForm. To verify, I created pypdf_field_name_test.pdf. As the screenshot shows, the field is called F1.
If you check the key provided by pypdf, you can see that it is 'F1[0]'. You can check with the code below.
from pypdf import PdfReader
reader = PdfReader("pypdf_field_name_test.pdf")
fields = reader.get_form_text_fields()
print(fields)
# Output: {'F1[0]': None}
If you look at the XFA template / dataset XML, the field is named F1.
<template xmlns="http://www.xfa.org/schema/xfa-template/3.3/"><?formServer defaultPDFRenderFormat acrobat10.0dynamic?>
<subform name="form1" layout="tb" locale="de_DE" restoreState="auto">
<pageSet>
<pageArea name="Page1" id="Page1">
<contentArea x="0.25in" y="0.25in" w="197.3mm" h="284.3mm"/>
<medium stock="a4" short="210mm" long="297mm"/><?templateDesigner expand 1?>
</pageArea><?templateDesigner expand 1?>
</pageSet>
<subform w="197.3mm" h="284.3mm" name="topform">
<field name="F1" y="12.7mm" x="41.275mm" w="130.175mm" h="9mm">
<ui>
<textEdit>
<border>
<edge stroke="lowered"/>
</border>
<margin/>
</textEdit>
</ui>
<font typeface="Arial"/>
<para vAlign="middle"/>
<caption>
<para vAlign="middle"/>
<value>
<text>This is test of pypdf field names</text>
</value>
</caption>
</field><?templateDesigner expand 1?>
</subform>
<proto/>
<desc>
<text name="version">11.0.9.20240701.1.52.2</text>
</desc><?templateDesigner expand 1?><?renderCache.subset "Arial" 0 0 ISO-8859-1 4 72 18 0003002900370044004700480049004B004C004F005000510052005300560057005B005C FTadefhilmnopstxy?>
</subform><?templateDesigner DefaultPreviewDynamic 1?><?templateDesigner DefaultRunAt client?><?templateDesigner FormTargetVersion 33?><?templateDesigner DefaultCaptionFontSettings face:Arial;size:10;weight:normal;style:normal?><?templateDesigner DefaultValueFontSettings face:Arial;size:10;weight:normal;style:normal?><?templateDesigner DefaultLanguage JavaScript?><?acrobat JavaScript strictScoping?><?templateDesigner Rulers horizontal:1, vertical:1, guidelines:1, crosshairs:0?><?templateDesigner Zoom 190?><?templateDesigner WidowOrphanControl 0?><?templateDesigner SaveTaggedPDF 1?><?templateDesigner SavePDFWithEmbeddedFonts 1?><?templateDesigner Grid show:1, snap:1, units:0, color:ff8080, origin:(0,0), interval:(125000,125000), objsnap:0, guidesnap:0, pagecentersnap:0?>
</template>
I suspect that naming the fields with [0] was a deliberate choice made in the implementation.
The question that arises now: shouldn't the names in the XFA and the AcroForm be identical? And if not, would removing the [0] to update the XFA be a valid approach?
In my opinion the names of fields should be consistent, and therefore the AcroForm names should not contain [0].
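A minimal sketch of the name mapping being asked about, assuming the only difference between the two names is a trailing [n] index. The helper split_indexed_name is hypothetical and not part of pypdf. It returns the index instead of discarding it, because the index may be what tells repeated fields with the same name apart, in which case dropping it blindly would be lossy.

```python
import re

# Matches a trailing occurrence index such as "[0]" or "[12]".
_INDEX_SUFFIX = re.compile(r"\[(\d+)\]$")


def split_indexed_name(name: str) -> tuple[str, int]:
    """Split 'F1[0]' into ('F1', 0); names without an index get index 0."""
    match = _INDEX_SUFFIX.search(name)
    if match is None:
        return name, 0
    return name[: match.start()], int(match.group(1))


print(split_indexed_name("F1[0]"))
```

With the test file above, this maps the pypdf key 'F1[0]' to the XFA name F1. Whether stripping the index is valid for every form is exactly the open question.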
Best regards, Leon
Some information is provided in https://pdfa.org/norm-refs/XFA-3_3.pdf
See "Field names", page 72 onward.
Environment
Python 3.10, pypdf 4.3.1+dev as of September 1st
Code + PDF
cf #2780 When modifying a form with XFA form, the fields in the XFA dataset are not modified