py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Multi-line pdf form text fields after version 3.9.0 truncates text and does not automatically wrap text to the next line in apple pdf previewer #2869

Closed jtay-pm closed 1 month ago

jtay-pm commented 1 month ago

Using an Adobe PDF form template, the text in a multi-line text field is truncated when viewed in Apple's default PDF viewer (Preview). It displays correctly in other PDF viewers. This only happens with pypdf > 3.9.1. When I pin pypdf to 3.9.1, the text in a multi-line form field wraps in all PDF viewers.

Is it a known issue that recent versions of pypdf are not compatible with Apple's PDF viewer for multi-line text fields?

Code + PDF

Comparing the text field output of 4.2.0 with that of 3.9.1, I realised that 4.2.0 adds the following additional property to the text field, which 3.9.1 does not. This is specifically what causes the text to be truncated instead of wrapping to a new line:

/AP <<
/N 492 0 R
>>

PDF text field generated with 4.2.0

<<
/DA (\057Lato\04010\040Tf\0400\040g)
/F 4
/FT /Tx
/Ff 8392704
/MK <<
>>
/P 55 0 R
/Rect [ 55.1379 531.632 569.251 687.238 ]
/Subtype /Widget
/T (policy\137insurance\137text)
/Type /Annot
/V (Lorem\040ipsum\040dolor\040sit\040amet\054\040consectetur\040adipiscing\040elit\054\040sed\040do\040eiusmod\040tempor\040incididunt\040ut\040labore\040et\040dolore\040magna\040aliqua\056\040Ut\040enim\040ad\040minim\040veniam\054\040quis\040nostrud\040exercitation\040ullamco\040laboris\040nisi\040ut\040aliquip\040ex\040ea\040commodo\040consequat\056\040Duis\040aute\040irure\040dolor\040in\040reprehenderit\040in\040voluptate\040velit\040esse\040cillum\040dolore\040eu\040fugiat\040nulla\040pariatur\056\040Excepteur\040sint\040occaecat\040cupidatat\040non\040proident\054\040sunt\040in\040culpa\040qui\040officia\040deserunt\040mollit\040anim\040id\040est\040laborum\056)
/AP <<
/N 492 0 R
>>
>>
endobj
347 0 obj

PDF text field generated with 3.9.1

<<
/DA (\057Lato\04010\040Tf\0400\040g)
/F 4
/FT /Tx
/Ff 8392704
/MK <<
>>
/P 55 0 R
/Rect [ 55.137900000000002 531.63199999999995 569.25099999999998 687.23800000000006 ]
/Subtype /Widget
/T (policy\137insurance\137text)
/Type /Annot
/V (Lorem\040ipsum\040dolor\040sit\040amet\054\040consectetur\040adipiscing\040elit\054\040sed\040do\040eiusmod\040tempor\040incididunt\040ut\040labore\040et\040dolore\040magna\040aliqua\056\040Ut\040enim\040ad\040minim\040veniam\054\040quis\040nostrud\040exercitation\040ullamco\040laboris\040nisi\040ut\040aliquip\040ex\040ea\040commodo\040consequat\056\040Duis\040aute\040irure\040dolor\040in\040reprehenderit\040in\040voluptate\040velit\040esse\040cillum\040dolore\040eu\040fugiat\040nulla\040pariatur\056\040Excepteur\040sint\040occaecat\040cupidatat\040non\040proident\054\040sunt\040in\040culpa\040qui\040officia\040deserunt\040mollit\040anim\040id\040est\040laborum\056)
>>
endobj
347 0 obj
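
The two dumps above are easier to compare once the escaped strings and the /Ff bit field are decoded. The following is a minimal sketch using only the standard library; the helper names `decode_field_flags` and `decode_octal_escapes` are hypothetical and not part of pypdf, and the flag table lists only the text-field bits from the PDF specification.

```python
import re

# Field flag bits relevant to text fields, as defined in the PDF specification.
# Bit positions are 1-based, so bit n corresponds to the value 1 << (n - 1).
TEXT_FIELD_FLAGS = {
    1: "ReadOnly",
    2: "Required",
    3: "NoExport",
    13: "Multiline",
    14: "Password",
    21: "FileSelect",
    23: "DoNotSpellCheck",
    24: "DoNotScroll",
    25: "Comb",
    26: "RichText",
}


def decode_field_flags(ff: int) -> list[str]:
    """Return the names of the flags that are set in an /Ff value."""
    return [
        name
        for bit, name in sorted(TEXT_FIELD_FLAGS.items())
        if ff & (1 << (bit - 1))
    ]


def decode_octal_escapes(literal: str) -> str:
    """Replace \\ddd octal escapes in a PDF literal string with their characters."""
    return re.sub(r"\\([0-7]{1,3})", lambda m: chr(int(m.group(1), 8)), literal)


print(decode_field_flags(8392704))
print(decode_octal_escapes(r"\057Lato\04010\040Tf\0400\040g"))
print(decode_octal_escapes(r"policy\137insurance\137text"))
```

Decoded this way, both versions describe the same multi-line field with the same default appearance string, which is consistent with the /AP entry being the only structural difference between the two dumps.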

Traceback

If I modify the PdfWriter class provided by pypdf and remove the annotations it automatically adds for text field properties, the text field wraps automatically. Something in the annotations added automatically by the PdfWriter class is not compatible with Apple's PDF viewer.


Specifically, commenting out the lines that add the appearance stream to the field when /AP is not defined in the dictionary object also resolves the issue in Apple's PDF viewer. What is the benefit of adding this custom appearance stream?

pubpub-zz commented 1 month ago

The ["/AP"]["/N"] entry stores the field's appearance as generated by the program that added or modified the field. In PDF 1.7 this was optional, but in PDF 2.0 it has become mandatory.

We have observed many cases where viewers cannot render the field on their own and require this entry. In the closed issue #2756, it was shown that manual CR/LF line breaks work.

Calling .set_need_appearances_writer(True) asks the viewer to regenerate the rendering, but this may not work in all viewers.

Wrapping the text automatically is quite tough. Personally, I will not have time for this feature. Feel free to propose a PR.
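
Since manual line breaks are reported to work, one possible stopgap is to insert them before the value is passed to the writer. The sketch below is only an approximation under stated assumptions: `wrap_for_field` is a hypothetical helper, not a pypdf function, and `avg_glyph_em` (average glyph width as a fraction of the font size) is a guess rather than a value read from the font metrics. It also ignores any padding the viewer applies inside the field.

```python
import textwrap


def wrap_for_field(text, rect, font_size=10.0, avg_glyph_em=0.5):
    """Insert line breaks so each line roughly fits the width of the field.

    rect is the field's /Rect as [x0, y0, x1, y1] in points.
    avg_glyph_em is an assumed average glyph width (fraction of font size);
    a real implementation would measure the text with the font's own metrics.
    """
    x0, _, x1, _ = rect
    field_width = abs(x1 - x0)
    # Estimated number of characters that fit on one line; never below 1.
    max_chars = max(1, int(field_width / (font_size * avg_glyph_em)))
    return "\n".join(textwrap.wrap(text, width=max_chars))


# Values taken from the field dump in this issue: /Rect and the 10 pt font in /DA.
rect = [55.1379, 531.632, 569.251, 687.238]
value = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim "
    "veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea "
    "commodo consequat."
)
wrapped = wrap_for_field(value, rect)
print(wrapped)
```

The wrapped string would then be used as the field value in place of the original text. Because the character budget is estimated, lines may still come out slightly too long or too short for a proportional font.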

jtay-pm commented 1 month ago

@pubpub-zz Instead of overriding the writer class, for the Tx form fields I just dropped the normal appearance (/N) that the writer adds automatically, right after calling writer.update_page_form_field_values. Our PDF templates already embed the fonts properly and every form field in them has a /DA property, so I didn't see why we need the /N added by the PDF writer.

This seems to work well when I open the PDF in Safari, Chrome, Adobe, and Apple Preview. Is there something else I'm not considering?


    # update each page with the data mapping
    for page in writer.pages:
      writer.update_page_form_field_values(page, self.data_dict)
      for annotation in page.annotations:
        annotation = annotation.get_object()
        # Form fields are of type: widgets
        is_annotation_sub_type_widget = annotation.get(
          AnnotationDictionaryAttributes.Subtype) == ANNOTATION_SUBTYPE.WIDGET.value
        if is_annotation_sub_type_widget:
          if annotation.get(AnnotationDictionaryAttributes.FT) == "/Tx":  # Text field type
            # Remove only the normal appearance stream (/N); the /AP dictionary itself stays
            if "/AP" in annotation and "/N" in annotation["/AP"]:
              print(f"Removing appearance override for field: {annotation.get('/T')}")
              del annotation["/AP"]["/N"]  # Deletes the /N entry, leaving any other /AP entries
              print(f"Normal appearance removed: {annotation.get('/AP')}")

pubpub-zz commented 1 month ago

This solution works for you, but it cannot be considered valid for all viewers.

pubpub-zz commented 1 month ago

Can we close this issue?