mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

XFA form saved using pdf.js cannot be re-opened with pdf.js - "Warning: XFA Foreground documents are not supported" #14249

Closed: kwisatz closed 2 years ago

kwisatz commented 2 years ago

Attach (recommended) or Link to PDF file here: CIE-XFA-work.pdf

Configuration:

Steps to reproduce the problem:

  1. Open the attached form in pdf.js
  2. Fill out a field
  3. Save/Download the form
  4. Re-open the form

What is the expected behavior? (add screenshot)

The form should render again, displaying the field with filled in value.

What went wrong? (add screenshot)

The form only renders the first time. Saving it with at least one field filled in and then re-opening the saved PDF fails with "Warning: XFA Foreground documents are not supported".

Additional info

Saving/Downloading the form without filling in a field does not produce the error. I have also tested other PDFs (e.g. canadian-xfa-example.pdf) that do not exhibit this problem. From what I can see in the code, it appears that saving the form somehow causes it to lose its pureXfa characteristic: https://github.com/mozilla/pdf.js/blob/c68dc03be685a5f2de5c2e99595f9bc747ffaa34/web/app.js#L1572

Snuffleupagus commented 2 years ago

The issue here is that saving somehow corrupts the XFA data, so that even Adobe Reader, for example, refuses to open the file because of XML errors. In PDF.js, when opening the re-saved document, the following warning is printed in the console: Warning: XFA - Invalid utf-8 string.

kwisatz commented 2 years ago

Indeed. I initially thought that the UTF-8 warning appeared even when opening the form for the first time (without any fields filled in). Note, though, that masterpdfeditor correctly opens the version filled in and saved by pdf.js.

calixteman commented 2 years ago

The saved xml contains some tags whose names include an accented character (é and à). If I remove them from the pdf, everything is fine in pdf.js or acrobat.

calixteman commented 2 years ago

The serialized xml is a js string (utf-16) and we must encode it into utf-8 before saving: https://github.com/mozilla/pdf.js/blob/891f21fba6db64cd602c1a9a51826d7b9cd06af0/src/core/xfa/data.js#L78

@Snuffleupagus, I think this function: https://github.com/mozilla/pdf.js/blob/891f21fba6db64cd602c1a9a51826d7b9cd06af0/src/shared/util.js#L1017 should be called utf8StringToString and the following one stringToUTF8String, or am I wrong?
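
To illustrate the encoding step described above, here is a minimal sketch of the two conversion directions, assuming the standard TextEncoder and TextDecoder globals are available. The helper names toUtf8ByteString and fromUtf8ByteString are hypothetical and are not PDF.js functions. The sketch only shows the general technique and is not necessarily how the fix was implemented.

```javascript
// Hypothetical helpers for illustration only; these are not PDF.js APIs.

// Encode a JS (UTF-16) string as UTF-8 and return a "byte string":
// a string in which every char code is one UTF-8 byte (0-255).
function toUtf8ByteString(str) {
  const bytes = new TextEncoder().encode(str);
  let out = "";
  for (const b of bytes) {
    out += String.fromCharCode(b);
  }
  return out;
}

// Inverse direction: interpret each char code as one byte and decode
// the bytes as UTF-8. With `fatal: true`, invalid input throws a TypeError
// instead of being silently replaced.
function fromUtf8ByteString(byteStr) {
  const bytes = Uint8Array.from(byteStr, ch => ch.charCodeAt(0));
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// "é" (U+00E9) becomes the two bytes 0xC3 0xA9 in UTF-8.
const sample = toUtf8ByteString("é");
console.log(sample.length); // 2
```

Under these assumptions, a tag name such as "donnée" written out without the first step would contain the single byte 0xE9, which is not a valid UTF-8 sequence on its own. That would be consistent with the "Invalid utf-8 string" warning seen when re-opening the saved file.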