michaelrsweet / pdfio

PDFio is a simple C library for reading and writing PDF files.
https://www.msweet.org/pdfio
Apache License 2.0

Flattening filled forms and annotations in PDF files #20

Closed tillkamppeter closed 11 months ago

tillkamppeter commented 3 years ago

Follow-up from the OpenPrinting micro-conference at Linux Plumbers 2021

cups-filters uses QPDF for most of its PDF handling tasks that are not rendering or rasterizing. The disadvantage is that QPDF is C++ (ugly, harder to understand/maintain/port). The filters (filter functions) using it are pdftopdf(), pclmtoraster(), rastertopdf(), pdftops(), ghostscript(), and bannertopdf(). If one could replace QPDF with pdfio here, one could get rid of C++ altogether in cups-filters.

Unfortunately, pdfio does not support all the functionality needed for cups-filters (QPDF only has it because QPDF author Jay Berkenbilt implemented my feature requests, together with some GSoC students). So, after freeing cups-filters from the use of undocumented Poppler APIs with the help of QPDF, the next step is eliminating C++ with the help of pdfio.

One of the missing features is flattening filled PDF forms and annotations: moving the filled/annotated text out of obscure, separate data structures and right into the graphical content of the PDF itself, so that the result is a static PDF in which the filled/annotated text is an integral part of the page. Only this way can one reliably apply the page management functionality of pdftopdf(): number-up, print-scaling, page-ranges, page-set, ...

Currently pdftopdf() flattens the forms with QPDF before it does its page management work, which also uses QPDF. To convert it to use only pdfio, pdfio will need form/annotation-flattening capabilities.
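
To illustrate why unflattened form fields and annotations complicate number-up and print-scaling: each annotation carries its own /Rect in the page's default user space, so scaling or moving the page content does not move the annotation unless its rectangle is transformed the same way. The following is only a minimal sketch of that arithmetic in plain C. The types and function names (rect_t, placement_t, fit_page_in_cell, place_rect) are hypothetical and are not part of pdfio, QPDF, or cups-filters.

```c
#include <assert.h>

/* Rectangle in PDF user space: lower-left (x1, y1), upper-right (x2, y2). */
typedef struct { double x1, y1, x2, y2; } rect_t;

/* Hypothetical placement of one source page on the output sheet:
 * uniform scale followed by a translation. */
typedef struct { double scale, tx, ty; } placement_t;

/* Fit a source page of size pw x ph into an N-up cell whose lower-left
 * corner is (cx, cy) and whose size is cw x ch, preserving the aspect
 * ratio and centering the page in the cell. */
placement_t fit_page_in_cell(double pw, double ph,
                             double cx, double cy, double cw, double ch)
{
  placement_t p;
  double sx, sy;

  assert(pw > 0.0 && ph > 0.0);

  sx = cw / pw;
  sy = ch / ph;

  p.scale = sx < sy ? sx : sy;
  p.tx    = cx + (cw - pw * p.scale) / 2.0;
  p.ty    = cy + (ch - ph * p.scale) / 2.0;

  return p;
}

/* Apply the same placement to an annotation rectangle so that it stays
 * on top of the page content it belongs to. */
rect_t place_rect(rect_t r, placement_t p)
{
  rect_t out;

  out.x1 = r.x1 * p.scale + p.tx;
  out.y1 = r.y1 * p.scale + p.ty;
  out.x2 = r.x2 * p.scale + p.tx;
  out.y2 = r.y2 * p.scale + p.ty;

  return out;
}
```

For example, placing a 600x800 page into a 300x400 cell at (0, 400) halves every coordinate and shifts it up by 400, so a field at (100, 200)-(300, 240) ends up at (50, 500)-(150, 520). With flattened content this bookkeeping is unnecessary, because the field text is already part of the page content that gets scaled.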

michaelrsweet commented 3 years ago

I need to do more investigation on this. Certainly a program using PDFio could read the necessary data and add it to a PDF page stream, but if it isn’t too onerous I might be able to add this as an option for the pdfioPageCopy function.

michaelrsweet commented 3 years ago

@tillkamppeter Can you attach a few sample PDF files that require this flattening? I want to make sure I’m looking at the right version of PDF forms data (it has changed over the years…)

tillkamppeter commented 3 years ago

Here are some of these forms: Corona-Aufklärungs-Dokumentationsbogen.pdf Corona-Schutzimpfng_Impffragebogen.pdf form-gs-694734.pdf mcw_AnmeldeformularAutomatisierteBefundsammlung.pdf pdf-reisepass-antrag-erwachsene-data.pdf vbv_antrag_mitarbeiter-und-selbstaendige_180604.pdf

You can fill them and save the filled forms with evince.

All these forms are in their original, empty state (so as not to post private information here). They are all German, but this could also help to check whether the mechanism works with special characters (äöüÄÖÜß).

tillkamppeter commented 3 years ago

Here are some more: interactiveform_enabled.pdf interactiveform_enabled_filled.pdf form_english.pdf form_russian.pdf

michaelrsweet commented 11 months ago

OK, so I've done some testing with both rasterization and "native" PDF printing by the various printers I have access to in my lab. Poppler/Xpdf's pdftoppm already handles rendering form content (both the elements and the current value), and the Lexmark, HP, and Rollo PDF printers I have all do so as well.

Since the form content is referenced by the page dictionary, it already gets copied by the pdfioPageCopy function. We'll still need to make sure ipptransform and any of the cups-filters code that does N-up/imposition of multiple pages also copies the Annots values over, but that all happens outside PDFio...