py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Updated pdf fields don't show up when page is written #355

Closed segevmalool closed 1 year ago

segevmalool commented 7 years ago

I'd like to use PyPDF2 to fill out a pdf form. So far, everything is going smoothly, including updating the field text. But when I write the pdf to a file, there is apparently no change in the form. Running this code:

import datetime as dt
from PyPDF2 import PdfFileReader, PdfFileWriter
import re

form701 = PdfFileReader('ABC701LG.pdf')
page = form701.getPage(0)
filled = PdfFileWriter()

#removing extraneous fields
r = re.compile('^[0-9]')
fields = sorted(list(filter(r.match, form701.getFields().keys())), key = lambda x: int(x[0:2]))

filled.addPage(page)
filled.updatePageFormFieldValues(filled.getPage(0), 
                                 {fields[0]: 'some filled in text'})

print(filled.getPage(0)['/Annots'][0].getObject()['/T'])
print(filled.getPage(0)['/Annots'][0].getObject()['/V'])

with open('test.pdf','wb') as fp:
    filled.write(fp)

prints text:

1 EFFECTIVE DATE OF THIS SCHEDULE <i.e. the field name> some filled in text

But when I open up test.pdf, there is no added text on the page! Help!

mwhit74 commented 7 years ago

I am having this same issue. The data does not show up in Adobe Reader unless you activate the field. The data does show up in Bluebeam but if you print, flatten, or push the pdf to a studio session all the data is lost.

When the file is opened in Bluebeam it automatically thinks that the user has made changes, denoted by the asterisk next to the file name in the tab.

If you export the fdf file from Bluebeam all the data is in the fdf file in the proper place.

If you change any attribute of the field in Bluebeam or Adobe, it will recognize the text in that field. It will print correctly and flatten correctly. I am not sure if it will push to the Bluebeam studio but I assume it will. You can also just copy and paste the text in the field back into that field and it will render correctly.

I have not found any help after googling around all day. I think it is an issue with PyPDF2 not "redrawing" the PDF correctly.

I have contacted Bluebeam support and they have returned saying essentially that it is not on their end.

mwhit74 commented 7 years ago

Ok I think I have narrowed this down some by just comparing two different pdfs.

For reference, I am trying to read a PDF that was originally created by Bluebeam, use the updatePageFormFieldValues() function in PyPDF2 to push a bunch of data from a database into the form fields, and save. At some point we want to flatten these, and that is when it all goes wrong in Bluebeam. In Adobe it is messed up from the start, in that you don't see any values in the form fields until you scroll over them with the mouse.

It appears there is a problem with the stream object that follows the object(s) representing the text form field. See below.

This is a sample output from a pdf generated by PyPDF2 for a text form field:

26 0 obj<</Subtype/Widget/M(D:20160512102729-05'00')/NM(OEGVASQHFKGZPSZW)/MK<</IF<</A[0 0]>>>>/F 4/C[1 0 0]/Rect[227.157 346.3074 438.2147 380.0766]/V(Marshall CYG)/Type/Annot/FT/Tx/AP<</N 27 0 R>>/DA(0 0 0 rg /Helv 12 Tf)/T(Owner Group)/BS 29 0 R/Q 0/P 3 0 R>>
endobj
27 0 obj<</Type/XObject/Matrix[1 0 0 1 0 0]/Resources<</ProcSet[/PDF/Text]/Font<</Helv 28 0 R>>>>/Length 41/FormType 1/BBox[0 0 211.0577 33.76923]/Subtype/Form>>
stream
0 0 211.0577 33.76923 re W n /Tx BMC EMC 
endstream
endobj
28 0 

And if I back up and edit the same base file in Bluebeam, the output from that PDF for a text form field looks like this (I think the border object can be ignored):

16 0 obj<</Type/Annot/P 5 0 R/F 4/C[1 0 0]/Subtype/Widget/Q 0/FT/Tx/T(Owner Group)/MK<</IF<</A[0 0]>>>>/DA(0 0 0 rg /Helv 12 Tf)/AP<</N 18 0 R>>/M(D:20170906125217-05'00')/Rect[227.157 346.3074 438.2147 380.0766]/NM(OEGVASQHFKGZPSZW)/BS 17 0 R/V(Marshall CYG)>>
endobj
17 0 obj<</W 1/S/S/Type/Border>>
endobj
18 0 obj<</Type/XObject/Subtype/Form/FormType 1/BBox[0 0 211.0577 33.7692]/Resources<</ProcSet[/PDF/Text]/Font<</Helv 12 0 R>>>>/Matrix[1 0 0 1 0 0]/Length 106>>
stream
0 0 211.0577 33.7692 re W n /Tx BMC BT 0 0 0 rg /Helv 12 Tf 1 0 0 1 2 12.6486 Tm (Marshall CYG) Tj ET EMC 
endstream

Ok so the biggest difference here is the stream object at the end. The value /V(Marshall CYG) gets updated in the first object of each pdf, objects 26 and 16 respectively. However the stream object in the PyPDF2 generated pdf does not get updated and the stream object from Bluebeam does get updated.
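
A quick way to spot this symptom in other files is to look for text-field appearance streams whose marked-content section is empty. The sketch below is only an illustration: the function name `count_empty_text_appearances` is made up here, and it assumes the streams are stored uncompressed, as in the two dumps above. Compressed streams would have to be decoded first.

```python
import re

# Matches "/Tx BMC" immediately followed by "EMC", i.e. a text-field
# appearance with no text-drawing operators in between.
EMPTY_TX = re.compile(rb"/Tx\s+BMC\s+EMC")


def count_empty_text_appearances(pdf_bytes: bytes) -> int:
    """Count empty /Tx marked-content sections in raw (uncompressed) PDF bytes.

    Hypothetical helper for illustration; it does not decode compressed streams.
    """
    return len(EMPTY_TX.findall(pdf_bytes))


# The two stream bodies quoted above, as raw bytes.
pypdf2_stream = b"0 0 211.0577 33.76923 re W n /Tx BMC EMC \n"
bluebeam_stream = (
    b"0 0 211.0577 33.7692 re W n /Tx BMC BT 0 0 0 rg /Helv 12 Tf "
    b"1 0 0 1 2 12.6486 Tm (Marshall CYG) Tj ET EMC \n"
)

print(count_empty_text_appearances(pypdf2_stream))    # 1
print(count_empty_text_appearances(bluebeam_stream))  # 0
```

To check a whole file, the same function can be given the result of `open(path, "rb").read()`.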

In testing this theory I made a copy of the PyPDF2 PDF and manually edited the stream object in a text editor. I opened this new file in Bluebeam and flattened it. It worked. This also appears to work in Adobe Reader.
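
For anyone who wants to reproduce the manual edit, here is a rough sketch of how the stream text could be generated instead of typed by hand. It is not part of PyPDF2: `build_text_appearance` and `escape_pdf_string` are names invented for this example, and the default text offsets (2 and 12.6486) are simply copied from the Bluebeam output above rather than computed from the font.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def build_text_appearance(value, width, height,
                          da="0 0 0 rg /Helv 12 Tf",
                          x_offset=2, y_offset=12.6486):
    """Return content-stream text that clips to the field box and draws `value`.

    Hypothetical helper: `da` is the field's /DA string, and the offsets are
    assumptions taken from the Bluebeam sample, not derived from font metrics.
    """
    return (
        f"0 0 {width} {height} re W n /Tx BMC BT {da} "
        f"1 0 0 1 {x_offset} {y_offset} Tm "
        f"({escape_pdf_string(value)}) Tj ET EMC"
    )


stream = build_text_appearance("Marshall CYG", 211.0577, 33.7692)
print(stream)
```

When pasting the result into a file by hand, the /Length entry of the stream object has to be updated to match the new byte count as well.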

Now how to fix....

ademidun commented 6 years ago

A potential solution seems to be setting the Need Appearances flag. Not yet sure how to implement in pypdf2 but these 2 links may provide some clues: https://stackoverflow.com/questions/12198742/pdf-form-text-hidden-unless-clicked https://forums.adobe.com/thread/305250

ademidun commented 6 years ago

Okay, I think I have figured it out. If you read section 12.7.2 (page 431) of the PDF 1.7 specification, you will see that you need to set the NeedAppearances flag of the Acroform.

reader = PdfFileReader(open(infile, "rb"), strict=False)

if "/AcroForm" in reader.trailer["/Root"]:
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )
writer = PdfFileWriter()

if "/AcroForm" in writer._root_object:
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )

Tromar44 commented 6 years ago

ademidun - Can you elaborate on your suggested solution above? I too am having problems with pdf forms, edited with PyPDF2, not showing field values without clicking in the field. With the code example below, how do you "set the NeedAppearances flag of the Acroform"?

from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input = PdfFileReader(open("myInputPdf.pdf", "rb"))

field_dictionary = {'Make': 'Toyota', 'Model': 'Tacoma'}

for pageNum in range(input.numPages):
    pageObj = input.getPage(pageNum)
    output.addPage(pageObj)
    output.updatePageFormFieldValues(pageObj, field_dictionary)

outputStream = open("myOutputPdf.pdf", "wb")
output.write(outputStream)

I tried adding in your IF statements but two problems arise: 1) NameObject and BooleanObject are not defined within my PdfFileReader "input" variable (I do not know how to do that) and 2) "/AcroForm" is not found within the PdfFileWriter object (my "output" variable).

Thanks for any help!

ademidun commented 6 years ago

@Tromar44 Preamble, make sure your form is interactive. E.g. The pdf must already have editable fields.

1) Sorry, I forgot to mention that you will have to import them: from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

2) Are you sure you are using output._root_object["/AcroForm"] or output.trailer["/Root"]["/AcroForm"] to access the "/AcroForm" key, and not just output["/AcroForm"]?

Tromar44 commented 6 years ago

@ademidun I thank you very much for your help but unfortunately I'm still not having any luck. To be clear, my simple test pdf form does have two editable fields and the script will populate them with "Toyota" and "Tacoma" respectively but those values are not visible unless I click on the field in the form (they become invisible again after the field loses focus). Here is the rewritten code that includes your suggestions and the results of running the code in inline comments.

from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

infile = "myInputPdf.pdf"
outfile = "myOutputPdf.pdf"

reader = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in reader.trailer["/Root"]: # result: True - following "IF" code is executed
    print(True)
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

writer = PdfFileWriter()
if "/AcroForm" in writer._root_object: # result: False - following "IF" code is NOT executed
    print(True)
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

if "/AcroForm" in writer._root_object["/AcroForm"]: # result: KeyError: '/AcroForm'
    print(True)
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

if "/AcroForm" in writer.trailer["/Root"]["/AcroForm"]:  # result: AttributeError: 'PdfFileWriter' object has no attribute 'trailer'
    print(True)
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

field_dictionary = {"Make": "Toyota", "Model": "Tacoma"}

writer.addPage(reader.getPage(0))
writer.updatePageFormFieldValues(writer.getPage(0), field_dictionary)

outputStream = open(outfile, "wb")
writer.write(outputStream)

I would definitely appreciate any more suggestions that you may have! Thank you very much!

ademidun commented 6 years ago

It may also be a browser issue. I don't have the links anymore, but I remember reading about some issues when opening/creating a PDF in Preview on Mac or viewing it in the browser vs. using an Adobe app etc. Maybe try googling things like "form fields only showing on click" or "form fields only active on click using preview mac".

I also recommend reading the PDF spec link I posted. It's a bit dense, but a combination of all these should get you in the right direction.

ademidun commented 6 years ago

@Tromar44 Okay, I also found this snippet from my code, maybe it will help:

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)
            })

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        # del writer._root_object["/AcroForm"]['NeedAppearances']
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

Tromar44 commented 6 years ago

@ademidun That worked perfectly (I'd high five you right now if I could)! Thank you very much! For anyone else interested, the following worked for me:

from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

infile = "input.pdf"
outfile = "output.pdf"

reader = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in reader.trailer["/Root"]:
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

writer = PdfFileWriter()
set_need_appearances_writer(writer)
if "/AcroForm" in writer._root_object:
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

field_dictionary = {"Make": "Toyota", "Model": "Tacoma"}

writer.addPage(reader.getPage(0))
writer.updatePageFormFieldValues(writer.getPage(0), field_dictionary)

with open(outfile, "wb") as fp:
    writer.write(fp)

kissmett commented 6 years ago

@ademidun you're great!!!

caver456 commented 6 years ago

Just stumbled upon this solution - great work! A couple of issues I noticed - can you reproduce them? - won't have time to send test case details for a couple of days yet if you need them; we had been using the good-ol fdfgen-then-pdftk-subprocess-call method but would like to get away from the external pdftk dependency so pypdf2 is great:

shurshilov commented 6 years ago

output.pdf: the fix does not work for some fields in this file. For example, the first field (the phone) does not work, while the second one and a few more fields work for some reason, so the fix is not fully working.

saipawan999 commented 6 years ago

Hi, I am facing the same issue. I have also tried setting NeedAppearances to true. When I edit a PDF using PyPDF2, some fields display correctly and some display only after I click on the field. Please help me out with this issue, as it is blocking my work. Thank you

fvw222 commented 5 years ago

The code works great, but only for PDFs with one page. I tried splitting my PDF into several one-page files and looped through them. This worked great, but when I merged them back together, the click-to-reveal-text problem reemerged. The problem lies in the .addPage command of the PdfFileWriter.

for page_number in range(pdf.total_pages):
    pdf2.addPage(pdf.getPage(page_number))
    pdf2.updatePageFormFieldValues(pdf2.getPage(page_number), field_dictionary)

When I enter this and try to save, I get an error message: "TypeError: argument should be integer or None, not 'NullObject'". It seems that .addPage does not append to the file writer but treats each page as a separate object. Does someone have a solution for this?

Problem solved: I figured out the problem was that I was running a protected PDF. I manually split the PDF and manually recombined it, and now it works great. The solution is often right in front of your nose.
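
In case it saves someone the same detour, a check like the one below could be run before copying pages. This is only a sketch and assumes that "protected" here means encrypted. `ensure_decrypted` is a made-up helper name; it relies on the `isEncrypted` attribute and the `decrypt(password)` method of PdfFileReader, where `decrypt` returns 0 if the password does not match.

```python
def ensure_decrypted(reader, password=""):
    """Decrypt `reader` in place if it is encrypted; raise if that fails.

    Hypothetical helper. `reader` is expected to behave like PyPDF2's
    PdfFileReader: an `isEncrypted` attribute and a `decrypt(password)`
    method that returns 0 when the password is rejected.
    """
    if getattr(reader, "isEncrypted", False):
        if reader.decrypt(password) == 0:
            raise ValueError("PDF is encrypted and the password was rejected")
    return reader


# Usage (file name is a placeholder):
# reader = ensure_decrypted(PdfFileReader(open("form.pdf", "rb"), strict=False))
```

Whether an empty password is enough depends on how the file was protected, so this may still raise for some documents.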

aatish29 commented 5 years ago

Hi All,

Thanks for your help.

I was able to view the text fields of the PDF form using PyPDF2, but I still could not figure out how to make the checkboxes of the PDF form visible (NeedAppearances).

Tried with this logic:

catalog = writer._root_object
if '/AcroForm' in catalog:
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

Thanks in advance.

karnh commented 5 years ago

I found an answer for the checkboxes issue at https://stackoverflow.com/questions/35538851/how-to-check-uncheck-checkboxes-in-a-pdf-with-python-preferably-pypdf2.

def updateCheckboxValues(page, fields):

    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        for field in fields:
            if writer_annot.get('/T') == field:
                writer_annot.update({
                    NameObject("/V"): NameObject(fields[field]),
                    NameObject("/AS"): NameObject(fields[field])
                })

And as the comment says checked value could be anything depending on how the form was created. It was present in '/AP' for me. Which I extracted using list(writer_annot.get('/AP').get('/N').keys())[0].
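
Taking the first key can return "/Off" if that state happens to be listed first, so a slightly more defensive lookup might skip it explicitly. The sketch below is an assumption-laden illustration: `find_on_state` is a hypothetical helper, the field name "agree" and the state "/Yes" are invented, and it assumes /AP → /N is a dictionary of state names (as it usually is for checkboxes) rather than a single stream.

```python
def find_on_state(annot):
    """Return the checkbox's "on" appearance name, or None if there is none.

    Hypothetical helper. `annot` is any mapping shaped like a widget
    annotation dictionary: /AP -> /N -> {state name: appearance}.
    """
    try:
        normal = annot["/AP"]["/N"]
    except KeyError:
        return None
    for state in normal.keys():
        if state != "/Off":
            return state
    return None


# Plain dicts stand in for the annotation objects here.
checkbox = {"/T": "agree", "/AP": {"/N": {"/Off": None, "/Yes": None}}}
print(find_on_state(checkbox))  # /Yes
```

The returned name could then be passed as the value for that field in updateCheckboxValues above.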

madornetto commented 5 years ago

OK, I have implemented the above and it works on my PDF forms. However, once the form has been updated by the Python code, it can't be run through the code a second time, as getFormFields returns an empty list. If I open the updated PDF in Adobe, add a space to the end of a form field value, save, and run the code on the form again, getFormFields returns the correct list.

ghost commented 5 years ago

I am having the same problem: fields not visible fixed by above-mentioned set_need_appearances_writer() approach but getFormFields/pdftk dump_data_fields does not see them.

In addition, it looks like my fonts somehow get messed up: one of the fields is actually a barcode font. But, after going through PyPDF2 to make a copy with updated fields, the field that uses the barcode font in the original copy now uses one of the other fonts.

willingham commented 5 years ago

I'm experiencing the same click-to-reveal-text issue. Here are a few interesting things I have noticed.

mjl commented 5 years ago

it can't be run through the code a second time, as getFormFields returns an empty list.

For reference, I just stumbled on the same issue. The problem is that the generated pdf does not have an /AcroForm, and the easiest solution is probably to copy it over from the source file like this:

trailer = reader.trailer["/Root"]["/AcroForm"]
writer._root_object.update({
        NameObject('/AcroForm'): trailer
    })

Nivatius commented 5 years ago

@mjl can you elaborate how to implement those lines?

zoiiieee commented 4 years ago

Did anyone figure out a solution to set /NeedAppearances for a PDF with multiple pages?

sstamand commented 4 years ago

To include multiple pages in the output PDF, I added the pages from the template onto the output file....

if "/AcroForm" in pdf2._root_object:
    pdf2._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})
    pdf2.addPage(pdf.getPage(0))
    pdf2.updatePageFormFieldValues(pdf2.getPage(0), student_data)
    # the next two lines add the remaining pages from the template
    pdf2.addPage(pdf.getPage(1))
    pdf2.addPage(pdf.getPage(2))
    outputStream = open(cs_output, "wb")
    pdf2.write(outputStream)
    outputStream.close()

zoiiieee commented 4 years ago

To include multiple pages in the output PDF, I added the pages from the template onto the output file....

I tried the same thing but Need Appearances seems to apply only to the first page. All the fields on the second page are hidden until focused.

jeffneuen commented 4 years ago

Does anyone have a working fix for this issue for multi-page PDFs?

brunnurs commented 4 years ago

@mjl can you elaborate how to implement those lines?

You will have a pdf-reader reading in the original file and a pdf-writer creating the new PDF (see the code of @Tromar44 above). Now you simply need to "copy" over the AcroForm with the lines @mjl presented.

hchillon commented 4 years ago

From all those explanations I arrived (as brunnurs stated) at this code. It works for me. It fills text entries and checkboxes in multi-page PDFs, and all changes can be seen using any simple PDF reader.

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, TextStringObject


def set_need_appearances_writer(writer):
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add the "/NeedAppearances" attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer


class PdfFileFiller(object):

    def __init__(self, infile):
        self.pdf = PdfFileReader(open(infile, "rb"), strict=False)
        if "/AcroForm" in self.pdf.trailer["/Root"]:
            self.pdf.trailer["/Root"]["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})

    def update_form_values(self, outfile, newvals=None, newchecks=None):
        self.pdf2 = MyPdfFileWriter()

        # copy the AcroForm from the source file into the writer
        trailer = self.pdf.trailer["/Root"]["/AcroForm"]
        self.pdf2._root_object.update({
            NameObject('/AcroForm'): trailer})

        set_need_appearances_writer(self.pdf2)
        if "/AcroForm" in self.pdf2._root_object:
            self.pdf2._root_object["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})

        for i in range(self.pdf.getNumPages()):
            self.pdf2.addPage(self.pdf.getPage(i))
            self.pdf2.updatePageFormFieldValues(self.pdf2.getPage(i), newvals)
            self.pdf2.updatePageFormCheckboxValues(self.pdf2.getPage(i), newchecks)

        with open(outfile, 'wb') as out:
            self.pdf2.write(out)


class MyPdfFileWriter(PdfFileWriter):

    def __init__(self):
        super().__init__()

    def updatePageFormCheckboxValues(self, page, fields):
        for j in range(0, len(page['/Annots'])):
            writer_annot = page['/Annots'][j].getObject()
            for field in fields:
                if writer_annot.get('/T') == field:
                    #print('-------------------------------------')
                    #print('     FOUND', field)
                    #print(writer_annot.get('/V'))
                    writer_annot.update({
                        NameObject("/V"): NameObject(fields[field]),
                        NameObject("/AS"): NameObject(fields[field])
                    })


if __name__ == '__main__':

    origin = '900in.pdf'
    destination = '900out.pdf'
    newvals = {"IDETNCON[0]": "A123456T",
               "NOMSOL[0]": "ARTICA S.L."}
    newchecks = {"periodeliq1[0]": "/1"}

    c = PdfFileFiller(origin)
    c.update_form_values(outfile=destination,
                         newvals=newvals,
                         newchecks=newchecks)

hchillon commented 4 years ago

The last code fails for checkboxes in some PDF readers. I modified my MyPdfFileWriter class:

def updatePageFormCheckboxValues(self, page, fields):

    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        for field in fields:
            if writer_annot.get('/T') == field:
                if fields[field] in ('/1', '/Yes'): # You choose which values to use in your code
                    writer_annot.update({
                        NameObject("/V"): NameObject(fields[field]),
                        NameObject("/AS"): NameObject(fields[field])
                    })

giorgio-pap commented 4 years ago

I am still having issues in showing filled boxes outside of Adobe Acrobat.

from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

infile = "input.pdf"
outfile = "output.pdf"

pdf = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in pdf.trailer["/Root"]:
    pdf.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

pdf2 = PdfFileWriter()
set_need_appearances_writer(pdf2)
if "/AcroForm" in pdf2._root_object:
    pdf2._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

field_dictionary = {"iban1_part1": "DE", "Model": "Tacoma"}

pdf2.addPage(pdf.getPage(0))
pdf2.updatePageFormFieldValues(pdf2.getPage(0), field_dictionary)

outputStream = open(outfile, "wb")
pdf2.write(outputStream)

Some boxes are showing properly, some are not - when outside of Acrobat and I need to click on them to show the content.

I also did the same using pdfrw but I got stuck exactly at the same problem.

hchillon commented 4 years ago

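Hi, giorgio-pap. I'm using the code in a project that I'm developing in order to fill tax forms in Andorra. Because of your comment I have been testing the code and these are my results:

* A lot of problems with Adobe Acrobat 9.0 (last available version for Manjaro Linux)

* Good results with MasterPDF (https://code-industry.net/masterpdfeditor/)

* Good results with qpdfview (https://github.com/bendikro/qpdfview)

As I'm not a Windows user, I don't use Adobe PDF tools. MasterPDF and qpdfview are my best alternatives working with Linux. Can you test your code with these alternatives?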

hchillon commented 4 years ago

Hi again, giorgio-pap. Have you checked issue #545?

giorgio-pap commented 4 years ago

Hi, giorgio-pap. I'm using the code in a project that I'm developing in order to fill tax forms in Andorra. Because of your comment I have been testing the code and these are my results:

* A lot of problems with Adobe Acrobat 9.0 (Last available version for Manjaro Linux)

* Good results with MasterPDF (https://code-industry.net/masterpdfeditor/)

* Good results with qpdfview (https://github.com/bendikro/qpdfview)

As I'm not a Windows user, I don't use Adobe PDF tools. MasterPDF and qpdfview are my best alternatives working with Linux. Can you test your code with these alternatives?

Thanks a lot for your reply! Unfortunately, this script is meant to work for a whole company, so the output has to display consistently in all of the most common PDF readers, since I cannot require anyone to install anything.

SlawoKleeb commented 4 years ago

@ademidun That worked perfectly (I'd high five you right now if I could)! Thank you very much! For anyone else interested, the following worked for me:

from PyPDF2 import PdfFileWriter, PdfFileReader
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

infile = "myInputPdf.pdf"
outfile = "myOutputPdf.pdf"

pdf = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in pdf.trailer["/Root"]:
    pdf.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

pdf2 = PdfFileWriter()
set_need_appearances_writer(pdf2)
if "/AcroForm" in pdf2._root_object:
    pdf2._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)})

field_dictionary = {"Make": "Toyota", "Model": "Tacoma"}

pdf2.addPage(pdf.getPage(0))
pdf2.updatePageFormFieldValues(pdf2.getPage(0), field_dictionary)

outputStream = open(outfile, "wb")
pdf2.write(outputStream)

purrs like a kitten :-)

VoidIsEverywhere commented 4 years ago

I am still having issues in showing filled boxes outside of Adobe Acrobat.

Some boxes are showing properly, some are not - when outside of Acrobat and I need to click on them to show the content.

I also did the same using pdfrw but I got stuck exactly at the same problem.

I tried this code, but nothing appears in the default Linux PDF viewer, while all fields are visible in Adobe and if you open the file in Gmail on most platforms. On iPhones, however, it only shows some fields. I poked around a bit and I think it might have something to do with the PDF format, but I could not solve it with this Python tool. I found another that did the job and was viewable on all platforms tested: https://www.blog.pythonlibrary.org/2018/05/22/filling-pdf-forms-with-python/ The code I used is at the very bottom, titled "using the pdfforms package". The downside is that the code so far hasn't been run successfully on anything but Linux, and it doesn't tick checkboxes.

EricSamsonCarto commented 4 years ago

Might be a separate issue, but I am having a similar problem with PdfFileMerger(). After merging two PDFs together, one having filled forms, the filled form values do not carry over to the final merged version. However, the values do appear when clicking into one of the forms, weirdly enough. I was wondering if I could apply the above logic, but for PdfFileMerger() instead of PdfFileWriter(), but I'm not sure how to implement that. The append section of my code, simplified:

temp_pdf = r"path.pdf"
appendpdf = r"path.pdf"
merger = PdfFileMerger()
merger.append(PdfFileReader(temp_pdf))
merger.append(PdfFileReader(appendpdf))
merger.write(temp_pdf)
merger.close()

The temp_pdf is the one with forms; the appendpdf is typically an image. I'm writing the final merged PDF back to temp_pdf to overwrite it. That might be a problem, I'm not sure.

charalamm commented 4 years ago

Hello everyone! I tried @hchillon's code and it works fine for me. Thank you @hchillon for sharing it!!

I would like to note that the code does not do the job when the newvals dict has empty values. For example, newvals = {'something':'', 'smth2':'smth'} would again make the values appear only when the field is clicked. I am posting this for everyone who has a hard time figuring out why it doesn't work.
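
One possible workaround, going only by the behaviour described above, is to strip empty entries from the dict before handing it to the writer. `drop_empty_values` is a name made up for this sketch, and whether skipping the empty fields is acceptable depends on the form: fields that are left out keep whatever value the template already had instead of being cleared.

```python
def drop_empty_values(fields):
    """Return a copy of `fields` without entries whose value is '' or None.

    Hypothetical helper for illustration.
    """
    return {name: value for name, value in fields.items()
            if value not in ("", None)}


newvals = {"something": "", "smth2": "smth"}
print(drop_empty_values(newvals))  # {'smth2': 'smth'}
```

The filtered dict would then be passed as newvals in place of the original one.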

CTMBNara commented 3 years ago

In case this helps someone: I had the same issue, and the solution above didn't help for PDF Reader Pro or for the standard Preview function on macOS. After comparing several PDF files, the following helped me:

ap = NameObject('/AP')
for pageNumber in range(writer.getNumPages()):
    if '/Annots' not in writer.getPage(pageNumber):
        continue

    annotationsCount = len(writer.getPage(pageNumber)['/Annots'])
    for annotationNumber in range(annotationsCount):
        annotation = writer.getPage(pageNumber)['/Annots'][annotationNumber].getObject()
        if annotation['/FT'] == '/Tx' and\
                '/AP' in annotation and '/N' in annotation['/AP']:
            annotation[ap] = annotation['/AP']['/N']
ale-rt commented 3 years ago

I think the issue is related to the writer not being initialized properly. I resolved the issue copying some data from the reader, see:

#!/usr/bin/env python3
from PyPDF4.generic import NameObject
from PyPDF4.generic import TextStringObject
from PyPDF4.pdf import PdfFileReader
from PyPDF4.pdf import PdfFileWriter

import random
import sys

reader = PdfFileReader(sys.argv[1])

writer = PdfFileWriter()
# Try to "clone" the original one (note the library has cloneDocumentFromReader,
# but with it the rendered pdf is blank)
writer.appendPagesFromReader(reader)
writer._info = reader.trailer["/Info"]
reader_trailer = reader.trailer["/Root"]
writer._root_object.update(
    {
        key: reader_trailer[key]
        for key in reader_trailer
        if key in ("/AcroForm", "/Lang", "/MarkInfo")
    }
)

page = writer.getPage(0)

params = {"Foo": "Bar"}

# Inspired by updatePageFormFieldValues but also handle checkboxes
for annot in page["/Annots"]:
    writer_annot = annot.getObject()
    field = writer_annot["/T"]
    if writer_annot["/FT"] == "/Btn":
        value = params.get(field, random.getrandbits(1))
        if value:
            writer_annot.update(
                {
                    NameObject("/AS"): NameObject("/On"),
                    NameObject("/V"): NameObject("/On"),
                }
            )
    elif writer_annot["/FT"] == "/Tx":
        value = params.get(field, field)
        writer_annot.update(
            {
                NameObject("/V"): TextStringObject(value),
            }
        )

with open(sys.argv[2], "wb") as f:
    writer.write(f)

See also https://stackoverflow.com/a/66388344/646005

Dpats13 commented 3 years ago

After reading through this thread and trying many of the suggested solutions above, I still was getting strange behavior when previewing the PDF in an application that was not dedicated to viewing / editing PDFs (ex. mobile email client). The PDF would display without showing any of the filled form fields. After piecing together a few solutions mentioned above, I realized that the order is critical in getting the correct behavior. Here is the solution I am using today:

def _set_need_appearances_writer(writer: PdfFileWriter):
    """
    Enables PDF filled form values to be visible on the final PDF results

    NOTE: See 12.7.2 and 7.7.2 for more information:
    http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    """
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)
            })

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        # del writer._root_object["/AcroForm"]['NeedAppearances']
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

def _lock_form_fields(cls, page):
    """
    Locks all form fields on the given PyPdf2 Page object
    """
    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        if writer_annot.get('/T'):
            writer_annot.update({
                NameObject("/Ff"): NumberObject(1)
            })

Putting it all together

def _init_pdf_writer_from_reader(cls, reader: PdfFileReader) -> PdfFileWriter:
    """
    Initializes a PdfFileWriter that can be used to write data to the given PDF
    stored inside of the PdfFileReader.

    IMPORTANT: Using this init function ensures that the data written is visible
               both in a PDF Viewer Application and in a Preview context (i.e. an email client)
    """
    if not reader or reader.getNumPages() == 0:
        raise Exception(f"Error initializing PdfFileWriter, given PdfFileReader "
                        f"is either null or contains no pages.")

    pdf_writer = PdfFileWriter()

    # Add all PDF pages from reader -> writer
    pdf_writer.appendPagesFromReader(reader)

    # Copy over additional data from reader -> writer
    pdf_writer._info = reader.trailer["/Info"]
    reader_trailer = reader.trailer["/Root"]
    pdf_writer._root_object.update(
        {
            key: reader_trailer[key]
            for key in reader_trailer
            if key in ("/AcroForm", "/Lang", "/MarkInfo")
        }
    )

    # Set written data appearances to be visible
    cls._set_need_appearances_writer(pdf_writer)
    if "/AcroForm" in pdf_writer._root_object:
        pdf_writer._root_object["/AcroForm"].update(
            {NameObject("/NeedAppearances"): BooleanObject(True)})

    return pdf_writer

By initializing the PDF writer correctly, we ensure that the data written to the PDF's form fields is visible without having to click the field in a PDF viewer application. We also ensure it is visible in applications that are not dedicated PDF viewers, which matters when you cannot know which app your client / end-user will use to view the PDF. Lastly, I included a method to lock the fields on a given PDF page so that they are no longer editable by your end-user (if this is the desired behavior).
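
A note on the /Ff value used in _lock_form_fields: /Ff is a bit field, and read-only is its lowest bit (value 1). Writing NumberObject(1) replaces the whole value, so any other flags the field had (for example multiline on a text field) are dropped. Below is a minimal sketch of keeping them, assuming the existing value has already been read as a plain int. with_read_only and the constant names are made up for illustration and are not part of PyPDF2.

```python
READ_ONLY = 1 << 0    # bit position 1 in the PDF spec's field-flag table
MULTILINE = 1 << 12   # bit position 13, text fields only


def with_read_only(existing_flags=0):
    """Return existing_flags with the read-only bit set, other bits untouched."""
    return int(existing_flags) | READ_ONLY


print(with_read_only(0))          # 1
print(with_read_only(MULTILINE))  # 4097
```

In _lock_form_fields this would presumably become NumberObject(with_read_only(writer_annot.get('/Ff', 0))). Keep in mind that /Ff can also be inherited from a parent field, which this sketch does not look at.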

Thanks to @ale-rt and many others above.

apteryxlabs commented 3 years ago

@Dpats13 is your code part of a broader object definition? I'm thrown by the cls args.

Dpats13 commented 3 years ago

@apteryxlabs ya, you can ignore those.

Tromar44 commented 3 years ago

@Dpats13 I'd like to implement your solution, but my Python/programming skills are not great. Do you mind posting some working code, assuming variables similar to those below?

infile = 'myInputPdf.pdf'
outfile = 'myOutputPdf.pdf'
field_dictionary = {'foo':'bar'}

One of the old solutions above, offered by @ademidun via @Tromar44, still works well for me for filling PDF forms and viewing them, but programmatically reading the content of those filled forms afterwards (e.g. with PyPDF2 or pdfminer) returns empty fields. That is, I can manually open the PDF and see the content of those fields without clicking them, but reading them via Python returns empty fields. If I manually open the PDF and save it before closing it, then I am able to read the fields programmatically. Any demo/example of your solution would be greatly appreciated - thanks!

lymanjohnson commented 3 years ago

If anyone is having issues writing to RadioGroup fields, here is my code that successfully updates TextFields, ListBoxes, RadioGroups, and Checkboxes.

def fill_pdf_form(infile, outfile, field_dictionary):
    inputStream = open(infile, "rb")
    pr = PdfFileReader(inputStream, strict=False)
    if "/AcroForm" in pr.trailer["/Root"]:
        pr.trailer["/Root"]["/AcroForm"].update({NameObject("/NeedAppearances"): BooleanObject(True)})
    pw = PdfFileWriter()
    set_need_appearances_writer(pw)
    if "/AcroForm" in pw._root_object:
        pw._root_object["/AcroForm"].update({NameObject("/NeedAppearances"): BooleanObject(True)})
    for pageNum in range(pr.numPages):
        pw.addPage(pr.getPage(pageNum))
        pw.updatePageFormFieldValues(pw.getPage(pageNum), field_dictionary)
    if "/AcroForm" in pr.trailer["/Root"]:
        pw._root_object.update({NameObject('/AcroForm'): pr.trailer["/Root"]["/AcroForm"]})
    # This next part manually updates RadioGroup items, which aren't updated by PyPDF2's updatePageFormFieldValues()
    for pageNum in range(pw.getNumPages()):
        page = pw.getPage(pageNum)
        annots = page['/Annots']
        for j in range(0, len(annots)):
            writer_annot = page['/Annots'][j].getObject()
            if writer_annot.get('/T') is None:
                parent_ido = writer_annot.get('/Parent')
                if parent_ido:
                    parent_obj = parent_ido.getObject()
                    radiogroup_name = parent_obj.get('/T')
                    if radiogroup_name:
                        for field in field_dictionary:
                            if field == radiogroup_name:
                                parent_obj.update({NameObject("/V"): NameObject('/{}'.format(field_dictionary[field])), }) 
    outputStream = open(outfile, "wb")
    pw.write(outputStream)
    inputStream.close()
    outputStream.close()
EricSamsonCarto commented 3 years ago

@lymanjohnson This works for PDFs with multiple pages? I had something similar that was still failing on multiple pages.

MartinThoma commented 2 years ago

I still see this issue in the Atril document viewer for test_fill_form.

MartinThoma commented 2 years ago

Apparently some people had luck with

# Set /NeedAppearances
writer.set_need_appearances_writer()

# Make it read-only with /Ff:
writer.updatePageFormFieldValues(writer.getPage(0), {"foo": "some filled in text"}, flags=1)

However, at least with Evince this doesn't work. And the Google Chrome PDF viewer always shows the filled fields.

mjl commented 2 years ago

I see quite a few comments that /NeedAppearances makes it work, but I'm sorry, that is not a general solution. This hints to the reader app that it needs to do some work to render correct form fields, but there are a lot of reader apps out there that do not honor that flag or do it badly.
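
To see how much a given file depends on that hint, one can list the fields that have a value but no normal appearance stream. The sketch below only reads keys, so it is written against plain mappings; I am assuming it would be fed already-resolved annotation dictionaries (e.g. page['/Annots'][i].getObject()). fields_without_appearance is a hypothetical name, and the check cannot tell whether an existing stream is stale.

```python
def fields_without_appearance(annotations):
    """Return the /T names of annotations that carry /V but no /AP /N entry."""
    missing = []
    for annot in annotations:
        if "/V" not in annot:
            continue  # nothing was filled in, so nothing needs rendering
        # Subscript instead of .get() so PDF libraries can resolve references
        appearance = annot["/AP"] if "/AP" in annot else None
        if appearance is None or "/N" not in appearance:
            missing.append(annot.get("/T"))
    return missing


# Plain dicts standing in for resolved annotation dictionaries.
sample = [
    {"/T": "name", "/V": "Alice"},
    {"/T": "city", "/V": "Oslo", "/AP": {"/N": object()}},
    {"/T": "empty"},
]
print(fields_without_appearance(sample))  # ['name']
```

Any field this reports will only show its value in viewers that regenerate appearances themselves.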

What one really needs to do is go over all the form fields and render them (i.e. text input field -> add an appearance stream /AP that renders the entered value; checkboxes -> add an appearance state /AS that shows the field checked; other field types probably need even more work, which I have not investigated because I have not needed them so far).
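
For a single-line text field, the content of such an appearance stream can be quite small. The following is a sketch of the kind of content found in Acrobat-generated forms, not something PyPDF2 produces. The font resource name /Helv, the size and the offsets are assumptions that would normally come from the field's /DA entry and /Rect, and both function names are hypothetical. The code only builds the bytes; wrapping them in a form XObject with /BBox and /Resources and attaching it under /AP /N is left out, and characters outside Latin-1 are not handled.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def text_appearance_content(text, font="/Helv", size=12, x=2, y=4):
    """Build content-stream bytes that draw `text` inside a text field."""
    ops = (
        "/Tx BMC\n"            # begin marked content for variable text
        "q\n"                  # save graphics state
        "BT\n"                 # begin text object
        f"{font} {size} Tf\n"  # select font and size
        "0 g\n"                # black fill colour
        f"{x} {y} Td\n"        # move to the text origin
        f"({escape_pdf_string(text)}) Tj\n"
        "ET\n"
        "Q\n"
        "EMC\n"
    )
    return ops.encode("latin-1")


print(text_appearance_content("some filled in text").decode("latin-1"))
```

Escaping matters because an unbalanced parenthesis in the value would otherwise end the string early and corrupt the stream.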

What I ended up doing was inspecting Acrobat-generated forms and emulating them. I think I used qpdf --show-object to dissect the PDFs.

This comment helped me a lot to get started: https://github.com/pmaupin/pdfrw/issues/84#issuecomment-445303928

MartinThoma commented 2 years ago

Summarizing some ideas:

  1. We might need to set / adjust some annotations, see https://github.com/py-pdf/PyPDF2/issues/546#issuecomment-1179322743
  2. We might need to set / write the Fields dictionary
  3. The /NeedAppearances seems not to help as it was added via writer.set_need_appearances_writer()
  4. @hchillon is convinced the issue is that we need to set "fully qualified field name" : https://github.com/py-pdf/PyPDF2/issues/545#issue-603760149
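
On the last point: in the PDF spec, a field's fully qualified name is its own partial name /T prefixed by the /T of each ancestor reached through /Parent, joined with periods. Here is a rough sketch of computing it, using plain dicts as stand-ins for field dictionaries. With PyPDF2 objects, subscripting (field['/Parent']) resolves indirect references, which is why the code avoids .get for /Parent. qualified_name is a hypothetical helper, not a library function.

```python
def qualified_name(field):
    """Join the /T partial names from the root ancestor down to `field`."""
    parts = []
    node = field
    seen = set()
    while node is not None and id(node) not in seen:
        seen.add(id(node))  # guard against a malformed /Parent cycle
        if "/T" in node:
            parts.append(str(node["/T"]))
        node = node["/Parent"] if "/Parent" in node else None
    return ".".join(reversed(parts))


root = {"/T": "form1[0]"}
page = {"/T": "Page1[0]", "/Parent": root}
leaf = {"/T": "NOMSOL[0]", "/Parent": page}
print(qualified_name(leaf))  # form1[0].Page1[0].NOMSOL[0]
```

A widget without its own /T (as with radio button kids) simply yields the name of its parent.
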
fidoriel commented 2 years ago

Yesterday I took a deep dive into the PDF standard. I am 99% confident that this issue originates, like @mjl said, in a missing appearance stream. Today I was able to append an appearance stream to the form field. The content of the filled field is now visible in Acrobat, SumatraPDF, Okular, Chrome, Firefox and Edge, and it also prints. Now I am experiencing issues with the layout of the text and with special characters. I hope that I can solve them soon to be able to submit a PR.