py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Working with checkboxes with /Kids or a strange /V #961

Closed Luisonson closed 1 year ago

Luisonson commented 2 years ago

I'm trying to automate filling this PDF: TEMPORAL COMPLETO12 de mayo_unlocked.pdf

I have no problem with the text fields, but I can't get the checkboxes to work. Many /Btn fields have /Kids, and those kids are other checkboxes that show up only as IndirectObject. I also can't select or modify the plain checkboxes in this PDF (examples below).

Code

This example was written for the pypdf2 1.26.0 version

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject
from collections import OrderedDict

def set_need_appearances_writer(writer: PdfFileWriter):
    # See 12.7.2 and 7.7.2 for more information: http://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
    try:
        catalog = writer._root_object
        # get the AcroForm tree
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)
            })

        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        # del writer._root_object["/AcroForm"]['NeedAppearances']
        return writer

    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer

reader = PdfFileReader("TEMPORAL COMPLETO12 de mayo_unlocked.pdf")
writer = PdfFileWriter()

set_need_appearances_writer(writer)

page = reader.pages[0]

writer.addPage(page)

#Texto4 works, but not the checkboxes
writer.updatePageFormFieldValues(
    writer.getPage(0), {'BOTON_TIPOJORNADA': '/1',
                        'BOTON_JORN': '/S',
                        'Texto4': 'Texto4'
                        }
)
with open("filled-out.pdf", "wb") as output_stream:
    writer.write(output_stream)
reader.stream.close()

If I modified the pdf manually and read the fields...:

reader.getFields()

OUTPUT (one checkbox selected):
[...]

'BOTON_JORN': {'/FT': '/Btn',
  '/Kids': [IndirectObject(160, 0),
   IndirectObject(162, 0),
   IndirectObject(167, 0),
   IndirectObject(172, 0)],
  '/T': 'BOTON_JORN',
  '/Ff': 49152,
  '/V': '/S'},

OUTPUT (another checkbox selected):
[...]

'BOTON_JORN': {'/FT': '/Btn',
  '/Kids': [IndirectObject(160, 0),
   IndirectObject(162, 0),
   IndirectObject(167, 0),
   IndirectObject(172, 0)],
  '/T': 'BOTON_JORN',
  '/Ff': 49152,
  '/V': '/D'},
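A side note on the '/Ff': 49152 entry in both dumps: going by the button field flags table in the PDF 32000-1:2008 specification (the document linked in the code above), that value can be decoded bit by bit. The sketch below is only an illustration; the helper name decode_button_flags is made up here and is not part of PyPDF2.

```python
from typing import List

# Button field flags from PDF 32000-1:2008. The spec numbers bits from 1,
# so "bit 15" corresponds to 1 << 14.
BUTTON_FLAGS = {
    1 << 14: "NoToggleToOff",   # bit 15
    1 << 15: "Radio",           # bit 16
    1 << 16: "Pushbutton",      # bit 17
    1 << 25: "RadiosInUnison",  # bit 26
}


def decode_button_flags(ff: int) -> List[str]:
    """Return the names of the button flags that are set in an /Ff value."""
    return [name for bit, name in BUTTON_FLAGS.items() if ff & bit]


print(decode_button_flags(49152))  # ['NoToggleToOff', 'Radio']
```

If that reading is right, BOTON_JORN is a radio button group rather than a set of independent checkboxes, which would fit the parent holding a single /V while the visible widgets live in /Kids.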

Another checkbox, which has no /Kids but which I still can't select or modify, is 'TEXTOCasilla de verificación25'. When selected, it has the value '/S#ED':

'TEXTOCasilla de verificación25': {'/FT': '/Btn',
  '/T': 'TEXTOCasilla de verificación25',
  '/V': '/S#ED'},
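The '#ED' part of that value is most likely not corruption. In PDF name syntax, '#' followed by two hex digits stands for a single byte, so '/S#ED' and '/S#ed' spell the same name. A minimal sketch of the decoding follows; decode_pdf_name is a hypothetical helper, and interpreting the byte as Latin-1 is an assumption made for this example.

```python
import re


def decode_pdf_name(raw: str) -> str:
    """Decode #xx hex escapes in a PDF name as it is written in the file."""

    def replace_escape(match: "re.Match[str]") -> str:
        # Each escape is exactly one byte; Latin-1 is assumed for display.
        return bytes([int(match.group(1), 16)]).decode("latin-1")

    return re.sub(r"#([0-9A-Fa-f]{2})", replace_escape, raw)


# Upper- and lower-case hex digits decode to the same name.
print(decode_pdf_name("/S#ED") == decode_pdf_name("/S#ed"))  # True
```

Byte 0xED is 'í' in Latin-1, so under that assumption the state name reads '/Sí', which is Spanish for "yes".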

Thanks for your time.

PDF

TEMPORAL COMPLETO12 de mayo_unlocked.pdf

MartinThoma commented 2 years ago

Thank you for your bug report!

Would you mind sharing your PyPDF2 version + the environment you're using? (It's part of the bug ticket template)

I have no problem with the text, but with the checkboxes there is no way.

What does that mean? There is no way to do what?

MartinThoma commented 2 years ago

@Luisonson Have you seen https://pypdf2.readthedocs.io/en/latest/user/forms.html#filling-out-forms ? Does that help? If not, why?

Luisonson commented 2 years ago

> Thank you for your bug report!
>
> Would you mind sharing your PyPDF2 version + the environment you're using? (It's part of the bug ticket template)
>
> I have no problem with the text, but with the checkboxes there is no way.
>
> What does that mean? There is no way to do what?

Hello,

Thanks for your answer. I'm using Python 3.8.8 with PyPDF2 2.1.0. My IDE is Spyder 5.1.5.

I can't select (tick) or deselect the checkboxes. Also, some checkboxes appear only as /Kids of another field, so I can't interact with them. For example, BOTON_JORN has 4 /Kids, which are 4 more checkboxes, and the only thing I know about them is that they are IndirectObject(X, 0).

> @Luisonson Have you seen https://pypdf2.readthedocs.io/en/latest/user/forms.html#filling-out-forms ? Does that help? If not, why?

Yes, part of the code I pasted is from there, but it does not work for the checkboxes in this PDF.

MartinThoma commented 2 years ago

Is the problem that it's not shown? So maybe #227 / #355 ?

Luisonson commented 2 years ago

Another hint, using this PDF (it is just page 5 of the previous PDF): filled-out_5.pdf

If I only update the text boxes, everything is fine. But if I also try to update the checkboxes (unsuccessfully), the text in the text boxes is not shown unless I click into the box. New code:

Updating two text boxes

These examples were written for PyPDF2 2.1.0.

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("filled-out_5.pdf")
writer = PdfWriter()
page = reader.pages[0]
fields3 = reader.get_fields()

writer.add_page(page)

writer.update_page_form_field_values(
    writer.getPage(0), {"Texto41": "Test38",
                        "Texto56": "Test2"}
)
with open("filled-out_5_out.pdf", "wb") as output_stream:
    writer.write(output_stream)
reader.stream.close()

Updating two textboxes and trying to update one checkbox (the bug of the text not showing appears)

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader("filled-out_5.pdf")
writer = PdfWriter()
page = reader.pages[0]
fields3 = reader.get_fields()

writer.add_page(page)

writer.update_page_form_field_values(
    writer.getPage(0), {"Texto41": "Test38",
                        "Texto56": "Test2"}
)
writer.update_page_form_field_values(
    writer.getPage(0), {"BOTON_TPCON1": "/540"}
)

# Write the output to filled-out_5_out.pdf
with open("filled-out_5_out.pdf", "wb") as output_stream:
    writer.write(output_stream)
reader.stream.close()

There is also another error. After the new file is saved, if you try to read its fields with:

reader = PdfReader("filled-out_5_out.pdf")
reader.get_fields()

no fields are shown. I have to open the PDF in Adobe and save it from there; only then does this code work.

Luisonson commented 2 years ago

> Is the problem that it's not shown? So maybe #227 / #355 ?

No. Previously I was using PyPDF2 1.26, and the function in my first message (set_need_appearances_writer) was there to mitigate that issue. With PyPDF2 2.1.0 that function is not needed... until you try to modify a checkbox, as described in my previous message :(

MartinThoma commented 2 years ago

Oh, so it is a regression? It was working with 1.26 and now it is not working anymore with 2.1.0?

I'll have a closer look today evening after work :-)

Luisonson commented 2 years ago

> Oh, so it is a regression? It was working with 1.26 and now it is not working anymore with 2.1.0?
>
> I'll have a closer look today evening after work :-)

I'm sorry, maybe I'm mixing things up. There are several problems. On the one hand, I have problems with the checkboxes (in both versions). On the other hand, there is the problem of the text not showing unless I select the text box. This second problem only appears in 2.1.0 when I try to change a checkbox, and the code that solved it in 1.26 does not seem to solve it in 2.1.0. Please use the last code I pasted; I think it will be clearer than my poor explanation.

MartinThoma commented 2 years ago

I'll post a series of comments here to keep track / let people know how I investigate the issue.

# Split, so that we only have one page to care about
$ qpdf --split-pages=1 TEMPORAL.COMPLETO12.de.mayo_unlocked.pdf out.pdf

# Uncompress so that I can view it in an editor
$ qpdf --stream-data=uncompress out-01.pdf uncompressed-1.pdf

That gives uncompressed-1.pdf

MartinThoma commented 2 years ago

Next I used PyPDF2 to find the form fields and their names. I looked for /Btn and found TEXTOCasilla de verificación25.

Before filling it:

<< /AP
<< /D
<< /Off 124 0 R /S#ed 125 0 R >> /N
<< /S#ed 126 0 R >> >>
/AS /Off
/DA (/ZaDb 0 Tf 0 0 1 rg) /F 4 /FT /Btn /MK
<< /CA (8) >> /P 3 0 R /Rect [ 51.3755 235.625 63.0763 248.636 ]
/Subtype /Widget /T (TEXTOCasilla de verificación25) /Type /Annot >>

After:

<< /AP
<< /D
<< /Off 171 0 R /S#ed 172 0 R >> /N
<< /S#ed 173 0 R >> >>
/AS /S#ed
/DA (/ZaDb 0 Tf 0 0 1 rg) /F 4 /FT /Btn /MK
<< /CA (8) >> /P 3 0 R /Rect [ 51.3755 235.625 63.0763 248.636 ]
/Subtype /Widget /T (TEXTOCasilla de verificación25) /Type /Annot
/V /S#ed >>

I notice two differences:

  1. /AS /Off changed to /AS /S#ed
  2. /V /S#ed was added.

MartinThoma commented 2 years ago

@Luisonson This ticks one checkbox:

from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import NameObject
from typing import Dict

def update_checkbox_values(page, fields: Dict[str, bool]): 
    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        field_name = writer_annot.get('/T')
        if field_name in fields:
            print(f"Found {field_name}")
            assert writer_annot.get('/FT') == '/Btn'
            print(writer_annot)
            if fields[field_name]:
                print("\tCheck it")
                writer_annot.update({
                    NameObject("/V"): NameObject("/S#ed"),
                    NameObject("/AS"): NameObject("/S#ed"),
                })
                for key in writer_annot:
                    print((key, writer_annot[key]))
            else:
                writer_annot.update({
                    NameObject("/V"): NameObject("/Off"),
                    NameObject("/AS"): NameObject("/Off")
                })

reader = PdfReader("TEMPORAL.COMPLETO12.de.mayo_unlocked.pdf")

# See which fields exist
fields = reader.get_form_text_fields()
print(fields)

writer = PdfWriter()
writer.set_need_appearances_writer()
writer.add_page(reader.pages[0])
update_checkbox_values(writer.pages[0], {"TEXTOCasilla de verificación25": True})

with open("filled-out.pdf", "wb") as output_stream:
    writer.write(output_stream)

Does this help?
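The snippet above hard-codes "/S#ed" as the on-state. Since the on-state name differs from widget to widget, it could instead be read from the widget's /AP dictionary. The following is a rough sketch, assuming the annotation behaves like a mapping (a plain dict stands in for it here); on_state_names is a hypothetical helper, not a PyPDF2 function.

```python
from typing import Any, List, Mapping


def on_state_names(annot: Mapping[str, Any]) -> List[str]:
    """List the appearance states other than /Off under /AP /N and /AP /D."""
    ap = annot["/AP"] if "/AP" in annot else {}
    states: List[str] = []
    for sub in ("/N", "/D"):
        sub_dict = ap[sub] if sub in ap else {}
        for key in sub_dict:
            name = str(key)
            if name != "/Off" and name not in states:
                states.append(name)
    return states


# Plain-dict stand-in for the widget dumped earlier; the stream references
# are replaced by placeholder strings.
widget = {
    "/AP": {
        "/D": {"/Off": "124 0 R", "/S#ed": "125 0 R"},
        "/N": {"/S#ed": "126 0 R"},
    },
    "/AS": "/Off",
}
print(on_state_names(widget))  # ['/S#ed']
```

Item access with [] is used instead of .get() so that, with PyPDF2 dictionary objects, indirect references would be resolved on the way down; that part is an assumption and has not been checked against this particular PDF.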

Luisonson commented 2 years ago

Good morning. Thanks for your time and effort. We are getting closer. With page 5, for example: https://github.com/py-pdf/PyPDF2/files/8861867/filled-out_5.pdf

reader = PdfReader("filled-out_5.pdf")

# See which fields exist
fields = reader.getFields()
print(fields)

OUTPUT:
{'TEXTOCasilla de verificación555': {'/FT': '/Btn',
  '/T': 'TEXTOCasilla de verificación555'},
 'BOTON_TPCON1': {'/FT': '/Btn',
  '/Kids': [IndirectObject(55, 0), IndirectObject(1586, 0)],
  '/T': 'BOTON_TPCON1',
  '/Ff': 49152,
  '/V': '/450401'},
 'Texto56': {'/FT': '/Tx', '/T': 'Texto56'},
 'Texto41': {'/FT': '/Tx', '/T': 'Texto41'},
 'BOTON_INT1': {'/FT': '/Btn',
  '/Kids': [IndirectObject(1597, 0),
   IndirectObject(1599, 0),
   IndirectObject(1604, 0),
   IndirectObject(1609, 0),
   IndirectObject(1614, 0),
   IndirectObject(1619, 0),
   IndirectObject(1624, 0),
   IndirectObject(1629, 0),
   IndirectObject(1634, 0),
   IndirectObject(1639, 0),
   IndirectObject(1644, 0),
   IndirectObject(1649, 0),
   IndirectObject(1654, 0)],
  '/T': 'BOTON_INT1',
  '/Ff': 49152},
 'BOTON_INT1357': {'/FT': '/Btn', '/T': 'BOTON_INT1357', '/Ff': 49152},
 'BOTON_INT166': {'/FT': '/Btn', '/T': 'BOTON_INT166', '/Ff': 49152}}

I need to change BOTON_TPCON1 from /450401 to /540. But with your example, writer.pages[0]['/Annots'][X].getObject().get('/T') only detects: Texto56, Texto41, BOTON_INT1357, BOTON_INT166.

so....
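One possible explanation is that the kid widgets of BOTON_TPCON1 carry no /T of their own and only point to the group through /Parent, so a lookup by /T on the page annotations never sees them. Under that assumption, a group could be switched by matching on the parent's /T instead. The sketch below works on plain dicts; select_radio_option and its make_name parameter are hypothetical, the kid dictionaries are invented stand-ins, and none of this has been run against the PDF from this issue.

```python
from typing import Any, Callable, Iterable, List


def select_radio_option(
    annots: Iterable[Any],
    group_name: str,
    state: str,
    make_name: Callable[[str], Any] = str,
) -> bool:
    """Switch the kids of `group_name` so that only `state` is on.

    Returns False, and changes nothing, if no kid offers `state` in /AP /N.
    With PyPDF2, pass make_name=NameObject so names are written as PDF names.
    """
    kids: List[Any] = []
    for ref in annots:
        # Indirect references are resolved; plain dicts are used as they are.
        widget = ref.get_object() if hasattr(ref, "get_object") else ref
        if "/Parent" not in widget:
            continue
        parent = widget["/Parent"]
        if "/T" in parent and parent["/T"] == group_name:
            kids.append(widget)

    def offers(widget: Any) -> bool:
        return (
            "/AP" in widget
            and "/N" in widget["/AP"]
            and state in widget["/AP"]["/N"]
        )

    if not any(offers(kid) for kid in kids):
        return False
    for kid in kids:
        kid[make_name("/AS")] = make_name(state if offers(kid) else "/Off")
    kids[0]["/Parent"][make_name("/V")] = make_name(state)
    return True


# Invented stand-ins: a parent with two kids, one per state.
parent = {"/T": "BOTON_TPCON1", "/V": "/450401"}
kid_a = {"/Parent": parent, "/AS": "/450401", "/AP": {"/N": {"/450401": 0, "/Off": 0}}}
kid_b = {"/Parent": parent, "/AS": "/Off", "/AP": {"/N": {"/540": 0, "/Off": 0}}}
print(select_radio_option([kid_a, kid_b], "BOTON_TPCON1", "/540"), parent["/V"])
```

The two-pass structure is deliberate: the kids are only modified once a kid offering the requested state has been found, so a mistyped state name does not switch the whole group off.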

On the other hand, yesterday someone told me about FDF files. An FDF file is an ASCII template (easy to modify) that you open and merge with the PDF, and the PDF picks up the values from the FDF file. Is PyPDF2 capable of handling FDF files? If not, it would be a nice feature to add.

MartinThoma commented 2 years ago

I've seen fdf being mentioned somewhere, but I have no experience with it.

I'm open to PRs, but I also need to check if adding fdf support is in scope for PyPDF2.

Luisonson commented 2 years ago

For example, in my case, the FDF file for changing some values on the first page is:

%FDF-1.2
%âãÏÓ
1 0 obj
<</FDF<</F(TEMPORAL COMPLETO12 de mayo_unlocked_borrar1.pdf)/Fields[
<</T(BOTON_BON1)/V/Off>>
<</T(BOTON_CLA1)/V/Off>>
<</T(BOTON_CLA13)/V/Off>>
<</T(BOTON_CLA166)/V/Off>>
<</T(BOTON_DISBON)/V/Off>>
<</T(BOTON_DISC1)/V/Off>>
<</T(BOTON_EX44)/V/Off>>
<</T(BOTON_EXCL)/V/Off>>
<</T(BOTON_INS)/V/Off>>
<</T(BOTON_INT1)/V/Off>>
<</T(BOTON_INT1357)/V/Off>>
<</T(BOTON_INT166)/V/Off>>
<</T(BOTON_INVEMP)/V/Off>>
<</T(BOTON_INVEMP2)/V/Off>>
<</T(BOTON_INVEMP266)/V/Off>>
<</T(BOTON_INVEMP266332)/V/Off>>
<</T(BOTON_INVEMP999)/V/Off>>
<</T(BOTON_INVEMP999635)/V/Off>>
<</T(BOTON_INVEMP9997895)/V/Off>>
<</T(BOTON_INVEMP99988)/V/Off>>
<</T(BOTON_INVTIPO)/V/Off>>
<</T(BOTON_INVTIPO11)/V/Off>>
<</T(BOTON_INVTIPO117)/V/Off>>
<</T(BOTON_ISOC8962)/V/Off>>
<</T(BOTON_JORN)/V/S>>
<</T(BOTON_JORNasdf)/V/Off>>
<</T(BOTON_JORNcvbm)/V/D>>
<</T(BOTON_MAY)/V/Off>>
<</T(BOTON_MODAL1)/V/Off>>
<</T(BOTON_OTR)/V/Off>>
<</T(BOTON_REL2)/V/Off>>
<</T(BOTON_TIPOJORNADA)/V/1>>
<</T(BOTON_TPCON1)/V/Off>>
<</T(BOTON_TPCON100)/V/Off>>
<</T(BOTON_TPCON1006)/V/Off>>
<</T(BOTON_TPCON12)/V/Off>>
<</T(BOTON_TPCON196)/V/Off>>
<</T(BOTON_TPCON1969)/V/Off>>
<</T(BOTON_TPCON1985)/V/Off>>
<</T(BOTON_TPCON198745)/V/Off>>
<</T(BOTON_TPCON199)/V/Off>>
<</T(BOTON_VICT)/V/Off>>
<</T(ID_EMPR)/V(16083466A)>>
<</T(TEXTO Casilla de verificación 480)/V/Off>>
<</T(TEXTO Casilla de verificación 481)/V/Off>>
<</T(TEXTO20369)/V/Off>>
<</T(TEXTOCasilla de verificación106666)/V/Off>>
<</T(TEXTOCasilla de verificación12)/V/Off>>
<</T(TEXTOCasilla de verificación13)/V/Off>>
<</T(TEXTOCasilla de verificación25)/V/S#ED>>
<</T(TEXTOCasilla de verificación285)/V/Off>>
<</T(TEXTOCasilla de verificación2853)/V/Off>>
<</T(TEXTOCasilla de verificación28999)/V/Off>>
<</T(TEXTOCasilla de verificación289996)/V/Off>>
<</T(TEXTOCasilla de verificación3221)/V/Off>>
<</T(TEXTOCasilla de verificación32369)/V/Off>>
<</T(TEXTOCasilla de verificación327)/V/Off>>
<</T(TEXTOCasilla de verificación32987)/V/Off>>
<</T(TEXTOCasilla de verificación3299)/V/Off>>
<</T(TEXTOCasilla de verificación369877)/V/Off>>
<</T(TEXTOCasilla de verificación4)/V/Off>>
<</T(TEXTOCasilla de verificación43)/V/Off>>
<</T(TEXTOCasilla de verificación43968)/V/Off>>
<</T(TEXTOCasilla de verificación5)/V/Off>>
<</T(TEXTOCasilla de verificación51)/V/Off>>
<</T(TEXTOCasilla de verificación5189)/V/Off>>
<</T(TEXTOCasilla de verificación518977)/V/Off>>
<</T(TEXTOCasilla de verificación555)/V/Off>>
<</T(TEXTOCasilla de verificación6)/V/Off>>
<</T(TEXTOCasilla de verificación62)/V/Off>>
<</T(TEXTOCasilla de verificación622222)/V/Off>>
<</T(TEXTOCasilla de verificación626)/V/Off>>
<</T(TEXTOCasilla de verificación64)/V/Off>>
<</T(TEXTOCasilla de verificación65)/V/Off>>
<</T(TEXTOCasilla de verificación66)/V/Off>>
<</T(TEXTOCasilla de verificación661)/V/Off>>
<</T(TEXTOCasilla de verificación69)/V/Off>>
<</T(TEXTOCasilla de verificación6911)/V/Off>>
<</T(TEXTOCasilla de verificación7)/V/Off>>
<</T(TEXTOCasilla de verificación72)/V/Off>>
<</T(TEXTOCasilla de verificación7222)/V/Off>>
<</T(TEXTOCasilla de verificación723)/V/Off>>
<</T(TEXTOCasilla de verificación726)/V/Off>>
<</T(TEXTOCasilla de verificación8)/V/Off>>
<</T(TEXTOCasilla de verificación91)/V/Off>>
<</T(TEXTOCasilla de verificación911)/V/Off>>
<</T(TEXTOCasilla de verificación95555)/V/Off>>
<</T(Textocasilla de verificación3)/V/Off>>
<</T(Textocasilla de verificación30)/V/Off>>]
/ID[<25F5DFD17199935FF41213A08FEAFF84><9F88950AEDB5B44BBCEF4494778262B8>]
/UF(TEMPORAL COMPLETO12 de mayo_unlocked_borrar1.pdf)>>/Type/Catalog>>
endobj
trailer
<</Root 1 0 R>>
%%EOF

As you can see, it is quite simple and self-explanatory. BUT PyPDF2 would have to be capable of updating any value. To open the FDF file and merge it with the PDF I'm using pdftk, which is an old (9 years) exe... but it does the job.
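Because the format is this regular, an FDF file like the one above can be generated with a few lines of plain Python. This is a minimal sketch, not a complete FDF writer: build_fdf is a hypothetical helper, it leaves out the /F, /ID and /UF entries on the assumption that the consumer does not need them, and it treats every value starting with "/" as a button state name.

```python
from typing import Mapping


def _escape(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def build_fdf(fields: Mapping[str, str]) -> bytes:
    """Build a minimal FDF document from field names and values.

    Values starting with "/" are written as PDF names (button states);
    all other values are written as literal strings (text fields).
    """
    lines = ["%FDF-1.2", "%\u00e2\u00e3\u00cf\u00d3", "1 0 obj", "<</FDF<</Fields["]
    for name, value in fields.items():
        written = value if value.startswith("/") else f"({_escape(value)})"
        lines.append(f"<</T({_escape(name)})/V{written}>>")
    lines += ["]>>/Type/Catalog>>", "endobj", "trailer", "<</Root 1 0 R>>", "%%EOF"]
    # Latin-1 keeps accented field names such as "verificación" to one byte
    # per character; characters outside Latin-1 raise UnicodeEncodeError.
    return ("\n".join(lines) + "\n").encode("latin-1")


with open("data.fdf", "wb") as handle:
    handle.write(build_fdf({"BOTON_JORN": "/S", "ID_EMPR": "16083466A"}))
```

The resulting file could then be merged with something like `pdftk form.pdf fill_form data.fdf output filled.pdf`, where the three file names are placeholders.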

As another example: for the file filled-out_5.pdf, in which I told you I'm not able to change the checkbox BOTON_TPCON1, the FDF file is (change .txt to .fdf): filled-out_5_datos.txt. It is quite simple, and only the /V value seems to have been altered.

To generate an fdf file, open the pdf file with acrobat -> file -> create -> create form

pubpub-zz commented 1 year ago

@MartinThoma I propose closing this issue, unless you plan some work on FDF files, but that is too far away from PDF for me.

MartinThoma commented 1 year ago

I (sadly) have to agree: I don't see FDF support happening soon and I don't see us making progress here.

I have added a link to https://github.com/py-pdf/pypdf/discussions/1181 . Feel free to add here or there more details on FDF (PRs introducing support would also be very welcome!).

The fact that I'm closing this is a reflection on the fact that no core contributor will pick this up in the next half year. We want this support in pypdf, but we don't have the resources to make it happen any time soon.

Luisonson commented 1 year ago

OK, no problem. Thanks for your time.