pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

In some cases, adding an annotation and saving with saveIncr breaks the signature of the original document #2062

Closed pulse-mind closed 1 year ago

pulse-mind commented 1 year ago

Describe the bug (mandatory)

To Reproduce (mandatory)

import fitz  # PyMuPDF

pdf = fitz.open(pdf_output_file_with_path)
for page in pdf:
    # Annot prototype: a small free-text annotation near the top of each page.
    r = fitz.Rect(10, 10, 300, 20)
    t1 = "1a8c9b3c-d3cb-4eeb-bfb6-a225d8553f69"
    annot = page.add_freetext_annot(
        r,
        t1,
        fontsize=10,
        rotate=0,
        text_color=(0, 0, 0),
        align=fitz.TEXT_ALIGN_LEFT,
    )
    annot.update(text_color=(0, 0, 0))

# Append the changes to the existing file instead of rewriting it.
pdf.saveIncr()

In other cases (with other documents/signatures) it works fine.

Expected behavior (optional)

The signature is not broken: Acrobat validates the signature just as it does on the original document.

Screenshots

[Screenshot: signature_pymupdf]

Your configuration (mandatory)

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).

>>> print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]
 linux

PyMuPDF 1.21.0: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-08 00:00:01.
Built for Python 3.10 on linux (64-bit).

JorjMcKie commented 1 year ago

As you stated yourself, this situation occurs intermittently. So we cannot handle this issue without a reproducing file. Please provide a document where your code fails.

To keep signatures valid, incremental saving is required - but it is no guarantee either.

pulse-mind commented 1 year ago

Hi, thank you for your answer. Can I send it to the email address listed on your GitHub profile?

JorjMcKie commented 1 year ago

> Hi, thank you for your answer. Can I send it to the email address listed on your GitHub profile?

Sure, but please be aware that I may have to share the file with colleagues from Artifex.

pulse-mind commented 1 year ago

Yes you can share it. I sent you the file.

Thanks a lot.

JorjMcKie commented 1 year ago

I have reviewed the file and made a few tests. Here are the findings: Method doc.saveIncr() works exactly as it should:

So the reason for losing the signed status must be that some check is performed on whether the file has been changed at all.

I am not sure which part of the signature field specification is responsible for this type of check: /Filter /Adobe.PPKLite?, /SubFilter /ETSI.CAdES.detached? It definitely is something outside (Py-) MuPDF's responsibility. There is no bug here.

BTW, some other viewers (Foxit, Nitro Reader) do not recognize the signer as being validated, in contrast to Adobe, but that may be another problem.

JorjMcKie commented 1 year ago

As per the message ("byte range invalid"): Indeed, the byte range in the signature is [0 37841 56787 773], which includes bytes of the full old file, excluding any data appended because of incremental changes. The old file length is 56787 + 773 = 57560.
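The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch: the helper names (find_byte_ranges, signed_length, covers_whole_file) are made up for illustration and are not part of PyMuPDF, and the regular expression assumes the /ByteRange array is stored uncompressed as four plain integers, which may not hold for every signed PDF.

```python
import re

# Assumed layout: "/ByteRange [start1 len1 start2 len2]" in plain text.
BYTE_RANGE_RE = re.compile(
    rb"/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]"
)


def find_byte_ranges(pdf_bytes):
    """Return every /ByteRange array found in the raw bytes as a list of ints."""
    return [[int(n) for n in m] for m in BYTE_RANGE_RE.findall(pdf_bytes)]


def signed_length(byte_range):
    """Offset just past the last signed byte: start2 + len2."""
    _start1, _len1, start2, len2 = byte_range
    return start2 + len2


def covers_whole_file(byte_range, file_size):
    """True if the signed ranges start at byte 0 and end at the end of the file."""
    return byte_range[0] == 0 and signed_length(byte_range) == file_size


byte_range = [0, 37841, 56787, 773]
print(signed_length(byte_range))             # 57560
print(covers_whole_file(byte_range, 57560))  # True
print(covers_whole_file(byte_range, 60000))  # False
```

For a file on disk, one could pass the result of reading it in binary mode to find_byte_ranges and compare signed_length against the file size: after an incremental save the file is longer than the signed length, which is what the byte range above shows.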

pulse-mind commented 1 year ago

Thanks a lot for the time spent on that. I will check with the person in charge of the signature...

pulse-mind commented 1 year ago

There is something strange, because I did the same thing with the pdftron library (using a test licence) and also with pdfbox (Java), and the signature was still valid.

JorjMcKie commented 1 year ago

> There is something strange, because I did the same thing with the pdftron library (using a test licence) and also with pdfbox (Java), and the signature was still valid.

To see what actually has been happening when using incremental save, I compared the before / after versions of your nice example file with a "diff" generating program. This clearly proved that the changes exclusively happened by appending data after the original's final %%EOF\n line - which is what should be happening.
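The same "append only" property can also be checked without a diff program. The following is a minimal sketch, not the tool used above; the function name appended_tail is hypothetical, and the demo uses tiny made-up byte strings instead of real PDFs.

```python
def appended_tail(original: bytes, updated: bytes):
    """Return the bytes appended to `original`, or None if `updated`
    is not simply `original` followed by extra data."""
    if not updated.startswith(original):
        return None
    return updated[len(original):]


# Made-up stand-ins for the files before and after an incremental save.
before = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n"
after = before + b"2 0 obj\n<<>>\nendobj\n%%EOF\n"

tail = appended_tail(before, after)
print(tail is not None)        # the original bytes are untouched
print(len(after) - len(before) == len(tail))

# A rewritten (non-incremental) file would not keep the original prefix.
print(appended_tail(before, b"%PDF-1.7\nrewritten\n%%EOF\n"))
```

With real files, the two arguments would be the contents of the original and the annotated PDF read in binary mode. A non-None result means every original byte, including the signed ranges, is unchanged.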

pulse-mind commented 1 year ago

Thank you. I tried to use diff, but it only reports that the files are binary:

Binary files 221113_DOS-POLLOS_Feuille-de-presence.pdf and 221113_DOS-POLLOS_Feuille-de-presence_annot.pdf differ

Do you preprocess the files in some way to extract something from them first?

I want to do the same as you and then compare with the results produced by other tools like pdftron or pdfbox, so that I can perhaps give the right information to the signature server provider or to you. I understand what you are saying, but I do not understand why the signature stays valid in Acrobat with the other tools and not when I use pymupdf.

Best regards, Fred

pulse-mind commented 1 year ago

I ran less on the files, copied and pasted the final part after the %%EOF, and got these files: 1-pymupdf.txt, 2-pdfbox.txt, 3-pdftron.txt. I do not really know what to do with them :D
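Copying the part after the %%EOF by hand can be replaced by a small script that works on a single file. This is a rough sketch under stated assumptions: the helper names (revision_ends, last_update) are invented for illustration, and the code treats every literal %%EOF as the end of a revision, which could misfire if that byte sequence happens to occur inside stream data.

```python
MARKER = b"%%EOF"


def revision_ends(pdf_bytes):
    """Offsets just past each %%EOF marker, including one trailing line ending."""
    ends = []
    pos = 0
    while True:
        idx = pdf_bytes.find(MARKER, pos)
        if idx == -1:
            break
        end = idx + len(MARKER)
        # Swallow a single end-of-line sequence after the marker, if present.
        if pdf_bytes[end:end + 2] == b"\r\n":
            end += 2
        elif pdf_bytes[end:end + 1] in (b"\n", b"\r"):
            end += 1
        ends.append(end)
        pos = end
    return ends


def last_update(pdf_bytes):
    """Bytes of the most recent incremental update, or b'' if there is none."""
    ends = revision_ends(pdf_bytes)
    if len(ends) < 2:
        return b""
    return pdf_bytes[ends[-2]:ends[-1]]


# Made-up stand-in for a PDF with one incremental update.
base = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\n%%EOF\n"
update = b"2 0 obj\n<<>>\nendobj\n%%EOF\n"
data = base + update

print(revision_ends(data) == [len(base), len(data)])
print(last_update(data).decode("latin-1"))
```

Running last_update on the outputs of the different tools and writing each result to a text file would give comparable dumps of what each tool appended.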

JorjMcKie commented 1 year ago

Let us convert this to a "Discussions" item first. Then please attach the complete PDF of the pdfbox output, and I will compare that with the pymupdf output pdf. Thanks.