pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Password protected PDF documents #3865

Closed lfoppiano closed 2 weeks ago

lfoppiano commented 2 weeks ago

Description of the bug

I'm not sure this is a bug. There are PDF documents that are password-protected against copying their content, and I was wondering whether this kind of protection is supported by PyMuPDF.

For example, with this document, I get needs_pass = 0 and is_encrypted = False. So I wonder if there is a way to get all the security-related information.

How to reproduce the bug

import pymupdf

doc = pymupdf.open("pdf_document")
print(f"needs pass: {doc.needs_pass}, is encrypted: {doc.is_encrypted}")

result:

needs pass: 0, is encrypted: False

PyMuPDF version

1.24.10

Operating system

macOS

Python version

3.10

JorjMcKie commented 2 weeks ago

Sure you can:

perms = {
    "access": pymupdf.PDF_PERM_ACCESSIBILITY,
    "annotate": pymupdf.PDF_PERM_ANNOTATE,
    "assemble": pymupdf.PDF_PERM_ASSEMBLE,
    "copy": pymupdf.PDF_PERM_COPY,
    "form": pymupdf.PDF_PERM_FORM,
    "modify": pymupdf.PDF_PERM_MODIFY,
    "print": pymupdf.PDF_PERM_PRINT,
    "print_hq": pymupdf.PDF_PERM_PRINT_HQ,
}

# Test each permission bit against the document's permission flags.
for name, flag in perms.items():
    print(name, "=", bool(doc.permissions & flag))

access = True
annotate = False
assemble = False
copy = False
form = False
modify = False
print = True
print_hq = True

from pprint import pprint

pprint(doc.metadata)
{'author': '',
 'creationDate': "D:20161114152803+09'00'",
 'creator': 'Adobe InDesign CS5_J (7.0.4)',
 'encryption': 'Standard V4 R4 128-bit RC4',
 'format': 'PDF 1.6',
 'keywords': '',
 'modDate': "D:20161114154551+09'00'",
 'producer': 'Adobe PDF Library 9.9',
 'subject': '',
 'title': '植物28-4_星ほか aid.indd',
 'trapped': ''}

This is not a bug: the PDF needs no password for access, in other words, no user password is required. But the metadata show that the document is encrypted to limit access to a permitted subset of operations. So an owner password is needed for full permissions.
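To make the "permitted subset" concrete, here is a minimal sketch that decodes a permission flag integer without needing a PDF at hand. It assumes the bit values that the PDF specification assigns to each permission, which are the same values behind the pymupdf.PDF_PERM_* constants used above. The helper names decode_permissions and restricted_operations are made up for this illustration and are not part of PyMuPDF.

```python
# Permission bit values as defined by the PDF specification.
# Assumption: these mirror the pymupdf.PDF_PERM_* constants used earlier.
PERMISSION_BITS = {
    "print": 4,         # PDF_PERM_PRINT
    "modify": 8,        # PDF_PERM_MODIFY
    "copy": 16,         # PDF_PERM_COPY
    "annotate": 32,     # PDF_PERM_ANNOTATE
    "form": 256,        # PDF_PERM_FORM
    "access": 512,      # PDF_PERM_ACCESSIBILITY
    "assemble": 1024,   # PDF_PERM_ASSEMBLE
    "print_hq": 2048,   # PDF_PERM_PRINT_HQ
}


def decode_permissions(flags: int) -> dict:
    """Map each permission name to True if its bit is set in flags."""
    return {name: bool(flags & bit) for name, bit in PERMISSION_BITS.items()}


def restricted_operations(flags: int) -> list:
    """Return the sorted names of the operations whose bit is not set."""
    decoded = decode_permissions(flags)
    return sorted(name for name, allowed in decoded.items() if not allowed)


# Hypothetical flag value with only print, access and print_hq granted,
# which corresponds to the permission listing shown above.
example_flags = 4 | 512 | 2048
print(decode_permissions(example_flags))
print(restricted_operations(example_flags))
```

With a real document you would pass doc.permissions instead of example_flags. The restricted operations are what a conforming viewer is expected to disable unless the owner password is supplied, for example via doc.authenticate().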

Because of the ambiguous wording in the PDF specification, it may seem confusing that "copying" is prohibited while we can obviously still extract the text and make copies of it. This is not a bug either, because any PDF viewer does (and must do) the same thing to display the content. So a prohibited copy permission only means that PDF viewers should (!) disable copy/paste, that's all. All Python packages can extract text from this file.

lfoppiano commented 2 weeks ago

Thanks! I realized I picked up a PDF that was actually working.