smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars, 534 forks

Add ability to ignore PDF encryption check #632

Closed: DivineOmega closed this 9 months ago

DivineOmega commented 10 months ago

In some cases, PDF files are internally marked as encrypted even though the content is not encrypted and can be read.

This MR adds a config option that tells the PDF parser to ignore the encryption flag and attempt to read the PDF anyway.

It therefore provides a workaround for the following issues:

GreyWyvern commented 10 months ago

This is a good addition, but as the OP says, it is a workaround. Eventually, the simple check in Parser::parseContent() should be modified to test whether the document actually cannot be read.

        if (isset($xref['trailer']['encrypt'])) {
            throw new \Exception('Secured pdf file are currently not supported.');
        }

Keep in mind that a future fix for this would make the config option being added here obsolete. That's probably the only thing I don't like about this change.
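As a rough illustration of the idea under discussion, the quoted check could be gated by a flag. This is only a minimal sketch: `SketchConfig`, its `ignoreEncryption` property, and `guardEncryption()` are hypothetical names invented here, not the library's API, and the `$xref` array is assumed to have the shape shown in the snippet above.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a config object; the property name is an assumption.
final class SketchConfig
{
    public bool $ignoreEncryption = false;
}

/**
 * Same condition as the quoted check, but skipped when the caller opts out.
 *
 * @param array $xref assumed shape: ['trailer' => ['encrypt' => ...]]
 */
function guardEncryption(array $xref, SketchConfig $config): void
{
    // Only refuse the document when it is flagged as encrypted
    // and the caller has not asked to ignore that flag.
    if (isset($xref['trailer']['encrypt']) && !$config->ignoreEncryption) {
        throw new \Exception('Secured pdf file are currently not supported.');
    }
}
```

With the flag left at its default, the behaviour is unchanged. If the check were later changed to detect whether the content is really unreadable, a flag like this would no longer be needed, which is the concern raised above.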

k00ni commented 9 months ago

@DivineOmega Are you still with us here?

DivineOmega commented 9 months ago

Hi. Sorry for the delayed response. Things have been busy recently.

I didn't end up using this functionality myself. I found that most of the PDFs for which I ignored the encryption check were parsed as containing no text, or very little useful text. I'm not sure why this is, so the workaround ended up not being useful for my use case.

This library still provides some of the best parsing I've found. My solution was to use an alternative parser if this one detected an encrypted PDF.
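The fallback described here might look roughly like the sketch below. Both parsers are passed in as callables, so nothing in it is the library's API. `extractTextWithFallback()` is a hypothetical helper, and matching on the exception message is an assumption based on the generic `\Exception` quoted earlier in the thread.

```php
<?php
declare(strict_types=1);

/**
 * Try the primary parser first; switch to the fallback only when the
 * primary refuses the file as secured. Other errors are rethrown.
 *
 * @param callable $primary  function (string $path): string
 * @param callable $fallback function (string $path): string
 */
function extractTextWithFallback(string $path, callable $primary, callable $fallback): string
{
    try {
        return $primary($path);
    } catch (\Exception $e) {
        // Assumption: the "secured" case is recognisable only by its message.
        if (strpos($e->getMessage(), 'Secured pdf file') !== false) {
            return $fallback($path);
        }
        throw $e;
    }
}
```

Matching on message text is brittle, so a dedicated exception type would be a cleaner signal if one were available.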

unixnut commented 7 months ago

@k00ni Can you please reopen and merge this? In some cases the PDFs come from a predictable origin and are readable, but are marked as encrypted. I believe it is up to the caller to test that the data they get is valid.
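A caller-side validity test of that kind could be as simple as the sketch below. `looksLikeUsableText()` and its threshold of 20 characters are hypothetical choices for illustration, not anything the library provides, and a real caller would likely want a check tailored to the documents it expects.

```php
<?php
declare(strict_types=1);

/**
 * Heuristic: treat extracted text as usable only if it contains at least
 * $minChars letters or digits. The threshold is an arbitrary assumption.
 */
function looksLikeUsableText(string $text, int $minChars = 20): bool
{
    // Count letters and digits only, so whitespace and stray symbols do not pass.
    $count = preg_match_all('/[\p{L}\p{N}]/u', $text);

    // preg_match_all() returns false on error, e.g. for invalid UTF-8 input.
    return $count !== false && $count >= $minChars;
}
```

A caller could run this on the parser's output and treat a failing result the same way as an unreadable file.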

I am willing to write the test (using test.pdf from #488) and the docs. But first I would need agreement that the merge would be done if those conditions are met.

Thanks.

k00ni commented 7 months ago

@unixnut Thank you for your interest. You have my full support. It would be great if we could agree on the following list: