smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

Account for inline images in formatContent() #693

Closed GreyWyvern closed 1 month ago

GreyWyvern commented 3 months ago

Type of pull request

About

formatContent() now accounts for inline image BI ... ID ... EI commands in document streams. Resolves #691.

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

GreyWyvern commented 3 months ago

Converting this to a draft for now. @iGrog supplied another PDF that still had the issue, and in fixing it, I'm sure there is an edge case: if a (string) contains the BI keyword, and then ID and EI can be found further on in the document, the potential is there for a large chunk of the document to be ignored. Very small chance this happens, but it's there.

The internal content of the captured BI ... ID ... EI needs to be checked to verify that it is indeed inline image content before allowing the replace. I'll work on this and update this PR when ready.

k00ni commented 3 months ago

I really appreciate you taking the time!

GreyWyvern commented 1 month ago

So, the last thing left here that the code wouldn't cover is a proper inline image, that doesn't have a proper image-properties dictionary with a width and height. The code in this PR then skips over it, but the potential is there for such an inline image (probably very rare if it happens at all) to contain binary content that can potentially cause errors in the way PdfParser interprets the document stream. (Like unbalanced Q/q etc.)

We can:

I've no data to back it up, but I believe the second case, where formatContent() finds a BI ... ID ... EI sequence outside of a string, but in error, is a probably rarer than an inline image dictionary not containing a height and width. But then again I could be wrong!

Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.

k00ni commented 1 month ago

Thank you very much @GreyWyvern