GreyWyvern commented 3 months ago

Type of pull request

[X] Bug fix (involves code and configuration changes)

About

formatContent() now accounts for inline image BI ... ID ... EI commands in document streams. Resolves #691.

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information:

[X] Please add at least one test case (unit test, system test, ...) to demonstrate that the change is working. If existing code was changed, your tests cover these code parts as well.
[X] Please run PHP-CS-Fixer before committing, to conform with our coding style. See https://github.com/smalot/pdfparser/blob/master/.php-cs-fixer.php for more information about our coding style.
[X] In case you fix an existing issue, please do one of the following:
- [X] Write in this text something like fixes #1234 to outline that you are providing a fix for the issue #1234.

GreyWyvern commented 3 months ago

Converting this to a draft for now. @iGrog supplied another PDF that still had the issue, and in fixing it, I'm sure there is still an edge case: if a (string) contains the BI keyword, and ID and EI can then be found further on in the document, a large chunk of the document could be ignored. The chance of this happening is very small, but it's there.

The internal content of the captured BI ... ID ... EI needs to be checked to verify that it is indeed inline image content before allowing the replace. I'll work on this and update this PR when ready.
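A rough sketch of that kind of check, in Python rather than PdfParser's PHP and not the code in this PR. The names `CANDIDATE`, `has_dimensions` and `strip_inline_images` are hypothetical, and the only assumption about the format is that an inline image dictionary carries a width (`/W` or `/Width`) and a height (`/H` or `/Height`):

```python
import re

# Hypothetical pattern (not the PR's regex): BI <dictionary> ID <data> EI,
# with every keyword delimited by whitespace.
CANDIDATE = re.compile(
    rb"(?<!\S)BI\s(?P<dict>.*?)\sID\s(?P<data>.*?)\sEI(?!\S)", re.DOTALL
)
# Inline image dictionaries may use abbreviated or full key names.
WIDTH = re.compile(rb"/(?:W|Width)(?![A-Za-z])")
HEIGHT = re.compile(rb"/(?:H|Height)(?![A-Za-z])")


def has_dimensions(dictionary: bytes) -> bool:
    """True if the text between BI and ID names both a width and a height."""
    return bool(WIDTH.search(dictionary) and HEIGHT.search(dictionary))


def strip_inline_images(stream: bytes) -> bytes:
    """Remove BI ... ID ... EI candidates, but only when they look like images."""
    def replace(match):
        if has_dimensions(match.group("dict")):
            return b""            # looks like a real inline image: drop it
        return match.group(0)     # probably a false positive: leave untouched

    return CANDIDATE.sub(replace, stream)


image = b"q BI /W 2 /H 2 /CS /G /BPC 8 ID \x00\xff\x00\xff EI Q"
tricky = b"BT (See BI and ID for the EI value) Tj ET"

cleaned = strip_inline_images(image)      # image removed, q ... Q kept
untouched = strip_inline_images(tricky)   # string text survives
```

The pattern on its own does match inside the `tricky` string, which is exactly the edge case described above; the dictionary check is what stops the replace from eating that text.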

k00ni commented 3 months ago

I really appreciate you taking the time!

GreyWyvern commented 1 month ago

So, the last thing the code here wouldn't cover is a genuine inline image that lacks a proper image-properties dictionary with a width and height. The code in this PR skips over such an image, but it (probably very rare, if it happens at all) could contain binary content that causes errors in the way PdfParser interprets the document stream (like unbalanced Q/q, etc.).

We can:

1. Just accept it as is; a document with such an inline image is malformed anyway, so there should be no expectation of error-free parsing in that case.
2. Not check for the height and width in the dictionary at all, and just accept all BI ... ID ... EI sequences outside of strings as "valid" inline images. This allows the (minuscule?) possibility of finding false-positive inline image sequences.
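To make the trade-off between the two options concrete, here is a minimal Python sketch (again, not PdfParser's PHP and not this PR's implementation). The function names and the `require_dimensions` flag are made up for illustration: `True` corresponds to keeping the dictionary check, `False` to accepting every BI ... ID ... EI sequence found outside a string.

```python
import re

# Hypothetical pattern, anchored at a position chosen by the scanner below.
BI_AT = re.compile(rb"BI\s(?P<dict>.*?)\sID\s(?P<data>.*?)\sEI(?!\S)", re.DOTALL)
DIMENSIONS = (
    re.compile(rb"/(?:W|Width)(?![A-Za-z])"),
    re.compile(rb"/(?:H|Height)(?![A-Za-z])"),
)


def skip_string(stream: bytes, start: int) -> int:
    """Return the index just past the literal (string) that opens at `start`."""
    depth, i = 0, start
    while i < len(stream):
        c = stream[i:i + 1]
        if c == b"\\":            # escaped character: skip it as well
            i += 2
            continue
        if c == b"(":
            depth += 1
        elif c == b")":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return len(stream)            # unterminated string: treat the rest as string


def strip_inline_images(stream: bytes, require_dimensions: bool = True) -> bytes:
    out = bytearray()
    i = 0
    while i < len(stream):
        if stream[i:i + 1] == b"(":
            end = skip_string(stream, i)      # never look for BI inside strings
            out += stream[i:end]
            i = end
            continue
        at_token_start = i == 0 or stream[i - 1:i].isspace()
        m = BI_AT.match(stream, i) if at_token_start else None
        if m and (not require_dimensions
                  or all(p.search(m["dict"]) for p in DIMENSIONS)):
            i = m.end()                       # drop the whole inline image
            continue
        out += stream[i:i + 1]
        i += 1
    return bytes(out)


no_dims = b"q BI /BPC 8 ID \xff EI Q"
in_string = b"(BI x ID y EI) Tj"

strict = strip_inline_images(no_dims, require_dimensions=True)       # image kept
permissive = strip_inline_images(no_dims, require_dimensions=False)  # image removed
```

With the check on, the dimensionless image stays in the stream along with its binary data; with the check off, it is removed, and the only guard left against false positives is the string skipping.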

I've no data to back it up, but I believe the second case, where formatContent() finds a BI ... ID ... EI sequence outside of a string, but in error, is probably rarer than an inline image dictionary not containing a height and width. But then again, I could be wrong!

Regardless, I would recommend keeping the dictionary check just in case. If it gets released and users find the array-access error again, then we can always remove it. In this case, this PR is ready to be taken out of draft status as-is.

k00ni commented 1 month ago

Thank you very much @GreyWyvern

smalot / pdfparser

Account for inline images in formatContent() #693
