smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

Fix returning empty text in some cases #666

Closed xAzoom closed 4 months ago

xAzoom commented 5 months ago

Type of pull request

Bug fix (involves code and configuration changes)

About

In some cases, we may have received empty text, due to an invalid !\in_array($xobject->getUniqueId(), self::$recursionStack) comparison in PDFObject (e.g., in_array('0000000000000e110000000000000000', ['0000000000000e030000000000000000']) will return true). For more information, see https://www.php.net/manual/en/function.spl-object-hash.php notes. In this PR strict parameter to in_array has been added.

xAzoom commented 5 months ago

Thank you @xAzoom for this PR!

Please provide at least one unit test to cover this change. Or is there already one?

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

Is it possible to release v2.7.1 with this fix?

k00ni commented 5 months ago

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

I also have no idea currently how to mock this code part. It might be possible, but is not worth my time.

Can you provide a PDF which triggered the problem before this fix? That would also fit.

Is it possible to release v2.7.1 with this fix?

It could be part of our next release (v2.8.1) when its ready. But I can give you no fixed date.

GreyWyvern commented 5 months ago

After examining this PR, my first thought was: "There's gotta be more than just two places in PdfParser that use in_array()!" So I checked, and it is in fact real. These are the only two places in the whole source that use that function. Amazing!

While it would be easy to create a test to directly compare strict vs. non-strict in_array(), the two locations changed are extracting id text from the document to use. So it would be much easier if we had a test document to use for the test suite.

If that's impossible, I can see if I can somehow create some fake document code that tests this, but I ain't looking forward to it!

k00ni commented 5 months ago

Please merge in master branch, to get rid of these coding style issues.

xAzoom commented 5 months ago

Unfortunately, I can't mock the internal protected getUniqueId() function in PDFObject to ensure that the comparison is already correct. But I think it's most important that this change doesn't break any part of the library and passes all tests.

I also have no idea currently how to mock this code part. It might be possible, but is not worth my time.

Can you provide a PDF which triggered the problem before this fix? That would also fit.

Is it possible to release v2.7.1 with this fix?

It could be part of our next release (v2.8.1) when its ready. But I can give you no fixed date.

Here I also have the bad news, it totally does not depend on the PDF document, but rather on the structure of the code. My quick fix in my project was to move one of the test methods up, and all tests passed (I didn't actually change any line of code, just reordered the methods). So I suppose spl_object_hash works in a deterministic way, but it will be difficult to simulate the creation of hashes like "0000000000000e030000000000000000000000000000".

k00ni commented 4 months ago

Sorry for the delay here. We recently fixed Scrutinizer which provides code coverage. Its report says that the lines in question are covered by

I think its reasonable to merge it without any tests.