smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

[]TJ command parsed improperly #710

Closed by DisabledMonkey 1 month ago

DisabledMonkey commented 1 month ago

Description:

PdfParser incorrectly parses a PDF containing []TJ commands: part of the command is returned as part of the parsed text.

Per this section of code https://github.com/smalot/pdfparser/blob/14adf318f8620a6195c0b00d51c6a507837b9ff4/src/Smalot/PdfParser/PDFObject.php#L358-L365 it sounds like each command is supposed to be returned as a single line of text. But looking at the results, we can see that the formatContent() method returned that []TJ command across several lines (screenshot).

PDF input

It is a document containing financial information for the company I work for, so I can't provide it.

Expected output & actual output

Expected: actual English text

Actual: part of the []TJ command is returned (screenshot)

In version 2.7 the PDF in question parsed correctly, but it hasn't in any version since.

Code

This might not be the proper way to fix the problem, but I saw that the dictionary command had a similar thing accounted for, so I added a bit of code to do the same thing for the TJ command, and it fixed my output:

while (preg_match('/(\[.*?\] *)(TJ)/', $content, $dicttext)) {
    $dictid = uniqid('DICT_', true);
    $dictstore[$dictid] = $dicttext[1];
    $content = preg_replace(
        '/'.preg_quote($dicttext[0], '/').'/',
        ' ###'.$dictid.'###'.$dicttext[2],
        $content,
        1
    );
}

GreyWyvern commented 1 month ago

This likely has something to do with some representation of newlines in strings that isn't being escaped. I'll need to know the initial state of the document stream @DisabledMonkey.

Please add the following line in your PDFObject.php as the first line of the formatContent() function:

var_dump($content);

And let me know what output you see, specifically around this TJ command. Thanks.

DisabledMonkey commented 1 month ago

Looking at just a small portion, this is what it looks like before any processing in the formatContent() function:

[(.)35.2013(\r)73.2169(\x05)39.5429(\r)73.2169(\n)18.7748(\x1E)-5.3566(\x1F)54.1166(\n)4.20113(#)13.7749(\v)36.1006(\x1E)9.21705(\t)91.217(\x17)83.1011(\x17)83.1011(\x02)74.2167(\x1E)9.21705(\x06)57.1009(\x1F)39.5421(\t)91.217(\n)18.7757(\x03)446]TJ

So it does seem like the (\n) entries in there are what cause the problem, and escaping those in some manner should hopefully fix it.

GreyWyvern commented 1 month ago

Thanks, but there is something that's not showing up here, since I can copy and paste that string into the unit tests and it parses properly.

Change the added line to var_dump(bin2hex($content)); then do a search for 5b282e2933352e32303133285c722937332e3231 in the output and paste 500 characters here, beginning from the matched text.
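The hex-dump step described above can be sketched as a small helper. This is only an illustration: `hexWindow()` is a hypothetical name, not part of PdfParser, and the sample stream is made up. It assumes `$content` is the raw string that formatContent() receives.

```php
<?php
// Hypothetical debugging helper, not part of PdfParser.
// Returns up to $length hex characters of $content, starting at the first
// byte-aligned occurrence of $needleHex, or null when there is no match.
function hexWindow(string $content, string $needleHex, int $length = 500): ?string
{
    $hex = bin2hex($content);
    $needle = strtolower($needleHex);
    $offset = 0;
    while (($pos = strpos($hex, $needle, $offset)) !== false) {
        // Only accept matches that start on a byte boundary (even offset).
        if ($pos % 2 === 0) {
            return substr($hex, $pos, $length);
        }
        $offset = $pos + 1;
    }
    return null;
}

// Made-up stream fragment for illustration (an assumption, not the real document).
$content = "BT\n[(.)35.2013(\\r)73.21]TJ\nET\n";
var_dump(hexWindow($content, '5b282e29', 20));
```

Searching the hex form rather than the raw string avoids problems with unprintable bytes getting lost when the output is copied into a comment.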

DisabledMonkey commented 1 month ago

here you go:

5b282e2933352e32303133285c722937332e3231363928052933392e35343239285c722937332e32313639285c6e2931382e37373438281e292d352e33353636281f2935342e3131365b282e2933352e32303133285c722937332e323136285c6e29342e323031313328232931332e37373439280b2933362e31303036281e29392e323137303528092939312e32313728172938332e3130313128172938332e3130313128022937342e32313637281e29392e323137303528062935372e31303039281f2933392e3534323128092939312e323137285c6e2931382e373735372803293434365d544a0a45540a510a302e36323839303620670a

thanks
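Decoding the start and end of this dump with PHP's `hex2bin()` gives a quick check on the newline question. A minimal sketch using only bytes copied from the dump above: `5c6e` is a backslash followed by `n`, whereas a raw line feed would appear as `0a`.

```php
<?php
// First and last bytes of the hex dump above, decoded back to raw bytes.
$head = hex2bin('5b282e2933352e32303133285c7229');
$tail = hex2bin('285c6e2931382e373735372803293434365d544a0a45540a');

var_dump($head);

// Look only at the part of the tail that sits inside the array, before "]TJ".
$array = substr($tail, 0, strpos($tail, ']TJ'));
var_dump(substr_count($array, "\n"));  // raw line-feed bytes inside the array
var_dump(substr_count($array, '\\n')); // escaped backslash-n sequences
```

In this fragment the line breaks inside the strings are two-character escape sequences, and the only raw line feeds come after `]TJ`. That suggests, for this portion at least, that unescaped newlines inside the strings are not what splits the command.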

GreyWyvern commented 1 month ago

Does the match appear more than once? The string you posted doesn't quite match with the one from your previous post.

DisabledMonkey commented 1 month ago

So I dug through until I eventually found a portion that trips it up. It almost seems like something earlier in the PDF causes it to break in later parts.

This chunk should return at least one bad []TJ block:

44503c3c2f507265646963746f722031350a2f436f6c756d6e732031320a2f436f6c6f727320333e3e0a494420789c637cfefafdcea3171980e00703047c40625b1b8a9bea6b322ed87020dedf1e24f093e12344e60744dd47202320a3e0c28e053045d85400190950452b0ec407d863550124130a608a0202ecb1aa0002a8a2092b0e240015fd607088486440020b264cf8f083a1a002aec8c31ed38c0f6012a128c0c31eab0a54450e06585500b9050d70451606585500051b208a4e5fbc9e5adec98003ccee2c0785382e69640000f755840d0a454920510a3020670a710a382e33333333332030203020382e33333333332030203020636d2042540a2f52313420382e32352054660a302e3939383036352030203020312035342e3731362036343520546d0a5b28072934352e373133281a2931312e37373533280b2933362e3130303628172938332e3130303228022937342e3231363728032938312e36353833280429342e323031313328052935342e3131363628062935372e31303039281f2935342e31313636285c6e29342e323031313328232931332e37373439280b2933362e31303036281d292d302e373938373735285c722937332e3231363928052935342e31313636285c722935382e36343333285c6e2931382e37373438281e29392e32313739342802293532365d544a0a45540a510a302e36323839303620670a3330302035333331203236383820362072650a660a710a343338382035333631203339332038332072652057206e0a3020670a710a382e33333333332030203020382e33333333332030203020636d20
GreyWyvern commented 1 month ago

In your line-numbered screenshot from the OP, can you provide lines 525 to 600?

DisabledMonkey commented 1 month ago

Sorry for all the back and forth. It's obviously difficult to pick and choose pieces of it.

My boss gave me permission to share a similar doc from 2018 that has the same problem.

GreyWyvern commented 1 month ago

Thanks for that. I cannot reproduce the broken TJ command behaviour with my current copy of PdfParser, but your document does contain inline images. Can you check whether this issue might be resolved by the recently merged #693?

If not that, perhaps it has something to do with your PHP version, 7.2?
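For context on the inline image remark: the hex chunk posted earlier decodes to text beginning `DP<</Predictor 15`, followed by an `ID` operator, compressed binary data, and later `EI Q`, which is the tail of an inline image. Binary image data can contain raw CR and LF bytes, which is one plausible way for line-based processing of the stream to go wrong. The sketch below is a naive check for that situation. `inlineImageLineBreaks()` is a hypothetical name, the sample stream is made up, and none of this is PdfParser code or a description of what #693 changes.

```php
<?php
// Naive, hypothetical check: for each inline image in a content stream,
// count the raw CR/LF bytes found between the ID and EI operators.
function inlineImageLineBreaks(string $stream): array
{
    $counts = [];
    // The /s flag lets "." match line breaks. Binary data may itself contain
    // "EI", so this lazy match is only good enough for a quick look.
    if (preg_match_all('/\sID\s(.*?)\sEI\s/s', $stream, $matches)) {
        foreach ($matches[1] as $data) {
            $counts[] = preg_match_all('/[\r\n]/', $data);
        }
    }
    return $counts;
}

// Made-up stream: one inline image whose data contains a LF and a CR,
// followed by an ordinary text block.
$stream = "q\nBI\n/W 1 /H 1\nID \x01\n\x02\r\x03\nEI Q\nBT\n[(a)]TJ\nET\n";
var_dump(inlineImageLineBreaks($stream));
```

A non-empty result with counts above zero only shows that image data with raw line breaks is present. It does not by itself prove that the data is what breaks a later TJ command.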

DisabledMonkey commented 1 month ago

Can confirm the master branch is working, so it does appear that the changes in #693 fixed this. I should have just waited a few more days, I guess.

Thanks for your time and assistance.