smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

PdfParser does not consider the entire document stream #628

Closed GreyWyvern closed 8 months ago

GreyWyvern commented 11 months ago

https://github.com/smalot/pdfparser/blob/2608ac3c0db64802ffbe0ce648de7dd0825f2b5d/src/Smalot/PdfParser/PDFObject.php#L195-L201

Above is the code in PDFObject.php that extracts lines from a document stream to determine what to display. It only considers content between BT and ET commands (and maybe a Q or q on either side) to be valid commands. However, many valid commands such as cm (graphics position affecting initial position of BT) and Tf (font changes) can and do occur outside of BT ... ET blocks. Even q and Q occur regularly in streams and not just adjacent to BT ... ET. In order to more correctly display the content of a PDF, the entire stream must be used, with mainly graphics-related commands able to be ignored.

As well, q and Q are currently handled in a two state manner. If a q is encountered, the state is saved; if a Q is encountered, the saved state is restored. This does not account for the fact that multiple states can be saved and restored in a stack in a push/pop manner. Both fonts (Tf) and graphics positions (cm) should be stored in this fashion.

https://github.com/smalot/pdfparser/blob/2608ac3c0db64802ffbe0ce648de7dd0825f2b5d/src/Smalot/PdfParser/PDFObject.php#L387-L390

Affect on Positioning

In addition to ignoring cm positioning commands, PdfParser's treatment of Tm (set text matrix) and Td/TD (set text current point) does not take into account the full matrix position of 6 values. In the following example stream commands:

0.8 0 0 0.8 100 100 Tm
200 200 Td
(Hello World)Tj

... PdfParser only considers the 100 100 from the Tm command and sets that as the current text position. Then it sees the 200 200 from the Td and overwrites the current text position so it is now 200 200. The correct positioning interpretation is the following:

Fortunately we can ignore the graphics size ratios from the cm commands as they only affect graphics commands. :)

I'm preparing a PR that will essentially completely re-write the cleanContent(), getSectionsText(), getText(), and getCommandsText() methods from PDFObject.php (as well as a couple minor changes in Font.php and Page.php) to switch to this new way of interpreting the document stream. It is an extensive change which I hope gets a lot of scrutiny! Already in my test environment it is passing all unit tests except one, and resolves a large number of open issues.

Opening this issue for discussion purposes, and I may start tagging issues here that will (hopefully) be resolved by the change.

GreyWyvern commented 11 months ago

Putting this down just so I don't lose it...

Okay, so I've just gone through the whole list of issues, and where sample PDFs were available, I tested them using my new setup. The updated code I'm working on resolves the following issues (not tagged for cleanliness).

110, 149, 261, 353, 387, 398, 458, 508, 527, 528, 542, 551, 564, 568, 575, 576, 585, 607, 608, 628

In addition to the above, I can't verify whether it's my update that has fixed these, but they are resolved.

All tests are now passing, however I did have to modify several since the way the script handles and parses the document stream is now different.

GreyWyvern commented 11 months ago

I've gone through the list again, once with my updated setup, and once with the latest release v2.7.0 to get a definitive list of all the issues this change will fix. And here it is:

353, #398, #464, #474, 491, #508, #528, #537, #564, #568, #575, #576, #585, #608 and this issue 628.

Now I just need to write tests for these, lol.

GreyWyvern commented 10 months ago

I've committed my first set of changes here: https://github.com/GreyWyvern/pdfparser

There are still more tests needed, but at least you can try out the changes from the repo. :)

k00ni commented 10 months ago

There are still more tests needed, but at least you can try out the changes from the repo. :)

I suggest you create a PR regardless, because it will be easier to follow changes and discuss them. A draft PR will be sufficient at the beginning.