Closed GreyWyvern closed 8 months ago
Putting this down just so I don't lose it...
Okay, so I've just gone through the whole list of issues, and where sample PDFs were available, I tested them using my new setup. The updated code I'm working on resolves the following issues (not tagged for cleanliness).
110, 149, 261, 353, 387, 398, 458, 508, 527, 528, 542, 551, 564, 568, 575, 576, 585, 607, 608, 628
In addition to the above, I can't verify whether it's my update that has fixed these, but they are resolved.
All tests are now passing; however, I did have to modify several, since the way the script handles and parses the document stream is now different.
I've gone through the list again, once with my updated setup, and once with the latest release v2.7.0 to get a definitive list of all the issues this change will fix. And here it is:
Now I just need to write tests for these, lol.
I've committed my first set of changes here: https://github.com/GreyWyvern/pdfparser
There are still more tests needed, but at least you can try out the changes from the repo. :)
I suggest you create a PR regardless, because it will be easier to follow changes and discuss them. A draft PR will be sufficient at the beginning.
https://github.com/smalot/pdfparser/blob/2608ac3c0db64802ffbe0ce648de7dd0825f2b5d/src/Smalot/PdfParser/PDFObject.php#L195-L201
Above is the code in PDFObject.php that extracts lines from a document stream to determine what to display. It only considers content between `BT` and `ET` commands (and maybe a `Q` or `q` on either side) to be valid commands. However, many valid commands such as `cm` (graphics position affecting initial position of `BT`) and `Tf` (font changes) can and do occur outside of `BT ... ET` blocks. Even `q` and `Q` occur regularly in streams and not just adjacent to `BT ... ET`. In order to more correctly display the content of a PDF, the entire stream must be used, with mainly graphics-related commands able to be ignored.

As well, `q` and `Q` are currently handled in a two-state manner. If a `q` is encountered, the state is saved; if a `Q` is encountered, the saved state is restored. This does not account for the fact that multiple states can be saved and restored in a stack in a push/pop manner. Both fonts (`Tf`) and graphics positions (`cm`) should be stored in this fashion.

https://github.com/smalot/pdfparser/blob/2608ac3c0db64802ffbe0ce648de7dd0825f2b5d/src/Smalot/PdfParser/PDFObject.php#L387-L390
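To make the push/pop idea concrete, here is a minimal sketch in Python (not PdfParser's PHP code). It assumes the stream has already been tokenized into `(operator, operands)` pairs; the `State` class and `run()` function are hypothetical names for illustration only. For simplicity it records the most recent `cm` operands instead of concatenating matrices as a full interpreter would.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical snapshot of the state that q/Q should save and restore."""
    font: Optional[str] = None                   # font name from the last Tf
    font_size: float = 0.0                       # font size from the last Tf
    cm: Tuple[float, ...] = (1, 0, 0, 1, 0, 0)   # operands of the last cm


def run(commands: List[Tuple[str, list]]) -> List[Tuple[str, Optional[str]]]:
    """Walk the whole stream and return (text, active font) for each Tj."""
    stack: List[State] = []
    state = State()
    shown = []
    for op, args in commands:
        if op == "q":
            stack.append(state)        # push: nested saves are all kept
        elif op == "Q":
            if stack:                  # tolerate an unbalanced Q
                state = stack.pop()    # pop: restore the matching save
        elif op == "Tf":
            state = replace(state, font=args[0], font_size=float(args[1]))
        elif op == "cm":
            state = replace(state, cm=tuple(float(a) for a in args))
        elif op == "Tj":
            shown.append((args[0], state.font))
    return shown


# Two nested q ... Q pairs, with Tf occurring outside any BT ... ET block.
commands = [
    ("Tf", ["/F1", "12"]),
    ("q", []),
    ("Tf", ["/F2", "10"]),
    ("q", []),
    ("Tf", ["/F3", "8"]),
    ("Tj", ["inner"]),
    ("Q", []),
    ("Tj", ["middle"]),
    ("Q", []),
    ("Tj", ["outer"]),
]
result = run(commands)
print(result)
```

With a single saved slot, the second `q` would overwrite the first save, so the font could not be restored to `/F1` after the outer `Q`; with a stack, each `Q` returns to the state of its matching `q`.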
Effect on Positioning
In addition to ignoring `cm` positioning commands, PdfParser's treatment of `Tm` (set text matrix) and `Td`/`TD` (set text current point) does not take into account the full matrix position of 6 values. In the following example stream commands: ...

PdfParser only considers the `100 100` from the `Tm` command and sets that as the current text position. Then it sees the `200 200` from the `Td` and overwrites the current text position so it is now `200 200`. The correct positioning interpretation is the following:

1. Set the current text position to `100 100`.
2. Set the text size ratio to `0.8 0.8`.
3. Take the values from the `Td` command, multiply them by the text size ratio, then add them to the current text position: 200 x 0.8 + 100 = 260
4. The current text position is now `260 260` and not `200 200`.

Fortunately we can ignore the graphics size ratios from the `cm` commands as they only affect graphics commands. :)

I'm preparing a PR that will essentially completely re-write the `cleanContent()`, `getSectionsText()`, `getText()`, and `getCommandsText()` methods from PDFObject.php (as well as a couple minor changes in Font.php and Page.php) to switch to this new way of interpreting the document stream. It is an extensive change which I hope gets a lot of scrutiny! Already in my test environment it is passing all unit tests except one, and resolves a large number of open issues.

Opening this issue for discussion purposes, and I may start tagging issues here that will (hopefully) be resolved by the change.
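For anyone following the `Tm`/`Td` arithmetic from the positioning section, a rough Python sketch of the calculation is below. The operand values (`0.8 0 0 0.8 100 100 Tm` followed by `200 200 Td`) are an assumption reconstructed from the numbers quoted in the text, and `apply_tm()` / `apply_td()` are hypothetical helper names, not PdfParser methods.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]


def apply_tm(a: float, b: float, c: float, d: float, e: float, f: float) -> Matrix:
    """Tm replaces the text matrix using all six operands, not just e and f."""
    return (a, b, c, d, e, f)


def apply_td(tm: Matrix, tx: float, ty: float) -> Matrix:
    """Td moves by (tx, ty) in text space, so the offset is scaled by the matrix."""
    a, b, c, d, e, f = tm
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)


# Assumed example: 0.8 0 0 0.8 100 100 Tm, then 200 200 Td
tm = apply_tm(0.8, 0, 0, 0.8, 100, 100)
tm = apply_td(tm, 200, 200)
x, y = tm[4], tm[5]
print(x, y)  # 200 x 0.8 + 100 = 260 on each axis
```

Keeping all six values also means rotated or skewed text matrices (non-zero `b` and `c`) fall out of the same formula, rather than needing a special case.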