Closed DominikDostal closed 4 months ago
It seems that i will have to test it with the documents from the unit tests and see why the result differs.
Will do that when i find the time
So the problem was that i also needed to implement the q and Q commands to save/load the graphics state to/from the stack.
The values are getting closer, but because of math inaccuracy they still differ:
2) PHPUnitTests\Integration\PageTest::testGetDataTm
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
Array (
- 0 => '0.999429'
- 1 => '0'
- 2 => '0'
- 3 => '1'
- 4 => '201.96'
- 5 => '720.68'
+ 0 => 0.9994286002284
+ 1 => 0.0
+ 2 => 0.0
+ 3 => 0.9999996
+ 4 => 201.959919216
+ 5 => 720.6797117279999
)
The question is how i should handle this: 1) Use the new float values as the expected result 2) Try to change how the result is returned (round, convert to string)
Can anyone give me some advice here?
@DominikDostal thank you for your effort. I don't have much time right now, but will see that I can back to you soon.
Please provide reference(s) to the PDF specification you are referring to or on which your code is based on.
The question is how i should handle this:
- Use the new float values as the expected result
- Try to change how the result is returned (round, convert to string) Can anyone give me some advice here?
I am not sure if I got your point. At first glance, it seems to be an improvement to have more precise float values instead of rather less precise strings. Rounding should be avoided though, because it may interfere/alter peoples results.
Please provide reference(s) to the PDF specification you are referring to or on which your code is based on.
I based it on https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf Chapter 8.4.4 Graphics State Operators.
I am not sure if I got your point. At first glance, it seems to be an improvement to have more precise float values instead of rather less precise strings. Rounding should be avoided though, because it may interfere/alter peoples results.
I guess the document used for some of the PageTests is a special case:
Document1_pdfcreator_nocompressed.pdf
text positions are extracted without using the concatenating matrices inside.
So why were the coordinates correct?
Because it uses 2 different concatenating matrices:
When you multiply them, the result is: [0.9999996 0 0 0.9999996 0 0] which is very close to the matrix that changes nothing when multiplied with: [1 0 0 1 0 0]
Thats why im not sure if not rounding is a good approach, because it looks to me like it was supposed to be a periodic number, but was cut short because of rounding.
This is btw how the calculation in the first part of the document looks like: The CTM (Current Transformation Matrix) starts as [1 0 0 1 0 0])
q - Saved CTM to stack: [1 0 0 1 0 0]
cm - Adding concat matrix: [0.12 0 0 0.12 0 0] - CTM * cm = [0.12 0 0 0.12 0 0]
q - Saved CTM to stack: [0.12 0 0 0.12 0 0]
cm - Adding concat matrix: [8.33333 0 0 8.33333 0 0] - CTM * cm = [0.9999996 0 0 0.9999996 0 0]
BT
Tf
Tm - Text Matrix: [0.999429 0 0 1 201.96 720.68] - Tm * CTM = [0.9994286002284 0 0 0.9999996 201.959919216 720.679711728]
TJ - Text: Document title
ET
Q - Loaded CTM from stack: [0.12 0 0 0.12 0 0]
I already went through all of this with the PDFObject parsing to account for cm
commands, so I know how you feel! getDataTm
is a completely separate parsing function that also parses the document stream data so my changes didn't touch that bit.
I really wonder if in the future we can add a global document stream parser that generates the data required by both functions/objects and thus obsoletes both of these.
I already went through all of this with the PDFObject parsing to account for
cm
commands, so I know how you feel!getDataTm
is a completely separate parsing function that also parses the document stream data so my changes didn't touch that bit.I really wonder if in the future we can add a global document stream parser that generates the data required by both functions/objects and thus obsoletes both of these.
Oh i see. But... isnt it wrong to just overwrite the old concatenating matrix when the cm command comes along? In my code i added them together (or rather multiplied them together) and at least in the test-pdf that is already in the repository it got the correct result. https://github.com/smalot/pdfparser/blob/a3e213d4c656fb837f3ec329f5f0c0ef887d1afc/src/Smalot/PdfParser/PDFObject.php#L769
Oh i see. But... isnt it wrong to just overwrite the old concatenating matrix when the cm command comes along?
You're right, of course, and I should add that. But my end didn't have to worry about also storing the exact positioning matrix along with each bit of text. It only needed to know did it move enough to warrant adding a line-break. :)
@DominikDostal I am not sure, where we stand here. Please tell me which points are open to discuss and what you need to finalize the PR.
Thats why im not sure if not rounding is a good approach, because it looks to me like it was supposed to be a periodic number, but was cut short because of rounding.
Well, in general I would keep as much data as provided. Sure in this case, rounding seems obvious but it may have different ramifications in other cases. Therefore my statement about keep raw data. Developers can round the values themselves if they want.
Can you provide a test which fails without your changes?
@DominikDostal Please give us an update how you want to proceed here and cover the remaining points.
@k00ni Im sorry about not being active here, I was sick in between and had to catch up with work (and then forgot 😞 )
I pushed a commit that adds a new document to test CM Commands, but i also had to fix some existing tests (since now we get more decimal numbers than before). I wasnt sure if rounding the results or changing the expected results was better.
I also had to add an error margin of 0.01 to the getTextXY
tests because of the new results.
Can I assume that your new test fails when using a prior version without your changes?
Yes
The results will be "slightly" different:
There was 1 failure:
1) PHPUnitTests\Integration\PageTest::testCmCommandInPdfs
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
Array (
- 0 => '0.75'
- 1 => '0.0'
- 2 => '0.0'
- 3 => '0.75'
- 4 => '59.16'
- 5 => '500.4'
+ 0 => 1.0
+ 1 => 0.0
+ 2 => 0.0
+ 3 => -1.0
+ 4 => 78.88
+ 5 => 126.56
)
D:\source\pdfparser\tests\PHPUnit\Integration\PageTest.php:957
@DominikDostal Is there anything left (to discuss)?
Type of pull request
About
When printing a landscape word document as pdf, it will be created with a concatenation matrix (cm command) which is completely ignored by the pdfparser. This PR aims to add it.
I tested this with this document: This is just a test - Printed As.pdf
With this code:
Before i got these results for the text:
After my change i get the following result:
If i open the pdf in adobe acrobat reader and measure the distances:
Converted to PDF units: ~20.02mm 2.83465 = ~56,75 ~176.06mm 2.83465 = ~499 (deviation is because i didnt measure it precise enough in adobe)