smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

Implement missing cm command #720

Open DominikDostal opened 3 weeks ago

DominikDostal commented 3 weeks ago

Type of pull request

About

When printing a landscape word document as pdf, it will be created with a concatenation matrix (cm command) which is completely ignored by the pdfparser. This PR aims to add it.

I tested this with this document: This is just a test - Printed As.pdf

With this code:

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseContent(file_get_contents("C:\Users\dd\Desktop\This is just a test - Printed As.pdf"));
$pages = $pdf->getPages();
foreach($pages as $page) {
    $pageData = $page->getDataTm();
    var_dump('<pre>', $pageData, '</pre>');
}

Before i got these results for the text:

  array(2) {
    [0]=>
    array(6) {
      [0]=>
      string(1) "1"
      [1]=>
      string(1) "0"
      [2]=>
      string(8) "0.000000"
      [3]=>
      string(2) "-1"
      [4]=>
      string(9) "78.879997"
      [5]=>
      string(10) "126.559998"
    }
    [1]=>
    string(20) "This is just a test "
  }

After my change i get the following result:

  array(2) {
    [0]=>
    array(6) {
      [0]=>
      float(0.75)
      [1]=>
      float(0)
      [2]=>
      float(0)
      [3]=>
      float(0.75)
      [4]=>
      float(59.159997750000002)
      [5]=>
      float(500.40000850000001)
    }
    [1]=>
    string(20) "This is just a test "
  }

If i open the pdf in adobe acrobat reader and measure the distances: image

Converted to PDF units: ~20.02mm 2.83465 = ~56,75 ~176.06mm 2.83465 = ~499 (deviation is because i didnt measure it precise enough in adobe)

DominikDostal commented 3 weeks ago

It seems that i will have to test it with the documents from the unit tests and see why the result differs.

Will do that when i find the time

DominikDostal commented 3 weeks ago

So the problem was that i also needed to implement the q and Q commands to save/load the graphics state to/from the stack.

The values are getting closer, but because of math inaccuracy they still differ:

2) PHPUnitTests\Integration\PageTest::testGetDataTm
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
-    0 => '0.999429'
-    1 => '0'
-    2 => '0'
-    3 => '1'
-    4 => '201.96'
-    5 => '720.68'
+    0 => 0.9994286002284
+    1 => 0.0
+    2 => 0.0
+    3 => 0.9999996
+    4 => 201.959919216
+    5 => 720.6797117279999
 )

The question is how i should handle this: 1) Use the new float values as the expected result 2) Try to change how the result is returned (round, convert to string)

Can anyone give me some advice here?

k00ni commented 3 weeks ago

@DominikDostal thank you for your effort. I don't have much time right now, but will see that I can back to you soon.

Please provide reference(s) to the PDF specification you are referring to or on which your code is based on.

The question is how i should handle this:

  1. Use the new float values as the expected result
  2. Try to change how the result is returned (round, convert to string) Can anyone give me some advice here?

I am not sure if I got your point. At first glance, it seems to be an improvement to have more precise float values instead of rather less precise strings. Rounding should be avoided though, because it may interfere/alter peoples results.

DominikDostal commented 3 weeks ago

Please provide reference(s) to the PDF specification you are referring to or on which your code is based on.

I based it on https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf Chapter 8.4.4 Graphics State Operators. image

I am not sure if I got your point. At first glance, it seems to be an improvement to have more precise float values instead of rather less precise strings. Rounding should be avoided though, because it may interfere/alter peoples results.

I guess the document used for some of the PageTests is a special case: Document1_pdfcreator_nocompressed.pdf text positions are extracted without using the concatenating matrices inside. So why were the coordinates correct? Because it uses 2 different concatenating matrices:

When you multiply them, the result is: [0.9999996 0 0 0.9999996 0 0] which is very close to the matrix that changes nothing when multiplied with: [1 0 0 1 0 0]

Thats why im not sure if not rounding is a good approach, because it looks to me like it was supposed to be a periodic number, but was cut short because of rounding.

This is btw how the calculation in the first part of the document looks like: The CTM (Current Transformation Matrix) starts as [1 0 0 1 0 0])

q - Saved CTM to stack: [1 0 0 1 0 0]
cm - Adding concat matrix: [0.12 0 0 0.12 0 0] - CTM * cm = [0.12 0 0 0.12 0 0]
q - Saved CTM to stack: [0.12 0 0 0.12 0 0]
cm - Adding concat matrix: [8.33333 0 0 8.33333 0 0] - CTM * cm = [0.9999996 0 0 0.9999996 0 0]
BT
Tf
Tm - Text Matrix: [0.999429 0 0 1 201.96 720.68] - Tm * CTM = [0.9994286002284 0 0 0.9999996 201.959919216 720.679711728]
TJ - Text: Document title
ET
Q - Loaded CTM from stack: [0.12 0 0 0.12 0 0]
GreyWyvern commented 3 weeks ago

I already went through all of this with the PDFObject parsing to account for cm commands, so I know how you feel! getDataTm is a completely separate parsing function that also parses the document stream data so my changes didn't touch that bit.

I really wonder if in the future we can add a global document stream parser that generates the data required by both functions/objects and thus obsoletes both of these.

DominikDostal commented 3 weeks ago

I already went through all of this with the PDFObject parsing to account for cm commands, so I know how you feel! getDataTm is a completely separate parsing function that also parses the document stream data so my changes didn't touch that bit.

I really wonder if in the future we can add a global document stream parser that generates the data required by both functions/objects and thus obsoletes both of these.

Oh i see. But... isnt it wrong to just overwrite the old concatenating matrix when the cm command comes along? In my code i added them together (or rather multiplied them together) and at least in the test-pdf that is already in the repository it got the correct result. https://github.com/smalot/pdfparser/blob/a3e213d4c656fb837f3ec329f5f0c0ef887d1afc/src/Smalot/PdfParser/PDFObject.php#L769

GreyWyvern commented 3 weeks ago

Oh i see. But... isnt it wrong to just overwrite the old concatenating matrix when the cm command comes along?

You're right, of course, and I should add that. But my end didn't have to worry about also storing the exact positioning matrix along with each bit of text. It only needed to know did it move enough to warrant adding a line-break. :)

k00ni commented 1 week ago

@DominikDostal I am not sure, where we stand here. Please tell me which points are open to discuss and what you need to finalize the PR.

Thats why im not sure if not rounding is a good approach, because it looks to me like it was supposed to be a periodic number, but was cut short because of rounding.

Well, in general I would keep as much data as provided. Sure in this case, rounding seems obvious but it may have different ramifications in other cases. Therefore my statement about keep raw data. Developers can round the values themselves if they want.

Can you provide a test which fails without your changes?