smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Rect value for some PDF form fields doesn't contain floating point numbers but rather strings like #Obj#32_0 #738

Open prescriptionlifeline opened 1 month ago

prescriptionlifeline commented 1 month ago

Description:

The attached PDF contains three form fields. The Rect key for one of them contains four floating-point numbers, presumably the coordinates of that field on the page. For the other two fields, however, Rect contains strings such as #Obj#32_0 through #Obj#35_0. I don't know what these strings mean, and if there is a way to get coordinates from them, it is unclear to me what that method would be.

PDF input

test.pdf

Expected output

Array
(
    [P] => Array
        (
            [Type] => Page
            [Rotate] => 0
        )

    [T] => incomeName1
    [Rect] => Array
        (
            [0] => 205.35
            [1] => 717.601
            [2] => 540.298
            [3] => 740.74
        )

    [F] => 4
    [Type] => Annot
    [Subtype] => Widget
    [DA] => /Helv 12 Tf 0 g
    [MK] => Array
        (
        )

    [FT] => Tx
)

Actual output

Array
(
    [P] => Array
        (
            [Type] => Page
            [Rotate] => 0
        )

    [T] => PatientsIncomeSS
    [Rect] => Array
        (
            [0] => #Obj#32_0
            [1] => #Obj#33_0
            [2] => #Obj#34_0
            [3] => #Obj#35_0
        )

    [F] => 4
    [Type] => Annot
    [Subtype] => Widget
    [DA] => /Helv 12 Tf 0 g
    [MK] => Array
        (
        )

    [FT] => Tx
)

Code

<?php
// Load Composer's autoloader; require stops with a fatal error if the file is missing.
require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('test.pdf');

$objects = $pdf->getObjects();
foreach ($objects as $obj) {
    print_r($obj->getDetails());
}
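The placeholder strings look like references to other PDF objects (object number and generation number joined by an underscore), so one possible workaround is to pull the ID out of each string and look the value up separately. The sketch below is only an illustration under that assumption: `objRefToId` and `resolveRect` are hypothetical helpers, not part of smalot/pdfparser, and the lookup table holds made-up numbers rather than values from test.pdf.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): extracts the object ID
// ("32_0") from a placeholder string such as "#Obj#32_0".
// Returns null when the value is not a placeholder.
function objRefToId(string $value): ?string
{
    return preg_match('/^#Obj#(\d+_\d+)$/', $value, $m) === 1 ? $m[1] : null;
}

// Hypothetical helper: replaces placeholders in a Rect array using a
// caller-supplied lookup. $lookup receives an ID like "32_0" and returns
// the resolved number, or null if the ID is unknown.
function resolveRect(array $rect, callable $lookup): array
{
    return array_map(function ($value) use ($lookup) {
        $id = is_string($value) ? objRefToId($value) : null;
        if ($id === null) {
            return $value; // already a plain value, leave it alone
        }
        $resolved = $lookup($id);
        return $resolved ?? $value; // keep the placeholder when lookup fails
    }, $rect);
}

// Stand-in lookup table for illustration only; these numbers are made up.
$fake = ['32_0' => 10.0, '33_0' => 20.0, '34_0' => 110.0, '35_0' => 40.0];

$rect = resolveRect(
    ['#Obj#32_0', '#Obj#33_0', '#Obj#34_0', '#Obj#35_0'],
    function (string $id) use ($fake) {
        return $fake[$id] ?? null;
    }
);
print_r($rect);
```

With the real library, the lookup could presumably call `$pdf->getObjectById($id)` and read the number from the returned object, but I have not verified how (or whether) the parser exposes the value of an object that holds only a number, so treat that part as an open question.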