smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Issue loading pdf generated from FPDI #472

Open eddturtle opened 3 years ago

eddturtle commented 3 years ago

I'm having some difficulty loading a PDF with dev-master, which I'm using at the moment because I need the code change from #450.

The PDF was created with FPDI.

When calling getDetails() I'd expect to be able to get the MediaBox of the page, but it's not there any more.

use Smalot\PdfParser\Parser;

// Parse the FPDI-generated file and dump each page's details.
$parser = new Parser();
$pdf = $parser->parseFile(__DIR__.'/../../test2.pdf');
$pages = $pdf->getPages();
foreach ($pages as $i => $page) {
    $details = $page->getDetails();
    var_dump($details);
}

Any help would be amazing.

PDF used: test2.pdf

Output of getDetails:

array(5) {
  ["Type"]=>
  string(4) "Page"
  ["Parent"]=>
  array(2) {
    ["Type"]=>
    string(5) "Pages"
    ["Count"]=>
    string(1) "1"
  }
  ["Resources"]=>
  array(0) {
  }
  ["Group"]=>
  array(3) {
    ["Type"]=>
    string(5) "Group"
    ["S"]=>
    string(12) "Transparency"
    ["CS"]=>
    string(9) "DeviceRGB"
  }
  ["Contents"]=>
  array(2) {
    ["Filter"]=>
    string(11) "FlateDecode"
    ["Length"]=>
    string(3) "106"
  }
}

izabala commented 3 years ago

Hi, I believe I can help with this one. First, make sure you are using the latest code, because I created a workaround/fix so that Page::getDataTm(), Page::getTextXY(), and related methods work with documents generated with FPDI/FPDF (issue #454, fixed by pull request #455).

If you have that code, you now have some new methods. You can use Page::createPageForFpdf() or Page::getPDFObjectForFpdf() to get a Page (or a PDFObject, depending on the method you use) on which you can call getDetails(). Note that you will get a BBox instead of a MediaBox, but I believe you can use it. (I don't know why FPDI does that, but it was the same with the problem we corrected in the fix/workaround.)

By the way, if you need to know whether a document was generated with FPDI/FPDF, there is another method, Page::isFpdf(), that returns true in that case, so you can add a conditional to get either the MediaBox or the BBox info. I ran this test:

use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile("c:/xampp/htdocs/pdfBug/TEST2.pdf");
$pages = $pdf->getPages();
$page = $pages[0];
// Get the page variant intended for FPDI/FPDF documents, then read its details.
$newPage = $page->createPageForFpdf();
$details = $newPage->getDetails();
print_r($details);

Output of getDetails():

Array
(
    [Type] => XObject
    [Subtype] => Form
    [FormType] => 1
    [BBox] => Array
        (
            [0] => 0
            [1] => 0
            [2] => 595.28
            [3] => 841.89
        )
    [Group] => Array
        (
            [Type] => Group
            [S] => Transparency
        )
    [Resources] => Array
        (
            [ProcSet] => Array
                (
                    [0] => PDF
                    [1] => Text
                )
            [Font] => Array
                (
                    [F1] => Array
                        (
                            [Name] => Helvetica
                            [Type] => Type1
                            [Encoding] => WinAnsiEncoding
                            [Subtype] => Type1
                            [BaseFont] => Helvetica
                        )
                    [F2] => Array
                        (
                            [Name] => DroidSans-Bold
                            [Type] => Type0
                            [Encoding] => Identity-H
                            [Subtype] => Type0
                            [BaseFont] => DroidSans-Bold
                        )
                )
        )
    [Filter] => FlateDecode
    [Length] => 167
)

Please let me know if using these methods, with the BBox info instead of the MediaBox, helps you.
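
In case it is useful, here is a minimal sketch of how the page size could be read from a getDetails() array whether it carries a MediaBox or a BBox. The helper names findPageBox() and boxSize() are hypothetical (they are not part of pdfparser), and the sample array is copied from the output above. It assumes the box is a four-element array of numeric values.

```php
<?php

/**
 * Hypothetical helper (not part of smalot/pdfparser): return the page
 * rectangle from a getDetails() array, preferring MediaBox and falling
 * back to BBox, the key seen above for FPDI/FPDF documents.
 *
 * @return array|null ['box' => [x0, y0, x1, y1], 'source' => key name]
 */
function findPageBox(array $details): ?array
{
    foreach (['MediaBox', 'BBox'] as $key) {
        if (isset($details[$key]) && is_array($details[$key]) && count($details[$key]) === 4) {
            return [
                // Values may arrive as strings, so normalise them to floats.
                'box' => array_map('floatval', array_values($details[$key])),
                'source' => $key,
            ];
        }
    }

    return null; // Neither key is present, as in the first dump in this thread.
}

/** Width and height in PDF points from an [x0, y0, x1, y1] rectangle. */
function boxSize(array $box): array
{
    return [
        'width' => abs($box[2] - $box[0]),
        'height' => abs($box[3] - $box[1]),
    ];
}

// Sample data copied from the print_r() output in this thread.
$fpdfDetails = [
    'Type' => 'XObject',
    'Subtype' => 'Form',
    'BBox' => ['0', '0', '595.28', '841.89'],
];

$found = findPageBox($fpdfDetails);
if ($found !== null) {
    $size = boxSize($found['box']);
    printf("%s: %.2f x %.2f pt\n", $found['source'], $size['width'], $size['height']);
}
```

With the library loaded, the array passed in would presumably come from $page->createPageForFpdf()->getDetails() when $page->isFpdf() is true, and from $page->getDetails() otherwise, as described above. Checking for MediaBox first keeps the same helper usable for PDFs that were not produced by FPDI/FPDF.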