smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

getDataTm returns empty array for one page only #717

Open mrasmith opened 1 month ago

mrasmith commented 1 month ago

Description:

I have a multi-page PDF (about 90 pages). Every page contains a table of similar data, and all of them parse with getDataTm without issue except one: on the problem page, getDataTm returns an empty array. I've attached two pages of the document; the second page is the one that won't parse.

PDF input

rearranged.pdf

Expected output & actual output

result from page 1:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => 1
                    [1] => 0
                    [2] => 0.000000
                    [3] => -1
                    [4] => 377.119995
                    [5] => 33.279999
                )

            [1] => SNAPSHOT - MERLIN CAR AUCTIONS - JANUARY TO APRIL2024
        )

etc

result from page 2: Array ()

Code

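require 'vendor/autoload.php'; // assumes the library was installed with Composer

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('rearranged.pdf');

// getPages() is zero-indexed, so [1] is the second (problem) page
$data = $pdf->getPages()[1]->getDataTm();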
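
To narrow the problem down across the full 90-page document, it may help to list every page where getDataTm() comes back empty and check whether getText() still finds text on that page. The following is a minimal sketch, not part of the library: `findPagesWithEmptyDataTm` is a hypothetical helper name, and it assumes only that each page object exposes `getDataTm()` and `getText()` as `Smalot\PdfParser\Page` does.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser).
// Returns [zero-based page index => length of trimmed getText() output]
// for every page whose getDataTm() result is an empty array.
function findPagesWithEmptyDataTm(array $pages): array
{
    $report = [];
    foreach ($pages as $index => $page) {
        if (count($page->getDataTm()) === 0) {
            // A non-zero length here suggests the page has text that
            // getText() can see but getDataTm() cannot.
            $report[$index] = strlen(trim($page->getText()));
        }
    }
    return $report;
}

// Usage with the library (assumes a Composer install):
// require 'vendor/autoload.php';
// $pdf = (new \Smalot\PdfParser\Parser())->parseFile('rearranged.pdf');
// print_r(findPagesWithEmptyDataTm($pdf->getPages()));
```

If a page shows up in the report with a non-zero text length, the text is being extracted but not its positioning data, which would point at getDataTm rather than at the PDF being unreadable.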