smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

getDataTm returns empty array for one page only #717

Open mrasmith opened 5 months ago

mrasmith commented 5 months ago

Description:

I have a multi-page PDF (about 90 pages). All pages contain tables of similar data, and all parse with getDataTm without issue except one: getDataTm returns an empty array on the problem page. I've attached two pages of the document; the second page is the one that won't parse.

PDF input

rearranged.pdf

Expected output & actual output

result from page 1:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => 1
                    [1] => 0
                    [2] => 0.000000
                    [3] => -1
                    [4] => 377.119995
                    [5] => 33.279999
                )

            [1] => SNAPSHOT - MERLIN CAR AUCTIONS - JANUARY TO APRIL2024
        )

etc

result from page 2: Array ()

Code

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('rearranged.pdf');

// getPages() is zero-indexed, so [1] is the second page (the one that fails)
$data = $pdf->getPages()[1]->getDataTm();

FredWolk commented 3 months ago

I have a similar problem. I can't get the text from one page out of 10, and this happens in different files. Some files parse normally; others have this problem.

kostjerry commented 3 months ago

+1 to the problem
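
For anyone trying to narrow down which pages are affected in a larger file, a small helper along these lines might be useful. This is only a sketch: `findEmptyDataTmPages` is a hypothetical name, not part of smalot/pdfparser, and it assumes nothing more than that each page object exposes the `getDataTm()` method used in the report above. It does not diagnose why a page comes back empty; it only lists the zero-based indexes of the pages that do.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): returns the zero-based
// indexes of the pages whose getDataTm() result is an empty array.
function findEmptyDataTmPages(array $pages): array
{
    $emptyIndexes = [];

    foreach ($pages as $index => $page) {
        // Same call as in the original report; an empty array means no
        // positioned text was extracted for this page.
        if (count($page->getDataTm()) === 0) {
            $emptyIndexes[] = $index;
        }
    }

    return $emptyIndexes;
}
```

With the library loaded, it could be called as `findEmptyDataTmPages($pdf->getPages())`. Going by the report above, the attached two-page file would presumably yield index 1 only, though that has not been verified here.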