smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

gettext empty result #652

Open bigmoney99 opened 10 months ago

bigmoney99 commented 10 months ago

Hello, I want to extract the text from this PDF, but the result is empty: https://www.mediafire.com/file/azb7yddqo2ry55j/123.pdf/file

This is my code:

$parser = new \Smalot\PdfParser\Parser(); // Parse the PDF file using the Parser library
$pdf = $parser->parseFile($file);
$metaData = $pdf->getDetails();
print_r($metaData);
$pages = $pdf->getPages();
foreach ($pages as $page) {
    $text = $page->getText();
    echo "<div>".$text."</div>";
}
echo $file;

The result is just:

Array
(
    [Producer] => cairo 1.17.4 (https://cairographics.org
    [Pages] => 1
)
<div></div>D:\web\D\public\pdf_po/123.pdf

GreyWyvern commented 10 months ago

The issue appears in both 2.7.0 and 2.8.0rc. For some reason, no text content sections are found and passed to formatContent() for parsing. The text is selectable in a PDF reader, so the file does contain text. More research is needed.
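To narrow down reports like this, it may help to check which pages come back empty before digging into formatContent(). The following is only a rough sketch: findEmptyPages() is a hypothetical helper, not part of smalot/pdfparser, and the file path is an assumption. It uses only the parseFile(), getPages() and getText() calls already shown in this thread.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): returns the indexes
// of pages whose extracted text is empty or whitespace-only.
function findEmptyPages(array $pageTexts): array
{
    $empty = [];
    foreach ($pageTexts as $index => $text) {
        if (trim((string) $text) === '') {
            $empty[] = $index;
        }
    }
    return $empty;
}

// Usage with the library. Assumptions: the Composer autoloader has already
// been loaded (e.g. require 'vendor/autoload.php') and $file points at the
// PDF under test. The block is skipped if either is missing.
$file = __DIR__ . '/123.pdf';
if (class_exists(\Smalot\PdfParser\Parser::class) && is_file($file)) {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($file);

    // Collect the extracted text of every page.
    $texts = [];
    foreach ($pdf->getPages() as $page) {
        $texts[] = $page->getText();
    }

    $emptyPages = findEmptyPages($texts);
    printf("%d of %d page(s) returned no text\n", count($emptyPages), count($texts));
}
```

If every page is reported as empty while the text is selectable in a PDF reader, that matches the symptom described above and suggests the problem lies in how the content streams are located rather than in the calling code.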

ADS971 commented 4 months ago

Hello, I have the same problem with this PDF file: https://www.ipgp.fr/wp-content/uploads/2024/05/OVSG20240508_RessTecto_Guadeloupe.pdf

My code:

$parser = new \Smalot\PdfParser\Parser(); // Parse the PDF file using the Parser library
$pdf = $parser->parseFile($file);
$metaData = $pdf->getDetails();
print_r($metaData);
$text = $pdf->getPages()[0]->getText(); // store the page text so it can be echoed
echo "<div>".$text."</div>";

The result:

Array
(
    [Producer] => cairo 1.17.4 (https://cairographics.org
    [Pages] => 1
)