smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

gettext empty result #652

Open bigmoney99 opened 1 year ago

bigmoney99 commented 1 year ago

Hello, I want to extract the text from this PDF, but the result is empty: https://www.mediafire.com/file/azb7yddqo2ry55j/123.pdf/file

This is my code:

$parser = new \Smalot\PdfParser\Parser(); // Parse the PDF file using the Parser library
$pdf = $parser->parseFile($file);
$metaData = $pdf->getDetails();
print_r($metaData);
$pages = $pdf->getPages();
foreach ($pages as $page) {
    $text = $page->getText();
    echo "<div>" . $text . "</div>";
}
echo $file;

The result is just:

Array
(
    [Producer] => cairo 1.17.4 (https://cairographics.org
    [Pages] => 1
)
<div></div>D:\web\D\public\pdf_po/123.pdf

GreyWyvern commented 1 year ago

The issue appears in both 2.7.0 and 2.8.0rc. For some reason, no text content sections are found and passed to formatContent() for parsing. The text is selectable in a PDF reader, so the file does contain text. More research is needed.
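One way to narrow this down is to check, independently of the library, whether the file's streams contain text operators (`BT` ... `ET`) at all. The following is a minimal sketch, not part of smalot/pdfparser: `countTextStreams` is a hypothetical helper, and it assumes plain FlateDecode (zlib) streams with no encryption or other filters. It needs PHP's zlib extension for `gzuncompress()`.

```php
<?php
// Hypothetical diagnostic helper; NOT part of smalot/pdfparser.
// Assumes FlateDecode (zlib) streams, no encryption, no other filters.

function countTextStreams(string $pdfBytes): array
{
    $total = 0;
    $withText = 0;
    $offset = 0;

    while (($start = strpos($pdfBytes, 'stream', $offset)) !== false) {
        $dataStart = $start + strlen('stream');
        $end = strpos($pdfBytes, 'endstream', $dataStart);
        if ($end === false) {
            break; // unterminated stream; stop scanning
        }
        $raw = substr($pdfBytes, $dataStart, $end - $dataStart);

        // Drop the end-of-line marker that follows the "stream" keyword.
        if (strncmp($raw, "\r\n", 2) === 0) {
            $raw = substr($raw, 2);
        } elseif (strncmp($raw, "\n", 1) === 0) {
            $raw = substr($raw, 1);
        }

        // An optional EOL may precede "endstream", so try the data as-is
        // and with one or two trailing bytes removed.
        $decoded = false;
        foreach ([0, 1, 2] as $trim) {
            $candidate = $trim === 0 ? $raw : substr($raw, 0, -$trim);
            $decoded = @gzuncompress($candidate);
            if ($decoded !== false) {
                break;
            }
        }
        if ($decoded === false) {
            $decoded = $raw; // treat as an uncompressed stream
        }

        $total++;
        // A text object starts with BT and ends with ET.
        if (preg_match('/\bBT\b.*?\bET\b/s', $decoded) === 1) {
            $withText++;
        }
        $offset = $end + strlen('endstream');
    }

    return ['streams' => $total, 'with_text' => $withText];
}
```

Calling `print_r(countTextStreams(file_get_contents($file)));` on the affected PDF would then show how many streams were scanned and how many appear to hold text objects. A non-zero `with_text` would suggest the text operators are present and the problem lies in how the parser locates them; zero is inconclusive, since the text could sit behind a filter this sketch does not handle.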

ADS971 commented 6 months ago

Hello, I have the same problem with this PDF file: https://www.ipgp.fr/wp-content/uploads/2024/05/OVSG20240508_RessTecto_Guadeloupe.pdf

My code:

$parser = new \Smalot\PdfParser\Parser(); // Parse the PDF file using the Parser library
$pdf = $parser->parseFile($file);
$metaData = $pdf->getDetails();
print_r($metaData);
$text = $pdf->getPages()[0]->getText(); // assign the result so the echo below has something to print
echo "<div>" . $text . "</div>";

The result:

Array
(
    [Producer] => cairo 1.17.4 (https://cairographics.org
    [Pages] => 1
)