smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Incorrect parsing, get empty text #658

Open ishowshao opened 11 months ago

ishowshao commented 11 months ago

Description:

PDF input

https://cdn.yinyuezhushou.com/static/7d38770d31c3cd66219eaa1b7959e2dd.pdf

Expected output & actual output

Expected output: the text contained in the file.

Actual output: empty text.

Code

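try {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($path);
    $text = $pdf->getText();
    // Strip all whitespace from the extracted text.
    return preg_replace('/\s+/', '', $text);
} catch (\Exception $e) {
    // Fully qualified so the catch also matches inside a namespace.
    // Parse failures are logged and reported as an empty string.
    $logger = self::getLogger('pdf2text');
    $logger->warning($e->getMessage(), ['path' => $path]);
    return '';
}

k00ni commented 11 months ago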

Please try again with 2.8.0-RC2 and get back to us.

linayu commented 11 months ago

I also encountered the same problem when reading Chinese text. With PDF version 1.4 files, Chinese characters can be read.

PDF version: 1.6 (Acrobat 7.x)
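
Since the report above ties the behaviour to the PDF version, it may help to confirm which version a given file declares before comparing results. The following is a minimal sketch in plain PHP, not part of smalot/pdfparser; the function name `pdfHeaderVersion` is hypothetical. It only reads the `%PDF-x.y` header at the start of the file, so it assumes the header is present within the first kilobyte and does not account for a `/Version` entry in the document catalog, which can override the header.

```php
<?php
// Hypothetical helper (not a smalot/pdfparser API): returns the version
// declared in the PDF header, e.g. "%PDF-1.6" yields "1.6".
function pdfHeaderVersion(string $path): ?string
{
    // The header normally sits at the very start of the file; read a small
    // window in case a few stray bytes precede it.
    $head = @file_get_contents($path, false, null, 0, 1024);
    if ($head === false) {
        return null; // file missing or unreadable
    }
    if (preg_match('/%PDF-(\d+\.\d+)/', $head, $m) === 1) {
        return $m[1];
    }
    return null; // no PDF header found in the window that was read
}
```

Calling `pdfHeaderVersion($path)` on both a working and a failing file, where `$path` is whatever path is passed to `parseFile()`, shows whether the two files actually declare different versions.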