smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

MS PDF Printer Chrome 1.7: getText() results in empty text #696

Closed pud-micha closed 3 months ago

pud-micha commented 3 months ago

Description:

Any website "printed to PDF" with the MS PDF Printer / "save as pdf" produces a PDF 1.7 file, like the attached file: test.pdf

$parser->parseFile('website.pdf')->getText() returns an empty string.

PDF input

Print any website; the content does not matter. Windows 10, Chrome 123.0.6312.86.

Expected output & actual output

Expected: some text. Actual: an empty string.

Code

$parser = new \Smalot\PdfParser\Parser();
echo $parser->parseFile('website.pdf')->getText();

GreyWyvern commented 3 months ago

There is no text in the sample document. Looks like the page was saved as an image; OCR would be required to read this.
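One way to check for this case before reaching for OCR is sketched below. It is only an illustration: `classifyExtraction()` and `diagnosePdf()` are hypothetical helper names, not part of the library, and the sketch assumes the library is installed through Composer and that image content shows up as `XObject` objects of subtype `Image`. An empty `getText()` result combined with at least one image suggests a page that was saved as a picture.

```php
<?php
// Sketch only: the two helper functions are hypothetical, not library API.
// Assumes smalot/pdfparser is installed via Composer and autoloaded, e.g.:
// require __DIR__ . '/vendor/autoload.php';

use Smalot\PdfParser\Parser;

/**
 * Classify an extraction result from the extracted text and the number
 * of image XObjects found in the document.
 */
function classifyExtraction(string $text, int $imageCount): string
{
    if (trim($text) !== '') {
        return 'text';        // the PDF carries real text
    }

    // No text at all: images point to a rasterized page that needs OCR.
    return $imageCount > 0 ? 'image-only' : 'empty';
}

/**
 * Parse a file and classify it. Needs the library and a readable PDF.
 */
function diagnosePdf(string $path): string
{
    $pdf = (new Parser())->parseFile($path);
    $images = $pdf->getObjectsByType('XObject', 'Image');

    return classifyExtraction($pdf->getText(), count($images));
}

// Usage (requires the library and a real file):
// echo diagnosePdf('website.pdf'), PHP_EOL;
```

For a file like the sample in this issue, a result of `image-only` would be consistent with the diagnosis above; a result of `empty` would point to some other cause.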

pud-micha commented 3 months ago

I hadn't realized that some PDF printers do not keep the text from the HTML as text. Thank you :) Now I have a solution: use "print as PDF" and don't use the MS PDF printer.