smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Parsed output from PDF Identity-h/CID-Fonts is not readable #534

Open seven21 opened 2 years ago

seven21 commented 2 years ago

Description:

I have two PDFs from my bank account. One uses ANSI fonts; for the past two months the bank has been using Identity-H CID fonts. The PDF Parser no longer returns readable text for these. Is there any config/option to parse text that uses this font?

PDF input

Sorry, not possible, as it is a business bank statement.

Expected output & actual output

The first two lines with the expected output from older PDFs look like this: Postbank Card Service Hamburg

Now it looks like this: 3RVWEDQN &DUG6HUYLFH+DPEXUJ
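
Comparing the two sample strings above byte by byte suggests a pattern: every garbled character is exactly 29 code points below the expected one (for example `P` is 80 and `3` is 51). The sketch below only illustrates that observation on this one sample. `shiftGlyphCodes` is a hypothetical helper, not part of smalot/pdfparser, and the offset of 29 is an assumption derived from this sample alone; other fonts or PDFs may map their codes differently, so this is not a general fix.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of smalot/pdfparser).
 * Shifts every printable, non-space ASCII byte upward by a fixed offset
 * and leaves all other bytes untouched.
 * The default offset of 29 is an assumption taken from the sample above.
 */
function shiftGlyphCodes(string $garbled, int $offset = 29): string
{
    $result = '';
    foreach (str_split($garbled) as $char) {
        $code = ord($char);
        $shifted = $code + $offset;
        // Shift only printable non-space ASCII that stays printable afterwards.
        if ($code > 0x20 && $code < 0x7F && $shifted < 0x7F) {
            $result .= chr($shifted);
        } else {
            $result .= $char;
        }
    }

    return $result;
}

echo shiftGlyphCodes('3RVWEDQN &DUG6HUYLFH+DPEXUJ'), PHP_EOL;
// prints "Postbank CardServiceHamburg"
```

Most of the spaces are already missing from the garbled string, so shifting cannot bring them back; only the letters are recovered.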

Code

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($pdf_file);
$text = $pdf->getText();

k00ni commented 2 years ago

The PDF Parser doesn't return readable text anymore

Did it work in the past? If so, which version was it?

seven21 commented 2 years ago

The parser has only ever worked with the PDFs that do not use CID fonts. The parser has not changed; the PDF has. I will add a screenshot that shows the different fonts used. It seems the parser cannot read CID fonts, or at least not the Identity-H encoding. The new PDF uses both fonts: the text set in ANSI/Helvetica is parsed and readable, while the text set in CID/Identity-H produces the unreadable output.

[Screenshot: document properties, 2022-05-11 16-20-31]

seven21 commented 2 years ago

I have prepared two PDFs to reproduce the issue. The first one works, the second does not.

pdf-ansi.pdf pdf-indentiy-h.pdf

KurMaciek commented 2 years ago

Hey! I have the same problem. Do you have a solution to this problem?

[Screenshot, 2022-05-15 22:40:34]

CitizenDev commented 1 year ago

Same issue here. Mangled results for Type: TrueType (CID), Encoding: Identity-H.

hebinet commented 1 year ago

I have the same issue when parsing a PDF with CID fonts and Identity-H encoding.

PHP-Version: 8.1
PDFParser-Version: 2.2.1

Any updates on this?

k00ni commented 1 year ago

Not to my knowledge, sorry.