smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Strange characters while parsing PDF #654

Open LudovicMaillet opened 1 year ago

LudovicMaillet commented 1 year ago

Description:

Hello, very strange characters are returned when parsing a bank statement PDF.

PDF input

Releve_compte_31_10_2016-1.pdf

Expected output & actual output

[22-Nov-2023 15:48:38 Europe/Paris] $config: Smalot\PdfParser\Config Object
(
    [fontSpaceLimit:Smalot\PdfParser\Config:private] => -50
    [horizontalOffset:Smalot\PdfParser\Config:private] =>
    [pdfWhitespaces:Smalot\PdfParser\Config:private] =>

[22-Nov-2023 15:48:38 Europe/Paris] $metaData: Array
(
    [Author] => QIJS
    [CreationDate] => 2016-11-01T16:10:42+00:00
    [Creator] => IBM i program CBFEXPO/B0203P
    [ModDate] => 2019-01-14T10:59:15+01:00
    [Producer] => 2A55SAM Spool-a-Matic by Gumbo Software Inc V2R7M0
    [Subject] => B0203PPR
    [Pages] => 3
    [dc:format] => application/pdf
    [dc:creator] => QIJS
    [dc:description] => B0203PPR
    [xmp:createdate] => 2016-11-01T16:10:42
    [xmp:creatortool] => IBM i program CBFEXPO/B0203P
    [xmp:modifydate] => 2019-01-14T10:59:15+01:00
    [xmp:metadatadate] => 2019-01-14T10:59:15+01:00
    [pdf:producer] => 2A55SAM Spool-a-Matic by Gumbo Software Inc V2R7M0
    [xmpmm:documentid] => uuid:a93675b5-9dae-584d-baf5-106a071e6cfc
    [xmpmm:instanceid] => uuid:a8b43f68-a150-464b-94ce-f5079011be95
)

[22-Nov-2023 15:48:38 Europe/Paris] Number of pages : 3 [22-Nov-2023 15:48:38 Europe/Paris] $text: ŧ£™‰£@„…@¥–£™…@ƒ–”—£…@…• ÅäÙ M×™–£À‡À@—™@“@ǁ™•£‰…@„…¢@ÄÀ—Ë£¢K@¦¦¦K‡™•£‰…„…¢„…—–£¢K†™] @ÂÖäÙâÖÙÁÔÁ@ÅââÅÕãÉÅÓ@×Óäâ @@@@ñ ÔÙ@Öä@ÔÔÅ@ÅÔÔÁÕäÅÓ@ÂÙÖÃÈÖã ñô@ÙäÅ@âÅÇäÉÅÙ ×ÁÙÉâ ÷õððö@×ÁÙÉâ @ñaññaòðñö ôðöñøøðòöóðððôðò÷ôöõø ñõÅäÙ @ñañðaòðñö óñañðaòðñö@@@@ñKðððkðð@Ÿ @@ðkðððððð@l@@@ñ ÂÖäâÆÙ××ççç ÆÙ÷öôðöñøøðòöóðððôðò÷ôöõøñõ ÔÖäåÅÔÅÕãâ@ÅÕ@ÅäÙ âÖÓÄÅ@Áä@z óðaðùaòðñö @@@@@@@@@@@@ñKöòñkóö @óañðaòðñö Ù…“…¥À@„‰††À™À@Á™£…@ôù÷ù\\ö÷ôñ @óañðaòðñö@@@@@@@@@ñKõðøkôó @óañðaòðñö Ù…“…¥À@„‰††À™À@Á™£…@ôù÷ù\\ùñøù @óañðaòðñö@@@@@@@@@@@ô÷ökõø @óañðaòðñö Ù…“…¥À@„‰††À™À@Á™£…@ôù÷ù\\ùñøù @óañðaòðñö@@@@@@@@@@@@@ùkùù

Code

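My code:

// Parse the PDF file and build the necessary objects.
// Assumes Composer's autoloader is loaded and $pdf_file_name holds the PDF path.
use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile($pdf_file_name);

// Dump the parser configuration.
$config = $parser->getConfig();
error_log(' $config: ' . print_r($config, true));

// Dump the document metadata.
$metaData = $pdf->getDetails();
error_log(' $metaData: ' . print_r($metaData, true));

// Count the pages.
$pages = $pdf->getPages();
$nb_pages = count($pages);
error_log('Number of pages : ' . $nb_pages);

// Extract the text of the whole document.
$text = $pdf->getText();
error_log(' $text: ' . print_r($text, true));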

eddih19 commented 1 year ago

I've got a similar problem with one of my files; the others seem to work fine. The problem file is pdf_example.pdf, for which getText() returns: �0�H�U�N���1�U���� �� �2�P�V�F�K�U�L�M�Y�L�Q�J����

I've tried all types of settings and decoding but to no avail. I'm keeping an eye on this thread!

eddih19 commented 1 year ago

@LudovicMaillet I just found these two similar issues; apparently the problem is related to the decoding of certain font types (Identity-H):

https://github.com/smalot/pdfparser/issues/534
https://github.com/smalot/pdfparser/issues/641 (the last few comments)

It doesn't seem that there is a solution yet; hopefully someone corrects me on that!

Edit: it seems that your document does not contain the Identity-H font type, so you might have a different problem.

LudovicMaillet commented 1 year ago

Hello, I tried other PDF parsers, such as PDFCrowd, and they work very well with these documents. The problem affects all Boursorama bank statements since 2016. Regards, Ludovic
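
A diagnostic note on the first report, offered as a hypothesis rather than a confirmed cause: the metadata names an IBM i program as the creator, and the garbled text shows `@` wherever a space would be expected, which is what EBCDIC bytes look like when they are displayed as Windows-1252 (EBCDIC space is 0x40). The sketch below maps a fragment of the logged `$text` back under that assumption. The helper names `ebcdicTable` and `decodeEbcdicMojibake` are hypothetical and not part of pdfparser. The table covers only unaccented letters, digits, and a few punctuation marks that are shared across the common EBCDIC Latin code pages. The exact code page of the PDF is not established in this thread, so accented letters are left out. The file must be saved as UTF-8 and the mbstring extension is assumed.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: partial EBCDIC-to-ASCII table (letters, digits, some punctuation).
function ebcdicTable(): array
{
    $map = [
        0x40 => ' ', 0x4B => '.', 0x6B => ',', 0x61 => '/',
        0x60 => '-', 0x7A => ':', 0x5C => '*', 0x6C => '%',
    ];
    // EBCDIC stores letters in three non-contiguous runs per case, digits in one run.
    $ranges = [
        [0x81, 'a', 'i'], [0x91, 'j', 'r'], [0xA2, 's', 'z'],
        [0xC1, 'A', 'I'], [0xD1, 'J', 'R'], [0xE2, 'S', 'Z'],
        [0xF0, '0', '9'],
    ];
    foreach ($ranges as [$start, $first, $last]) {
        foreach (range($first, $last) as $i => $ch) {
            $map[$start + $i] = (string) $ch;
        }
    }
    return $map;
}

// Hypothetical helper: undo "EBCDIC bytes shown as Windows-1252" for a UTF-8 string.
function decodeEbcdicMojibake(string $garbled): string
{
    // Recover the original byte values from the displayed characters.
    $bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
    if (!is_string($bytes) || $bytes === '') {
        return '';
    }
    $map = ebcdicTable();
    $out = '';
    foreach (str_split($bytes) as $byte) {
        // Bytes outside the partial table are shown as '?'.
        $out .= $map[ord($byte)] ?? '?';
    }
    return $out;
}

// Fragments copied from the logged $text above.
echo decodeEbcdicMojibake('ÂÖäÙâÖÙÁÔÁ@ÅââÅÕãÉÅÓ@×Óäâ'), "\n"; // BOURSORAMA ESSENTIEL PLUS
echo decodeEbcdicMojibake('ñaññaòðñö'), "\n";                 // 1/11/2016
```

If the hypothesis holds, the text in the PDF is stored as EBCDIC without the mapping a parser needs to translate it, and the fix belongs in the font decoding step rather than in post-processing. Post-processing the logged output cannot be complete in any case, because bytes with no Windows-1252 character (for example 0x81, EBCDIC `a`) are already lost by the time the string is printed.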