Open LudovicMaillet opened 1 year ago
I've got a similar problem with one of my files; the others seem to work fine. For the problem file, pdf_example.pdf, getText() returns: �0�H�U�N���1�U���� �� �2�P�V�F�K�U�L�M�Y�L�Q�J����
I've tried all types of settings and decoding but to no avail. I'm keeping an eye on this thread!
@LudovicMaillet I just found these 2 similar issues; apparently the problem is related to the decoding of certain font types (Identity-H). See the last few comments there.
It doesn't seem that there is a solution yet; hopefully someone corrects me on that!
Edit: it seems that your document does not contain the Identity-H font type, so you might have a different problem.
Hello, I tried other PDF parsers, like PDFCrowd, and they work very well with these documents. The problem concerns all Boursorama bank details since 2016. Regards, Ludovic
Hello, very strange characters are returned when parsing a bank details PDF.
PDF input
Expected output & actual output
[22-Nov-2023 15:48:38 Europe/Paris] $config: Smalot\PdfParser\Config Object (
    [fontSpaceLimit:Smalot\PdfParser\Config:private] => -50
    [horizontalOffset:Smalot\PdfParser\Config:private] =>
    [pdfWhitespaces:Smalot\PdfParser\Config:private] =>
[22-Nov-2023 15:48:38 Europe/Paris] $metaData: Array (
    [Author] => QIJS
    [CreationDate] => 2016-11-01T16:10:42+00:00
    [Creator] => IBM i program CBFEXPO/B0203P
    [ModDate] => 2019-01-14T10:59:15+01:00
    [Producer] => 2A55SAM Spool-a-Matic by Gumbo Software Inc V2R7M0
    [Subject] => B0203PPR
    [Pages] => 3
    [dc:format] => application/pdf
    [dc:creator] => QIJS
    [dc:description] => B0203PPR
    [xmp:createdate] => 2016-11-01T16:10:42
    [xmp:creatortool] => IBM i program CBFEXPO/B0203P
    [xmp:modifydate] => 2019-01-14T10:59:15+01:00
    [xmp:metadatadate] => 2019-01-14T10:59:15+01:00
    [pdf:producer] => 2A55SAM Spool-a-Matic by Gumbo Software Inc V2R7M0
    [xmpmm:documentid] => uuid:a93675b5-9dae-584d-baf5-106a071e6cfc
    [xmpmm:instanceid] => uuid:a8b43f68-a150-464b-94ce-f5079011be95
)
[22-Nov-2023 15:48:38 Europe/Paris] Number of pages : 3
[22-Nov-2023 15:48:38 Europe/Paris] $text: ŧ£™‰£@„…@¥–£™…@ƒ–”—£…@…• ÅäÙ M×™–£À‡À@—™@“@Ç™•£‰…@„…¢@ÄÀ—Ë£¢K@¦¦¦K‡™•£‰…„…¢„…—–£¢K†™] @ÂÖäÙâÖÙÁÔÁ@ÅââÅÕãÉÅÓ@×Óäâ @@@@ñ ÔÙ@Öä@ÔÔÅ@ÅÔÔÁÕäÅÓ@ÂÙÖÃÈÖã ñô@ÙäÅ@âÅÇäÉÅÙ ×ÁÙÉâ ÷õððö@×ÁÙÉâ @ñaññaòðñö ôðöñøøðòöóðððôðò÷ôöõø ñõÅäÙ @ñañðaòðñö óñañðaòðñö@@@@ñKðððkðð@Ÿ @@ðkðððððð@l@@@ñ ÂÖäâÆÙ××ççç ÆÙ÷öôðöñøøðòöóðððôðò÷ôöõøñõ ÔÖäåÅÔÅÕãâ@ÅÕ@ÅäÙ âÖÓÄÅ@Áä@z óðaðùaòðñö @@@@@@@@@@@@ñKöòñkóö @óañðaòðñö Ù…“…¥À@„‰††À™À@Ù£…@ôù÷ù\\ö÷ôñ @óañðaòðñö@@@@@@@@@ñKõðøkôó @óañðaòðñö Ù…“…¥À@„‰††À™À@Ù£…@ôù÷ù\\ùñøù @óañðaòðñö@@@@@@@@@@@ô÷ökõø @óañðaòðñö Ù…“…¥À@„‰††À™À@Ù£…@ôù÷ù\\ùñøù @óañðaòðñö@@@@@@@@@@@@@ùkùù
My code:

use Smalot\PdfParser\Parser;

// Parse PDF file and build necessary objects.
$parser = new Parser();
$pdf = $parser->parseFile($pdf_file_name);

// Dump the parser configuration in use.
$config = $parser->getConfig();
error_log(' $config: '.print_r( $config, true ) );

// Dump the document metadata.
$metaData = $pdf->getDetails();
error_log(' $metaData: '.print_r( $metaData, true ) );

// Count the pages.
$pages = $pdf->getPages();
$nb_pages = sizeof( $pages );
error_log('Number of pages : '.$nb_pages );

// Extract the text of the whole document.
$text = $pdf->getText();
error_log(' $text: '.print_r( $text, true ) );
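
A possible lead, not confirmed anywhere in this thread: the metadata names an IBM i program as the creator, and the $text output looks like EBCDIC bytes displayed as Latin-1/Windows-1252 (for example, `@` appears wherever a space would be expected, and 0x40 is the EBCDIC space). The sketch below is only a quick way to check that assumption. `decodeEbcdicMojibake` is a hypothetical helper, not part of smalot/pdfparser; it maps just the uppercase letters, digits and space, whose positions are shared by EBCDIC code pages 037 and 500, and leaves every other character untouched.

```php
<?php
// Save this file as UTF-8 so the sample string below keeps its characters.

/**
 * Hypothetical helper: undo "EBCDIC bytes shown as Latin-1" for the
 * uppercase letters, digits and space only. Everything else is left as-is.
 */
function decodeEbcdicMojibake(string $text): string
{
    // EBCDIC ranges: [first byte, first ASCII character, length].
    $ranges = [
        [0xC1, 'A', 9],  // A-I
        [0xD1, 'J', 9],  // J-R
        [0xE2, 'S', 8],  // S-Z
        [0xF0, '0', 10], // 0-9
    ];

    // 0x40 is the EBCDIC space; read as Latin-1 it shows up as "@".
    $map = ['@' => ' '];

    foreach ($ranges as [$start, $first, $length]) {
        for ($i = 0; $i < $length; $i++) {
            $byte = $start + $i;
            // UTF-8 encoding of the Latin-1 character that has this byte value.
            $mojibake = chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
            $map[$mojibake] = chr(ord($first) + $i);
        }
    }

    return strtr($text, $map);
}

// Fragment copied from the $text output in this issue.
echo decodeEbcdicMojibake('ÂÖäÙâÖÙÁÔÁ@ÅââÅÕãÉÅÓ'), PHP_EOL; // prints "BOURSORAMA ESSENTIEL"
```

If the fragment decodes to readable text like this, it would suggest that the strings in the PDF are stored in an EBCDIC code page. Recovering lowercase and accented characters would then need a full table for the exact code page used, which this sketch does not attempt to guess.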