Open LudovicMaillet opened 1 year ago
I've got a similar problem with one of my files; the others work fine. For the problem file, pdf_example.pdf, getText() returns: �0�H�U�N���1�U���� �� �2�P�V�F�K�U�L�M�Y�L�Q�J����
I've tried all types of settings and decoding but to no avail. I'm keeping an eye on this thread!
@LudovicMaillet I just found these two similar issues. Apparently the problem is related to decoding fonts that use certain encodings (Identity-H):
https://github.com/smalot/pdfparser/issues/534 https://github.com/smalot/pdfparser/issues/641 (the last few comments)
There doesn't seem to be a solution yet; hopefully someone will correct me on that!
Edit: it seems that your document does not use Identity-H, so you might have a different problem.
Hello, I tried another PDF parser, PDFCrowd, and it works very well with these documents. The problem affects all Boursorama bank statements since 2016. Regards, Ludovic
Description:
Hello, Parsing a bank statement PDF returns very strange characters.
PDF input
Releve_compte_31_10_2016-1.pdf
Expected output & actual output
[22-Nov-2023 15:48:38 Europe/Paris] $config: Smalot\PdfParser\Config Object
(
    [fontSpaceLimit:Smalot\PdfParser\Config:private] => -50
    [horizontalOffset:Smalot\PdfParser\Config:private] =>
    [pdfWhitespaces:Smalot\PdfParser\Config:private] =>
[22-Nov-2023 15:48:38 Europe/Paris] $metaData: Array
(
    [Author] => QIJS
    [CreationDate] => 2016-11-01T16:10:42+00:00
    [Creator] => IBM i program CBFEXPO/B0203P
    [ModDate] => 2019-01-14T10:59:15+01:00
    [Producer] => 2A55SAM Spool-a-Matic by Gumbo Software Inc V2R7M0
    [Subject] => B0203PPR
    [Pages] => 3
    [dc:format] => application/pdf
    [dc:creator] => QIJS
    [dc:description] => B0203PPR
    [xmp:createdate] => 2016-11-01T16:10:42
    [xmp:creatortool] => IBM i program CBFEXPO/B0203P
    [xmp:modifydate] => 2019-01-14T10:59:15+01:00
    [xmp:metadatadate] => 2019-01-14T10:59:15+01:00
    [pdf:producer] => 2A55SAM Spool-a-Matic by Gumbo Software Inc V2R7M0
    [xmpmm:documentid] => uuid:a93675b5-9dae-584d-baf5-106a071e6cfc
    [xmpmm:instanceid] => uuid:a8b43f68-a150-464b-94ce-f5079011be95
)
[22-Nov-2023 15:48:38 Europe/Paris] Number of pages : 3 [22-Nov-2023 15:48:38 Europe/Paris] $text: ŧ£™‰£@„…@¥–£™…@ƒ–”—£…@…• ÅäÙ M×™–£À‡À@—™@“@Ç™•£‰…@„…¢@ÄÀ—Ë£¢K@¦¦¦K‡™•£‰…„…¢„…—–£¢K†™] @ÂÖäÙâÖÙÁÔÁ@ÅââÅÕãÉÅÓ@×Óäâ @@@@ñ ÔÙ@Öä@ÔÔÅ@ÅÔÔÁÕäÅÓ@ÂÙÖÃÈÖã ñô@ÙäÅ@âÅÇäÉÅÙ ×ÁÙÉâ ÷õððö@×ÁÙÉâ @ñaññaòðñö ôðöñøøðòöóðððôðò÷ôöõø ñõÅäÙ @ñañðaòðñö óñañðaòðñö@@@@ñKðððkðð@Ÿ @@ðkðððððð@l@@@ñ ÂÖäâÆÙ××ççç ÆÙ÷öôðöñøøðòöóðððôðò÷ôöõøñõ ÔÖäåÅÔÅÕãâ@ÅÕ@ÅäÙ âÖÓÄÅ@Áä@z óðaðùaòðñö @@@@@@@@@@@@ñKöòñkóö @óañðaòðñö Ù…“…¥À@„‰††À™À@Ù£…@ôù÷ù\\ö÷ôñ @óañðaòðñö@@@@@@@@@ñKõðøkôó @óañðaòðñö Ù…“…¥À@„‰††À™À@Ù£…@ôù÷ù\\ùñøù @óañðaòðñö@@@@@@@@@@@ô÷ökõø @óañðaòðñö Ù…“…¥À@„‰††À™À@Ù£…@ôù÷ù\\ùñøù @óañðaòðñö@@@@@@@@@@@@@ùkùù
Code
My code:
use Smalot\PdfParser\Parser;

// Parse the PDF file and build the necessary objects.
$parser = new Parser();
$pdf = $parser->parseFile($pdf_file_name);

// Log the parser configuration.
$config = $parser->getConfig();
error_log(' $config: ' . print_r($config, true));

// Log the document metadata.
$metaData = $pdf->getDetails();
error_log(' $metaData: ' . print_r($metaData, true));

// Log the page count.
$pages = $pdf->getPages();
$nb_pages = count($pages);
error_log('Number of pages : ' . $nb_pages);

// Log the extracted text.
$text = $pdf->getText();
error_log(' $text: ' . print_r($text, true));
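A possible lead, offered as a guess rather than a confirmed diagnosis: the Creator field in the metadata above is an IBM i program, and the $text output looks like EBCDIC bytes displayed as Windows-1252 characters (for example, `@` is byte 0x40, which is a space in EBCDIC). The sketch below tests that reading on two fragments copied from the output. The helper `ebcdicToText()` is hypothetical and not part of pdfparser; it covers only the space, unaccented letters, digits, and three punctuation marks, and it assumes the mbstring extension is available and the file is saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of pdfparser): decodes a small subset of
// EBCDIC (space, unaccented letters, digits, and . , /).
// Every other byte is replaced with '?'.
function ebcdicToText(string $bytes): string
{
    $table = [0x40 => ' ', 0x4B => '.', 0x61 => '/', 0x6B => ','];

    // Each run: [first EBCDIC byte, first ASCII character, run length].
    $runs = [
        [0x81, 'a', 9], [0x91, 'j', 9], [0xA2, 's', 8],
        [0xC1, 'A', 9], [0xD1, 'J', 9], [0xE2, 'S', 8],
        [0xF0, '0', 10],
    ];
    foreach ($runs as [$start, $first, $length]) {
        for ($i = 0; $i < $length; $i++) {
            $table[$start + $i] = chr(ord($first) + $i);
        }
    }

    $text = '';
    foreach (unpack('C*', $bytes) as $byte) {
        $text .= $table[$byte] ?? '?';
    }

    return $text;
}

// Hypothetical wrapper: turns each displayed character back into a single
// byte (assumption: the bytes were shown as Windows-1252), then decodes.
function decodeGarbled(string $garbled): string
{
    $bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

    return ebcdicToText($bytes);
}

// Fragments copied from the $text output in this issue.
echo decodeGarbled('ÂÖäÙâÖÙÁÔÁ@ÅââÅÕãÉÅÓ@×Óäâ'), PHP_EOL; // BOURSORAMA ESSENTIEL PLUS
echo decodeGarbled('óñañðaòðñö'), PHP_EOL;                // 31/10/2016
```

The second fragment decodes to the date in the file name Releve_compte_31_10_2016-1.pdf, which supports the guess. Accented letters and most punctuation differ between EBCDIC code pages, so the sketch leaves them as `?`; which code page the file actually uses is not established here.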