smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Weird UTF-8 Characters when parsing hex string (keywords) #683

Status: Closed (closed by code-mage-com 6 months ago)

code-mage-com commented 7 months ago

Description:

When parsing the attached PDF file, the extracted keywords contain stray CJK-looking ("Japanese") characters, such as "挀爀椀猀琀椀愀渀愀ⰰ 最攀猀豈ⰰ 甀漀ݠؐȁـڐȁذڐ۰ذذ۰ۀؐ݀ؐˀȁذ۰niglietti"

After debugging, I found that the cause is in ElementHexa::decode(): the $value parameter receives a hex string that is split across several lines of 80 characters each, like so:

feff007000610073007100750061002c0020007000720069006d00610076006500720061002c0020
0072006500730075007200720065007a0069006f006e0065002c0020006600650073007400610020
0063007200690073007400690061006e0061002c002000670065007300f9002c00200075006f0076
0061002000640069002000630069006f00630063006f006c006100740061002c00200063006f006e
00690067006c00690065007400740069002c002000700075006c00630069006e0069002c00200070
00610073007100750061006c0065002c002000630061006d00700061006e0065002c002000640069
006e006100200072006500620075006300630069002c00200075006f007600610020006400690020
007000610073007100750061002c0020
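To illustrate why those line breaks matter: the UTF-16 branch of decode() reads the input four hex digits at a time, so a single extra character shifts every group after it. The snippet below is only a sketch of that effect; `groupsOfFour` is a hypothetical helper, not part of PdfParser, and the sample strings are made up.

```php
<?php
// Hypothetical helper (not part of PdfParser): split a hex string into
// 4-character groups, the same stride the UTF-16 branch of decode() uses.
function groupsOfFour(string $hex): array
{
    return str_split($hex, 4);
}

$clean   = '00700061';    // "pa" as two UTF-16BE code units
$wrapped = "0070\n0061";  // same data with a line break in the middle

var_dump(groupsOfFour($clean));   // two aligned groups: "0070", "0061"
var_dump(groupsOfFour($wrapped)); // the newline lands inside the second group
```

This matches the report: the first 80-character line holds the `feff` marker plus 19 characters ("pasqua, primavera, "), and the actual output is correct up to exactly that point.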

I got it working correctly by adding a preg_replace call before the initial length calculation (to remove carriage returns and newlines, if any, before parsing):

    public static function decode(string $value): string
    {
        $text = '';
        // Strip line breaks so the hex digit groups below stay aligned.
        $value = preg_replace('#[\r\n]+#', '', $value);
        $length = \strlen($value);

        if ('00' === substr($value, 0, 2)) {
            for ($i = 0; $i < $length; $i += 4) {
                $hex = substr($value, $i, 4);
                $text .= '&#'.str_pad(hexdec($hex), 4, '0', \STR_PAD_LEFT).';';
            }
        } else {
            for ($i = 0; $i < $length; $i += 2) {
                $hex = substr($value, $i, 2);
                $text .= \chr(hexdec($hex));
            }
        }
        $text = html_entity_decode($text, \ENT_NOQUOTES, 'UTF-8');

        return $text;
    }
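
As a quick way to try the idea outside the library, here is a minimal standalone sketch. `decodeUtf16Hex` is a hypothetical function that mirrors only the UTF-16 branch of the patched method above; it assumes the input has no leading `feff` marker, and the sample input is made up.

```php
<?php
// Standalone sketch, NOT the library method: mirrors the UTF-16 branch
// of the patched decode() shown above.
function decodeUtf16Hex(string $value): string
{
    // Drop CR/LF so every 4-digit group lines up with one UTF-16 code unit.
    $value = preg_replace('#[\r\n]+#', '', $value);

    $text = '';
    foreach (str_split($value, 4) as $hex) {
        // Build a numeric character reference such as "&#112;".
        $text .= '&#'.hexdec($hex).';';
    }

    return html_entity_decode($text, ENT_NOQUOTES, 'UTF-8');
}

// Hex data wrapped with both LF and CRLF line breaks.
echo decodeUtf16Hex("00700061\n00730071\r\n00750061"), "\n";
```

With the preg_replace line removed, the same input would be read in misaligned groups.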

PDF input

meta1.pdf

Expected output & actual output

Expected output:

pasqua, primavera, resurrezione, festa cristiana, gesù, uova di cioccolata, coniglietti, pulcini, pasquale, campane, dina rebucci, uova di pasqua, 

Actual output (without the preg_replace line):

pasqua, primavera,  倇〇倇  倇ꀆ逆倂쀂怆倇〇䀆ဂ挀爀椀猀琀椀愀渀愀Ⰰ 最攀猀豈Ⰰ 甀漀瘀ؐȀـڐȀذڐ۰ذذ۰ۀؐ݀ؐˀȀذ۰؎iglietti, pulcini, pဇ〇ဇ倆ဆ쀆倂쀂〆ဆ퀇ဆ倂쀂䀆ऀ渀愀 爀攀戀甀挀挀椀Ⰰ 甀漀瘀愀 搀椀 ܀ؐܰܐݐؐˀȀ


GreyWyvern commented 7 months ago

Nice catch! :D

I think it would be better to strictly whitelist only hexadecimal digits rather than just excluding newlines and carriage returns:

$value = preg_replace('/[^0-9a-f]/i', '', $value);

But both definitely fix the issue with this file.

This should be a simple change, and if you can add your PDF to the test folder, you have a great file for a unit test. Do you want to create a PR for this fix?
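
For comparison, a small sketch of how the two patterns from this thread differ on input that contains other whitespace as well. The sample string is made up; the PDF specification says white-space inside a hex string is to be ignored, which is the case the whitelist also covers.

```php
<?php
// Made-up sample: hex digits mixed with a space, a CRLF and a tab.
$value = "0070 0061\r\n\t0073";

// Pattern from the original report: removes only CR and LF.
$newlinesOnly = preg_replace('#[\r\n]+#', '', $value);

// Whitelist pattern: keeps hexadecimal digits and nothing else.
$hexOnly = preg_replace('/[^0-9a-f]/i', '', $value);

var_dump($newlinesOnly); // the space and the tab are still present
var_dump($hexOnly);      // only the twelve hex digits remain
```

For the file in this issue both give the same result, since its hex string contains only line breaks.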

GreyWyvern commented 6 months ago

@code-mage-com, may we add your test document meta1.pdf to the PdfParser repo so I can create a test case?

code-mage-com commented 6 months ago

Hi Brian,

sorry, have had no time to look at the PR (don't think I'll be able to get time to make one)...

As to the PDF document, no problem adding that to the test case.

Best Regards, Dmitri
