smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Metadata content garbled for some PDFs #730

Open rdmpage opened 3 months ago

rdmpage commented 3 months ago

Description:

For some PDFs (e.g., the one attached) the metadata comes out garbled. This seems to be associated with PDFs that are encrypted, but I don't know enough about the PDF standard to know whether encryption also applies to metadata.

PDF input

TZ_316_4_Gorochov.pdf

Expected output & actual output

The output from mutool is what I expect, e.g. the Title is "SYSTEMATICS OF THE AMERICAN KATYDIDS (ORTHOPTERA: TETTIGONIIDAE). COMMUNICATION 2":

mutool info TZ_316_4_Gorochov.pdf
TZ_316_4_Gorochov.pdf:

PDF-1.6
Info object (68 0 R):
<</CreationDate(D:20121225141316+04'00')/Author(A.V. Gorochov)/Creator(PScript5.dll Version 5.2.2)/Producer(Acrobat Distiller 9.5.2 \(Windows\))/ModDate(D:20121225161815+04'00')/Title(SYSTEMATICS OF THE AMERICAN KATYDIDS \(ORTHOPTERA: TETTIGONIIDAE\). COMMUNICATION 2)>>
Encryption object (70 0 R):
<</Length 128/Filter/Standard/O<0EBA1908E5CD53B188213637794EA65838027C93E38494B55544F4375B294C90>/P -1036/R 3/U<8049AC430DA9683FBBC0F5C6392E856600000000000000000000000000000000>/V 2>>
Pages: 22
...

What I get from PdfParser is the following:

*** Metadata ***
Array
(
    [CreationDate] => CŠtW“Ò˙Mð,¯š Wgá3agí
ÂQ©wèAuthor] => F…Iœ§E
    [Creator] => Wþ%Õ’^J…Vt¾øt?[ºzqbäÿ#i
    [Producer] => FÎÞ†^_é[²l÷Â}>ì:dyøí%
                                       a¤»fi²å
    [ModDate] => CŠtW“Ò˙Mð.¯Œ Tgá3agí
¨wèu³pô.@‘Ïˇ{@[òÜ¡ÐèU^éÛ3x=؈"¬OÔLŽOˆFêfl½‚,‹'f  H‚6
    [Pages] => 22
)

Code

<?php

// Example of PDF with bad characters

require_once __DIR__ . '/vendor/autoload.php';

$filename = 'TZ_316_4_Gorochov.pdf';

$parser_config = new \Smalot\PdfParser\Config();
$parser_config->setRetainImageContent(false);
// Parse the file even though it is flagged as encrypted
$parser_config->setIgnoreEncryption(true);

$parser = new \Smalot\PdfParser\Parser([], $parser_config);

// parse PDF
$pdf = $parser->parseFile($filename);

// Metadata
if (method_exists($pdf, 'getDetails')) {
    $metadata = $pdf->getDetails();

    echo "*** Metadata ***\n";
    print_r($metadata);
}
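To test the encryption hypothesis from the description without relying on mutool, something like the following could flag affected files before their metadata is trusted. This is only a rough sketch: `looksEncrypted` is a hypothetical helper, not part of PdfParser, and it assumes the `/Encrypt` entry appears as an indirect reference (as in the mutool output above, `70 0 R`) in uncompressed bytes of the file. A PDF that stores the encryption dictionary inline would not be detected.

```php
<?php

// Hypothetical helper, not part of PdfParser: reports whether the raw bytes
// contain an /Encrypt entry pointing at an indirect object, e.g. "/Encrypt 70 0 R".
function looksEncrypted(string $pdfBytes): bool
{
    return preg_match('/\/Encrypt\s*\d+\s+\d+\s+R/', $pdfBytes) === 1;
}

// Self-contained demonstration on trailer-like fragments (made up for illustration).
$withEncrypt    = '<</Size 71/Root 1 0 R/Info 68 0 R/Encrypt 70 0 R>>';
$withoutEncrypt = '<</Size 71/Root 1 0 R/Info 68 0 R>>';

var_dump(looksEncrypted($withEncrypt));
var_dump(looksEncrypted($withoutEncrypt));
```

For a real file the check would be `looksEncrypted(file_get_contents($filename))`. If it returns true, the values from `getDetails()` may well be the still-encrypted strings rather than readable text.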