rdmpage opened 3 months ago
For some PDFs (e.g., attached) the metadata is garbled. This seems to be associated with PDFs that are encrypted, but I don't know enough about the PDF standard to know whether encryption also applies to metadata.
TZ_316_4_Gorochov.pdf
Output from mutool is what I expect, e.g. the Title is:
SYSTEMATICS OF THE AMERICAN KATYDIDS \(ORTHOPTERA: TETTIGONIIDAE\). COMMUNICATION 2
mutool info TZ_316_4_Gorochov.pdf
TZ_316_4_Gorochov.pdf:
PDF-1.6
Info object (68 0 R):
<</CreationDate(D:20121225141316+04'00')/Author(A.V. Gorochov)/Creator(PScript5.dll Version 5.2.2)/Producer(Acrobat Distiller 9.5.2 \(Windows\))/ModDate(D:20121225161815+04'00')/Title(SYSTEMATICS OF THE AMERICAN KATYDIDS \(ORTHOPTERA: TETTIGONIIDAE\). COMMUNICATION 2)>>
Encryption object (70 0 R):
<</Length 128/Filter/Standard/O<0EBA1908E5CD53B188213637794EA65838027C93E38494B55544F4375B294C90>/P -1036/R 3/U<8049AC430DA9683FBBC0F5C6392E856600000000000000000000000000000000>/V 2>>
Pages: 22
...
What I get from PdfParser is the following:
*** Metadata *** Array ( [CreationDate] => CŠtW“Ò˙Mð,¯š Wgá3agí ÂQ©wèAuthor] => F…Iœ§E [Creator] => Wþ%Õ’^J…Vt¾øt?[ºzqbäÿ#i [Producer] => FÎÞ†^_é[²l÷Â}>ì:dyøí% a¤»fi²å [ModDate] => CŠtW“Ò˙Mð.¯Œ Tgá3agí ¨wèu³pô.@‘Ïˇ{@[òÜ¡ÐèU^éÛ3x=؈"¬OÔLŽOˆFêfl½‚,‹'f H‚6 [Pages] => 22 )
<?php

// Example of PDF with bad characters

require_once (dirname(__FILE__) . '/vendor/autoload.php');

$filename = 'TZ_316_4_Gorochov.pdf';

$parser_config = new \Smalot\PdfParser\Config();
$parser_config->setRetainImageContent(false);
$parser_config->setIgnoreEncryption(true);

$parser = new \Smalot\PdfParser\Parser([], $parser_config);

// parse PDF
$pdf = $parser->parseFile($filename);

// Metadata
if (method_exists($pdf, 'getDetails'))
{
    $metadata = $pdf->getDetails();

    echo "*** Metadata ***\n";
    print_r($metadata);
}

?>
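On the encryption question: with the Standard security handler, the strings in the Info dictionary are encrypted along with the rest of the document's strings and streams. That would be consistent with the garbled values above if the strings are being read without being decrypted, though I have not confirmed that this is what PdfParser does. Below is a minimal sketch for flagging such files before trusting their metadata. It assumes the `/Encrypt` entry appears uncompressed in the trailer or cross-reference stream dictionary. `pdfLooksEncrypted` is a hypothetical helper, not part of PdfParser.

```php
<?php

// Hypothetical helper (not part of PdfParser): scans the raw PDF bytes for an
// /Encrypt entry, given either as an indirect reference ("/Encrypt 70 0 R")
// or as an inline dictionary ("/Encrypt <<...>>").
// This is a heuristic and does not parse the file structure.
function pdfLooksEncrypted(string $bytes): bool
{
    return preg_match('/\/Encrypt\s*(?:\d+\s+\d+\s+R|<<)/', $bytes) === 1;
}

$filename = 'TZ_316_4_Gorochov.pdf';

if (is_readable($filename)) {
    if (pdfLooksEncrypted(file_get_contents($filename))) {
        echo "Found an /Encrypt entry: metadata strings are probably encrypted\n";
    } else {
        echo "No /Encrypt entry found\n";
    }
}
```

The pattern deliberately does not match the separate `/EncryptMetadata` key, since that key is followed by a boolean rather than an object reference or dictionary.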