smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Title and other properties not read with getDetails for some files #721

Closed dbarron closed 4 months ago

dbarron commented 5 months ago

Description:

I recently used pdfparser to load 2500 documents into Drupal. I pulled the properties with getDetails() to populate the Drupal fields.

About 50 documents did not get the right title and other properties, even though I can see them when I look at the files with a text editor and every other PDF tool I used saw them (eCopy, Chrome, Linux document viewer, Funnelback search engine).

I've attached a zip with some files that failed, and a few that succeeded, in case that helps track this down.

PDF input

Attached.

Expected output & actual output

The input is in the attached file. The output should include all properties, but many are missing.

I have never seen an error of any kind.

Code

// Requires: use Smalot\PdfParser\Parser;
public function getMetadataFromFile($file) {
  $parser = new Parser();

  // Parse the PDF, then return its document properties (Title, Author, ...).
  $pdf = $parser->parseFile($file);

  return $pdf->getDetails();
}

I have a similar function that uses parseContent, and the problem is there, too. (I used parseContent to import existing files from an old system, and parseFile for new uploads.)

Documents.zip

GreyWyvern commented 5 months ago

It looks like the Title metadata for these "bad" files is located in the XMP data only. However, extractXMPMetadata() expects a dc:format tag that tells PdfParser that the metadata refers to the PDF itself, and not to an embedded file. These PDF files have no dc:format tag.
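To check whether a given file is affected, one option is to look for a dc:format entry in the raw bytes, much like opening the file in a text editor. The following is only a rough sketch: findDcFormat() is a hypothetical helper, not part of PdfParser, and it assumes the XMP packet is stored uncompressed, which is common but not guaranteed.

```php
<?php
// Hypothetical helper (not part of PdfParser): returns the dc:format value
// found in an uncompressed XMP packet, or null when no dc:format is present.
function findDcFormat(string $rawPdf): ?string
{
    // Element form: <dc:format>application/pdf</dc:format>
    if (preg_match('~<dc:format>\s*([^<]*?)\s*</dc:format>~', $rawPdf, $m)) {
        return $m[1];
    }

    // Attribute form: dc:format="application/pdf"
    if (preg_match('~dc:format="([^"]*)"~', $rawPdf, $m)) {
        return $m[1];
    }

    return null;
}
```

For a real file, this could be called as findDcFormat(file_get_contents($file)); a null result would suggest the file is one of the "bad" ones described here.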

The fix for this is to merge the found XMP metadata if a dc:format tag doesn't exist, and if it does exist, only merge it if the MIME-type is application/pdf.

Updated line 290 of Document.php:

            if (!isset($metadata['dc:format']) || 'application/pdf' == $metadata['dc:format']) {
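
To illustrate how that condition behaves, here is a small standalone sketch. The wrapper function shouldMergeXmp() and the sample arrays are made up for illustration and do not appear in Document.php; only the boolean expression is taken from the line above.

```php
<?php
// Hypothetical wrapper around the updated condition, for illustration only.
function shouldMergeXmp(array $metadata): bool
{
    // Merge when dc:format is absent, or when it names the PDF MIME type.
    return !isset($metadata['dc:format']) || 'application/pdf' == $metadata['dc:format'];
}

// No dc:format tag (the "bad" files in this issue): metadata is merged.
var_dump(shouldMergeXmp(['dc:title' => 'Report']));            // bool(true)

// dc:format says the XMP describes the PDF itself: metadata is merged.
var_dump(shouldMergeXmp(['dc:format' => 'application/pdf']));  // bool(true)

// dc:format names another type, e.g. an embedded image: metadata is skipped.
var_dump(shouldMergeXmp(['dc:format' => 'image/jpeg']));       // bool(false)
```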