Closed dbarron closed 4 months ago
It looks like the Title metadata for these "bad" files is located in the XMP data only. However, extractXMPMetadata() expects there to be a dc:format tag that tells the PdfParser that the metadata is referring to the PDF, and not an embedded file. In these PDF files there is no dc:format tag.
The fix for this is to merge the found XMP metadata if a dc:format tag doesn't exist, and if it does exist, only merge it if the MIME-type is application/pdf.
Updated line 290 from Document.php:
if (!isset($metadata['dc:format']) || 'application/pdf' == $metadata['dc:format']) {
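To see the effect of that condition in isolation, here is a minimal sketch. The helper names shouldMergeXmp() and mergeXmp() are hypothetical and are not part of pdfparser; the sketch also assumes the XMP packet has already been flattened into an associative array keyed by tag name, which may not match how Document.php holds it internally.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of pdfparser): decides whether an XMP
 * packet should be merged into the document-level details.
 */
function shouldMergeXmp(array $metadata): bool
{
    // No dc:format tag: assume the packet describes the PDF itself.
    if (!isset($metadata['dc:format'])) {
        return true;
    }

    // dc:format present: merge only when it names the PDF MIME type,
    // so metadata belonging to an embedded file is skipped.
    return 'application/pdf' == $metadata['dc:format'];
}

/**
 * Hypothetical helper: returns $details with the XMP values merged in
 * when the packet qualifies, or $details unchanged otherwise.
 */
function mergeXmp(array $details, array $metadata): array
{
    return shouldMergeXmp($metadata) ? array_merge($details, $metadata) : $details;
}
```

With this check, a packet that carries only dc:title (the case in the failing files) is merged, while a packet whose dc:format is something like image/jpeg is still ignored.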
Description:
I recently used pdfparser to load 2500 documents into Drupal. I pulled the properties with getDetails() to populate the Drupal fields.
About 50 documents did not get the right title and other properties, even though I can see them when I open the files in a text editor, and every other PDF tool I tried (eCopy, Chrome, Linux document viewer, Funnelback search engine) saw them too.
I've attached a zip with some files that failed, and a few that succeeded, in case that helps track this down.
PDF input
Attached.
Expected output & actual output
Input is in attached file. Output expected to include all properties, but many are missing.
I have never seen an error of any kind.
Code
I have a similar function that uses parseContent, and the problem is there, too. (I used parseContent to import existing files from an old system, and parseFile for new uploads.)
Documents.zip