mozilla / pdf.js

PDF Reader in JavaScript
https://mozilla.github.io/pdf.js/
Apache License 2.0

Invalid metadata in test file "f1040.pdf" #1437

Closed saebekassebil closed 11 years ago

saebekassebil commented 12 years ago

While implementing the Metadata object and its parsing, Brendan found a document that didn't parse properly. The reason is that the metadata in the file (which is in XMP format, which in turn is an XML format) is corrupted. The embedded XML document in the file is this (beautified):

<?xml version="1.0" encoding="UTF-8"?>
<?xpacket begin="ï»¿" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2-jc015 52.362067, 2008 Oct 21 15:11:25-PDT (debug)">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description xmlns:xmp="http://ns.adobe.com/xap/1.0/" rdf:about="">
         <xmp:CreateDate>2008-11-18T08:45:06-05:00</xmp:CreateDate>
         <xmp:CreatorTool>Adobe LiveCycle Designer ES 8.2</xmp:CreatorTool>
         <xmp:ModifyDate>2011-11-04T16:02:51-04:00</xmp:ModifyDate>
         <xmp:MetadataDate>2011-11-04T16:02:51-04:00</xmp:MetadataDate>
      </rdf:Description>
      <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
         <dc:format>application/pdf</dc:format>
         <dc:subject>
            <rdf:Bag>
               <rdf:li>Fillable</rdf:li>
            </rdf:Bag>
         </dc:subject>
         <dc:description>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">U.S. Individual Income Tax Return</rdf:li>
            </rdf:Alt>
         </dc:description>
         <dc:creator>
            <rdf:Seq>
               <rdf:li>SE:W:CAR:MP</rdf:li>
            </rdf:Seq>
         </dc:creator>
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">2011 Form 1040</rdf:li>
            </rdf:Alt>
         </dc:title>
      </rdf:Description>
      <rdf:Description xmlns:pdf="http://ns.adobe.com/pdf/1.3/" rdf:about="">
         <pdf:Keywords>Fillable</pdf:Keywords>
         <pdf:Producer>Adobe LiveCycle Designer ES 8.2</pdf:Producer>
      </rdf:Description>
      <rdf:Description xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/" rdf:about="">
         <xmpMM:DocumentID>uuid:3ce15072-4e1e-4f71-b226-34eb4f217f5c</xmpMM:DocumentID>
         <xmpMM:InstanceID>uuid:ac88d95b-85b1-2a6e-5f15-404636056cd3</xmpMM:InstanceID>
      </rdf:Description>
      <rdf:Description xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/" rdf:about="">
         <pdfx:Accessibility>structured; tagged</pdfx:Accessibility>
         <pdfx:Form⃇0020fields>fillable</pdfx:Form⃇0020fields>
      </rdf:Description>
      <rdf:Description xmlns:adhocwf="http://ns.adobe.com/AcrobatAdhocWorkflow/1.0/" rdf:about="">
         <adhocwf:state>1</adhocwf:state>
         <adhocwf:version>1.1</adhocwf:version>
      </rdf:Description>
   </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>

The begin attribute of the <?xpacket?> processing instruction is invalid. It should either be empty or contain U+FEFF, the Unicode "zero width non-breaking space" character. Furthermore, there is an invalid tag name, pdfx:Form⃇0020fields, as the character ^ is invalid in tag names.
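To make the rule about the begin attribute concrete, here is a minimal sketch of a check for it. The helper names getXpacketBegin and isValidXpacketBegin are hypothetical and not part of pdf.js, and the regular expression assumes the attribute appears directly after the xpacket target, as it does in this file.

```javascript
// Hypothetical helpers for illustration only; not pdf.js API.

// Returns the raw value of the xpacket "begin" attribute,
// or null when no opening xpacket instruction is found.
function getXpacketBegin(xml) {
  const match = /<\?xpacket\s+begin=(["'])(.*?)\1/.exec(xml);
  return match ? match[2] : null;
}

// Per the rule stated above: the value must be empty or exactly U+FEFF.
function isValidXpacketBegin(value) {
  return value === "" || value === "\uFEFF";
}
```

A caller could run getXpacketBegin on the raw metadata string and, when isValidXpacketBegin returns false, decide whether to repair the value or skip the metadata.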

Chrome's DOMParser implementation parses this without error, even though the document is invalid, but Firefox and probably also other browsers fail to parse (as they should).
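For reference, a rough sketch of how a caller might detect a failed parse in a browser. DOMParser reports failure by returning a document that contains a parsererror element rather than by throwing. The function name hasXmlParseError is hypothetical, and the optional parser parameter is only there so the function can be exercised outside a browser.

```javascript
// Hypothetical helper; assumes a browser-like DOMParser when no parser is passed.
function hasXmlParseError(xml, parser = new DOMParser()) {
  const doc = parser.parseFromString(xml, "application/xml");
  // A failed parse yields a document containing a <parsererror> element.
  return doc.getElementsByTagName("parsererror").length > 0;
}
```

With a check like this, the same metadata string can be tried in each browser to see which ones accept it.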

I'm aware that this may not be a pressing issue, but the file is nonetheless invalid, since the embedded XML document is invalid.

brendandahl commented 12 years ago

Related PR #1413

gigaherz commented 12 years ago

The document appears to be encoded in UTF-8 (without any BOM), but parsed as ASCII. "ï»¿" is what a UTF-8 encoded ZWNBS character looks like if you view it as ASCII. If I reinterpret that XML as UTF-8 in Notepad++, it passes the XML checker, but W3Schools' XML validator fails on the tag name.
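The misreading is easy to reproduce. The sketch below encodes U+FEFF as UTF-8 and then treats each byte as its own character, which is what a single-byte (Latin-1 style) interpretation does. The variable names are illustrative only.

```javascript
// U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
const bomBytes = new TextEncoder().encode("\uFEFF");

// Reading each byte as a separate character gives the mojibake.
const misread = String.fromCharCode(...bomBytes);
console.log(misread); // ï»¿

// Going the other way: turn the characters back into bytes and decode as UTF-8.
// ignoreBOM: true keeps the U+FEFF in the output instead of stripping it.
const rawBytes = Uint8Array.from(misread, (ch) => ch.charCodeAt(0));
const repaired = new TextDecoder("utf-8", { ignoreBOM: true }).decode(rawBytes);
```

Here repaired is the single character U+FEFF again, which matches the observation that the XML passes once it is reinterpreted as UTF-8.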