Closed saebekassebil closed 11 years ago
Related PR #1413
The document appears to be encoded in UTF-8 (without any BOM), but parsed as ASCII. "ï»¿" is what a UTF-8 encoded ZWNBSP character (U+FEFF) looks like if you view it as ASCII. If I reinterpret that XML as UTF-8 in Notepad++, it passes the XML checker, but the W3Schools XML validator fails on the tag name.
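To illustrate the encoding mix-up described above, here is a minimal sketch. It assumes the "ASCII" view is really a single-byte encoding such as Latin-1 (strict ASCII has no characters above 0x7F), and the helper name `misreadUtf8AsLatin1` is made up for this example.

```javascript
// Encode a string as UTF-8, then misread every byte as one Latin-1 character.
// This mimics a viewer that treats UTF-8 data as a single-byte encoding.
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes
  return Array.from(bytes, (b) => String.fromCharCode(b)).join("");
}

// U+FEFF (zero width no-break space / BOM) is EF BB BF in UTF-8.
const garbled = misreadUtf8AsLatin1("\uFEFF");
console.log(garbled); // ï»¿
```

The three bytes EF, BB and BF map to the Latin-1 characters ï, » and ¿, which is why the single invisible character shows up as three visible ones.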
While implementing the Metadata object and parsing, Brendan found a document that didn't parse properly. The reason is that the metadata in the file (which is in XMP format, which in turn is an XML format) is corrupted. The embedded XML document in the file is this (beautified):
The begin attribute of the <xpacket> element is invalid. It should either be empty or contain U+FEFF, the Unicode "zero width non-breaking space" character. Furthermore, there is an invalid tagName, pdfx:Form⃇0020fields, as the character ^ is invalid in tag names.
Chrome's DOMParser implementation parses this without error, even though the document is invalid, but Firefox and probably other browsers fail to parse it (as they should).
I'm aware that this may not be a pressing issue, but it's nonetheless an invalid document, since the embedded XML document is invalid.
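The tag name problem can be checked independently of any browser parser. The sketch below assumes the Name production from XML 1.0 (Fifth Edition) and ignores namespace rules; `isValidXmlName` and the sample name `pdfx:Form^0020fields` are hypothetical and only illustrate that ^ is rejected.

```javascript
// Character ranges from the XML 1.0 (Fifth Edition) NameStartChar production.
const START = String.raw`:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF` +
  String.raw`\u0370-\u037D\u037F-\u1FFF\u200C\u200D\u2070-\u218F` +
  String.raw`\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD` +
  String.raw`\u{10000}-\u{EFFFF}`;

// NameChar adds "-", ".", digits, U+00B7 and two small ranges.
const REST = START + String.raw`\-.0-9\u00B7\u0300-\u036F\u203F\u2040`;

const XML_NAME = new RegExp(`^[${START}][${REST}]*$`, "u");

// Returns true if `name` matches the XML Name production (not namespace-aware).
function isValidXmlName(name) {
  return XML_NAME.test(name);
}

console.log(isValidXmlName("x:xmpmeta"));            // true
console.log(isValidXmlName("pdfx:Form^0020fields")); // false
```

A strict parser has to reject such a name as a well-formedness error, which matches the Firefox behaviour described above.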