Parsing <wp:postmeta> using an attribute-based/less-verbose XML structure

drzraf commented 7 years ago

<wp:postmeta> is usually exported this way:

<wp:postmeta>
   <wp:meta_key>_wp_page_template</wp:meta_key>
   <wp:meta_value><![CDATA[some-string.tpl.php]]></wp:meta_value>
</wp:postmeta>

(156 bytes)

CDATA are used for the meta_value, so that HTML entities would not be interpreted by the XML parser.

This commit makes the parser aware of a second short format where post's meta are expressed as attributes: <wp:postmeta meta_key="_wp_page_template" meta_value="some-string.tpl.php" /> (85 bytes)

In most case, none of meta_key and meta_value contain HTML entities.
An exporter can perfectly define whether short form is ok for a given meta (entity presence, longtext meta_value length > X [XML does not set any limit on this btw])

ToDo: explore possible future use of XMLReader::SUBST_ENTITIES to completely avoid cluttering XML with CDATA?

drzraf commented 6 years ago

@pbiron , for 1.3? (as far as I understand, this repository seems abandonned by HumanMade)

rmccue commented 6 years ago

@drzraf I wouldn't call it abandoned, but it's not a focus of ours right now. @swissspidy and @hyperpress also have push permissions though. :)

drzraf commented 6 years ago

There are already dozens of PR pending for almost a year and discussion about them didn't even started. That's the guarantee for a future full of (avoidable) merge-conflicts that no one will want to resolve. I'm trying my best to create feature-branch all based on master for my own PR but as pending branches start conflicting themselves without any hint about mergeability future I'd not keep up.

swissspidy commented 6 years ago

I don't have a strong opinion on this, but it sounds like an extension of the WXR format to me (which doesn't really have a spec, but still) that isn't really needed.

What's the real benefit of this? Just the reduced size?

Are there any exporters that support this format already? If I'd write a custom exporter I would probably not implement this format because it would be too much of a hassle to determine whether I can use the short format or not.

pbiron commented 6 years ago

Not only is this "not needed" (as @swissspidy said), but could actually be bad if those writing custom exporters are not aware of some of the finer points of XML processing.

XML attribute values are subject to Attribute Value Normalization (XML Spec, Section 3.3.3). Among other things, the normalization algorithm specifies that all literal white space characters in attribute values be replaced by space characters. To quote the XML Spec:

Note that if the unnormalized attribute value contains a character reference to a white space character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a white space character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a white space character; being recursively processed, the white space character is replaced with a space character (#x20) in the normalized value.

For example, to a conforming XML processor <element attr='#x9 #x20 this is a test' /> (where #x9 represents a literal tab character, and #x20 represents a literal space character )

is equivalent to <element attr='#x20 #x20 this is a test' />

I'd say it's a pretty good bet that many (most?) of those writing custom exporters are unaware of this and that if told they could export meta values as attributes they would do so in cases that would result in data loss/alteration on import.

Element values are not subject to this normalization. Hence, if the data may contain non-space white space characters it is always safer to represent it as element content and not in an attribute value.

And while PHP's XMLReader is not 100% conforming, it is with regard to Attribute Value Normalization.

So, I do not think this should be merged!

pbiron commented 6 years ago

p.s. sorry for not responding to this earlier! The ping back in April escaped me.

swissspidy commented 6 years ago

Sounds like a wontfix then :-)

I'm closing this for now.

humanmade / WordPress-Importer

Parsing <wp:postmeta> using an attribute-based/less-verbose XML structure #148