Bad XML generated from valid input file

miki725 / xunitmerge

Utility for merging multiple XUnit xml reports into a single xml report.

Bad XML generated from valid input file #4

Open santtu opened 7 years ago

santtu commented 7 years ago

With the input file

<?xml version="1.0" encoding="utf-8"?>
<testsuite errors="0" failures="0" name="" skips="0" tests="1" time="0">
  <testcase classname="test" file="file" line="15" name="test" time="0">
    <system-out>
        &lt;![CDATA[block data here]]&gt;
    </system-out>
  </testcase>
</testsuite>

and using that as the only input to generate a new file, xunitmerge in.xml out.xml produces an out.xml that is not valid XML:

<?xml version='1.0' encoding='utf-8'?>
<testsuite errors="0" failures="0" name="" skips="0" tests="1" time="0">
  <testcase classname="test" file="file" line="15" name="test" time="0">
    <system-out><![CDATA[
        <![CDATA[block data here]]>
    ]]></system-out></testcase>
</testsuite>
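The difference between the two files can be checked with a minimal sketch like the one below. It assumes only Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the constants are made up for illustration and are not part of xunitmerge.

```python
import xml.etree.ElementTree as ET

# The input file from the report above (entities escaped, so it is valid XML).
INPUT_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<testsuite errors="0" failures="0" name="" skips="0" tests="1" time="0">
  <testcase classname="test" file="file" line="15" name="test" time="0">
    <system-out>
        &lt;![CDATA[block data here]]&gt;
    </system-out>
  </testcase>
</testsuite>
"""

# The generated file from the report above (a CDATA marker inside a CDATA section).
OUTPUT_XML = b"""<?xml version='1.0' encoding='utf-8'?>
<testsuite errors="0" failures="0" name="" skips="0" tests="1" time="0">
  <testcase classname="test" file="file" line="15" name="test" time="0">
    <system-out><![CDATA[
        <![CDATA[block data here]]>
    ]]></system-out></testcase>
</testsuite>
"""


def is_well_formed(data):
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print("input well-formed: ", is_well_formed(INPUT_XML))
print("output well-formed:", is_well_formed(OUTPUT_XML))
```

The first ]]> closes the outer CDATA section early, so the second ]]> ends up in ordinary character data, where the XML specification does not allow it.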

santtu commented 7 years ago

The underlying problem is that system-out can contain arbitrary values. In this case it contained raw HTML, which in turn contained CDATA, and py.test had escaped the original HTML, including turning <![CDATA into &lt;![CDATA.

The code makes an implicit assumption that the data does not contain any CDATA blocks after it has been decoded from XML. So the original data did not contain a CDATA block, just "<" plus "CDATA" and so on, and the decoded text does not contain a CDATA block either (it is a string, not an XML element!).

However, outputting the text value as-is inside a CDATA block assumes that it does not contain a textual CDATA representation, which in this case is incorrect.
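To make the failure concrete, here is a small sketch using the standard xml.etree.ElementTree module. The functions naive_cdata, safe_cdata and parses are hypothetical names for illustration, not xunitmerge code. safe_cdata uses the common workaround of splitting every ]]> across two CDATA sections, which is one possible fix and not necessarily what the project would choose.

```python
import xml.etree.ElementTree as ET

# Decoding the escaped element yields a plain string that merely looks like CDATA.
decoded = ET.fromstring(
    "<system-out>&lt;![CDATA[block data here]]&gt;</system-out>"
).text


def naive_cdata(text):
    # Wraps the text as-is; breaks as soon as the text contains "]]>".
    return "<![CDATA[" + text + "]]>"


def safe_cdata(text):
    # Splits each "]]>" so the terminator never appears inside one section.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def parses(fragment):
    # True if the fragment is accepted as the content of a <system-out> element.
    try:
        ET.fromstring("<system-out>" + fragment + "</system-out>")
        return True
    except ET.ParseError:
        return False


print(repr(decoded))
print("naive wrapping parses:", parses(naive_cdata(decoded)))
print("split wrapping parses:", parses(safe_cdata(decoded)))
```

With the split variant, a parser concatenates the adjacent sections back into the original string, so the text round-trips unchanged.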

santtu commented 7 years ago

Looking at the code ... what is the rationale for patch_etree_cname (https://github.com/miki725/xunitmerge/blob/master/xunitmerge/xmerge.py#L15)? I removed its use (from https://github.com/miki725/xunitmerge/blob/master/xunitmerge/xmerge.py#L129) and the generated XML is now valid.

I do not really understand why you would need to wrap <system-out> and others in a CDATA block. If the input is valid XML (as it should be, when generated by nosetests or py.test) then outputting it as-is will keep it as valid XML. If the original text in these blocks is incorrectly formatted (e.g. it contains unescaped tags) then it is the generating program that is creating incorrect output -- but I don't see why xunitmerge should be fixing semantically incorrect XML input generated by other programs at all.

Is there some particular reason for patch_etree_cname that isn't evident to me here?
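As a rough illustration of the "output it as-is" argument, the sketch below round-trips the input through the unpatched standard xml.etree.ElementTree serializer. It does not call xunitmerge at all, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

SOURCE = b"""<?xml version="1.0" encoding="utf-8"?>
<testsuite errors="0" failures="0" name="" skips="0" tests="1" time="0">
  <testcase classname="test" file="file" line="15" name="test" time="0">
    <system-out>
        &lt;![CDATA[block data here]]&gt;
    </system-out>
  </testcase>
</testsuite>
"""

root = ET.fromstring(SOURCE)

# The stock serializer escapes "<" and ">" in text, so no CDATA wrapping is needed.
serialized = ET.tostring(root, encoding="utf-8")

# Parsing the serialized bytes again succeeds and yields the same text.
reparsed = ET.fromstring(serialized)
original_text = root.find("testcase/system-out").text
roundtrip_text = reparsed.find("testcase/system-out").text

print(serialized.decode("utf-8"))
print("text preserved:", original_text == roundtrip_text)
```

Because the serializer re-escapes the text on output, the result stays well-formed whatever string system-out holds.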

Spredzy commented 7 years ago

Any news on that one? We are facing the exact same issue with nested CDATA, which is not valid XML [1].

[1] https://en.wikipedia.org/wiki/CDATA#Nesting