Closed ParfaitG closed 9 months ago
I am not sure what is going on here, but your input document seems to have large HTML blobs embedded in <![CDATA[ sections inside the XML, and your first XSL then uses disable-output-escaping. As a result, your new_doc object contains large blobs of unparsed HTML text. Therefore xml2 can't apply the transformation; it is just text.
xml2::xml_child(new_doc)
xml2::xml_text( xml2::xml_child(new_doc))
Once you write the html text to disk while disabling escaping, and then read it again, the html actually gets parsed into an xml tree. But I think what you want to do is parse the individual html blobs?
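The point about unparsed text can be illustrated outside of R. The sketch below uses only Python's standard library (no XSLT engine is involved) and a made-up miniature placemark, so the element names and content are assumptions, not the actual KML. It shows that CDATA content arrives as a plain string with no child nodes, that normal serialization keeps it escaped, and that only parsing the unescaped markup again produces real nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a placemark whose description holds HTML in CDATA.
source = (
    "<Placemark><description><![CDATA["
    "<table><tr><td>ROUTE</td><td>1</td></tr></table>"
    "]]></description></Placemark>"
)

doc = ET.fromstring(source)
description = doc.find("description")

# The parser delivers the CDATA content as text: there are no child elements.
print(len(description))   # count of child elements under <description>
print(description.text)   # the markup, held as one string

# Ordinary serialization escapes that text, so it stays text after a round trip.
escaped = ET.tostring(doc, encoding="unicode")
print(escaped)

# Writing the text out unescaped (roughly what disable-output-escaping requests
# at serialization time) and parsing again turns the blob into real nodes.
unescaped = (
    "<Placemark><description>" + description.text + "</description></Placemark>"
)
reparsed = ET.fromstring(unescaped)
cells = [td.text for td in reparsed.iter("td")]
print(cells)
```

This mirrors why a serialize-then-parse step is needed between the two transformations: the in-memory result still holds the markup as text until something parses it.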
Good point! I thought the CDATA parsing would be the issue. Since my use case is more complex, my solution requires various conversions. Hence, I can avoid writing to disk by calling read_xml on the character conversion of the XSLT result.
library(xml2)
library(xslt)
# READ XML AND XSLT
doc <- read_xml("doc.kml", package = "xslt")
style1 <- read_xml("style1.xsl", package = "xslt")
style2 <- read_xml("style2.xsl", package = "xslt")
# RUN FIRST TRANSFORMATION
new_doc <- xml_xslt(doc, style1) |> as.character() |> read_xml()
final_doc <- xml_xslt(new_doc, style2)
final_doc
# {xml_document}
# <kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# [1] <DATA>\n <ROUTE>1</ROUTE>\n <ROUTE0>001</ROUTE0>\n <NAME>BRONZEVILLE/UNION STATION</NAME>\n <WKDAY>1</WKDAY>\n <SAT>0</SAT>\n <SUN>0</SUN>\n <SHAPE.LEN>34690.953676</ ...
# [2] <DATA>\n <ROUTE>2</ROUTE>\n <ROUTE0>002</ROUTE0>\n <NAME>HYDE PARK EXPRESS</NAME>\n <WKDAY>1</WKDAY>\n <SAT>0</SAT>\n <SUN>0</SUN>\n <SHAPE.LEN>110607.498776</SHAPE.L ...
# [3] <DATA>\n <ROUTE>3</ROUTE>\n <ROUTE0>003</ROUTE0>\n <NAME>KING DRIVE</NAME>\n <WKDAY>1</WKDAY>\n <SAT>1</SAT>\n <SUN>1</SUN>\n <SHAPE.LEN>88297.447622</SHAPE.LEN>\n</D ...
# [4] <DATA>\n <ROUTE>4</ROUTE>\n <ROUTE0>004</ROUTE0>\n <NAME>COTTAGE GROVE</NAME>\n <WKDAY>1</WKDAY>\n <SAT>1</SAT>\n <SUN>1</SUN>\n <SHAPE.LEN>106219.449701</SHAPE.LEN>\ ...
# [5] <DATA>\n <ROUTE>5</ROUTE>\n <ROUTE0>005</ROUTE0>\n <NAME>SOUTH SHORE NIGHT BUS</NAME>\n <WKDAY>0</WKDAY>\n <SAT>0</SAT>\n <SUN>0</SUN>\n <SHAPE.LEN>67048.136707</SHAP
...
With a different, simpler example without CDATA parsing, back-to-back XSLT transformations work as expected without any conversions:
library(xml2)
library(xslt)
# READ XML AND XSLT
doc <- read_xml("Input.xml", package = "xslt")
style1 <- read_xml("style1.xsl", package = "xslt")
style2 <- read_xml("style2.xsl", package = "xslt")
# RUN TRANSFORMATIONS
new_doc <- xml_xslt(doc, style1)
final_doc <- xml_xslt(new_doc, style2)
final_doc
# {xml_document}
# <data>
# [1] <aggdata>\n <industry>Media</industry>\n <SumOfRevenue>1.90416e+11</SumOfRevenue>\n <AvgOfAssets>7.84346e+10</AvgOfAssets>\n <AvgOfEquity>3.06608e+10</AvgOfEquity>\n <Ma ...
# [2] <aggdata>\n <industry>Oil &amp; Gas</industry>\n <SumOfRevenue>7.6821e+11</SumOfRevenue>\n <AvgOfAssets>1.535778e+11</AvgOfAssets>\n <AvgOfEquity>8.12524e+10</AvgOfEquity ...
# [3] <aggdata>\n <industry>Pharmaceuticals</industry>\n <SumOfRevenue>2.10975e+11</SumOfRevenue>\n <AvgOfAssets>9.49038e+10</AvgOfAssets>\n <AvgOfEquity>4.6162e+10</AvgOfEquit ...
XML
<?xml version="1.0" encoding="UTF-8"?>
<data>
<bigcompany>
<company>Company OA</company>
<industry>Oil &amp; Gas</industry>
<revenue>394105000000</revenue>
<assets>349493000000</assets>
<equity>174399000000</equity>
<netincome>32520000000</netincome>
<stockprice>89.38</stockprice>
<employees>75300</employees>
</bigcompany>
<bigcompany>
<company>Company OB</company>
<industry>Oil &amp; Gas</industry>
<revenue>200494000000</revenue>
<assets>266026000000</assets>
<equity>156191000000</equity>
<netincome>19241000000</netincome>
<stockprice>108.62</stockprice>
<employees>64700</employees>
</bigcompany>
<bigcompany>
<company>Company OC</company>
<industry>Oil &amp; Gas</industry>
<revenue>13807000000</revenue>
<assets>4726000000</assets>
<equity>16445000000</equity>
<netincome>2720000000</netincome>
<stockprice>48.5</stockprice>
<employees>22000</employees>
</bigcompany>
<bigcompany>
<company>Company OD</company>
<industry>Oil &amp; Gas</industry>
<revenue>97800000000</revenue>
<assets>30500000000</assets>
<equity>10800000000</equity>
<netincome>2700000000</netincome>
<stockprice>27.53</stockprice>
<employees>45340</employees>
</bigcompany>
<bigcompany>
<company>Company OE</company>
<industry>Oil &amp; Gas</industry>
<revenue>62004000000</revenue>
<assets>117144000000</assets>
<equity>48427000000</equity>
<netincome>8428000000</netincome>
<stockprice>66.66</stockprice>
<employees>16900</employees>
</bigcompany>
<bigcompany>
<company>Company PA</company>
<industry>Pharmaceuticals</industry>
<revenue>49605000000</revenue>
<assets>169274000000</assets>
<equity>71622000000</equity>
<netincome>9135000000</netincome>
<stockprice>30.14</stockprice>
<employees>78000</employees>
</bigcompany>
<bigcompany>
<company>Company PB</company>
<industry>Pharmaceuticals</industry>
<revenue>48047000000</revenue>
<assets>105128000000</assets>
<equity>56943000000</equity>
<netincome>6272000000</netincome>
<stockprice>55.43</stockprice>
<employees>76000</employees>
</bigcompany>
<bigcompany>
<company>Company PC</company>
<industry>Pharmaceuticals</industry>
<revenue>74331000000</revenue>
<assets>131119000000</assets>
<equity>69752000000</equity>
<netincome>16323000000</netincome>
<stockprice>102.31</stockprice>
<employees>126500</employees>
</bigcompany>
<bigcompany>
<company>Company PD</company>
<industry>Pharmaceuticals</industry>
<revenue>23113000000</revenue>
<assets>35249000000</assets>
<equity>17641000000</equity>
<netincome>4685000000</netincome>
<stockprice>67.2</stockprice>
<employees>37925</employees>
</bigcompany>
<bigcompany>
<company>Company PE</company>
<industry>Pharmaceuticals</industry>
<revenue>15879000000</revenue>
<assets>33749000000</assets>
<equity>14852000000</equity>
<netincome>2004000000</netincome>
<stockprice>58</stockprice>
<employees>28000</employees>
</bigcompany>
<bigcompany>
<company>Company MA</company>
<industry>Media</industry>
<revenue>48813000000</revenue>
<assets>84186000000</assets>
<equity>44958000000</equity>
<netincome>8004000000</netincome>
<stockprice>93.65</stockprice>
<employees>180000</employees>
</bigcompany>
<bigcompany>
<company>Company MB</company>
<industry>Media</industry>
<revenue>64657000000</revenue>
<assets>158813000000</assets>
<equity>51058000000</equity>
<netincome>7135000000</netincome>
<stockprice>57.05</stockprice>
<employees>139000</employees>
</bigcompany>
<bigcompany>
<company>Company MC</company>
<industry>Media</industry>
<revenue>31867000000</revenue>
<assets>54793000000</assets>
<equity>17418000000</equity>
<netincome>4514000000</netincome>
<stockprice>36.52</stockprice>
<employees>27000</employees>
</bigcompany>
<bigcompany>
<company>Company MD</company>
<industry>Media</industry>
<revenue>29795000000</revenue>
<assets>67994000000</assets>
<equity>29904000000</equity>
<netincome>3691000000</netincome>
<stockprice>84.3</stockprice>
<employees>26000</employees>
</bigcompany>
<bigcompany>
<company>Company ME</company>
<industry>Media</industry>
<revenue>15284000000</revenue>
<assets>26387000000</assets>
<equity>9966000000</equity>
<netincome>1879000000</netincome>
<stockprice>54.88</stockprice>
<employees>20915</employees>
</bigcompany>
</data>
XSLT 1
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="data">
<xsl:copy>
<xsl:apply-templates>
<xsl:sort select="industry" order="ascending"/>
<xsl:sort select="netincome" data-type="number" order="descending"/>
</xsl:apply-templates>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
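As a rough cross-check of what the two xsl:sort keys in this stylesheet do, the ordering can be sketched in Python: industry ascending first, then netincome descending as a number. The rows are a handful of values copied from the sample XML; the tuple layout and variable names are assumptions made for illustration, and plain string comparison stands in for the XSLT processor's text collation.

```python
# (company, industry, netincome) taken from the sample XML, deliberately shuffled.
rows = [
    ("Company OA", "Oil & Gas", 32520000000),
    ("Company MB", "Media", 7135000000),
    ("Company PA", "Pharmaceuticals", 9135000000),
    ("Company MA", "Media", 8004000000),
    ("Company PC", "Pharmaceuticals", 16323000000),
    ("Company OB", "Oil & Gas", 19241000000),
]

# First key: industry ascending. Second key: netincome descending,
# expressed by negating the number.
ordered = sorted(rows, key=lambda row: (row[1], -row[2]))

for company, industry, netincome in ordered:
    print(industry, company, netincome)
```

Within each industry the company with the highest net income ends up first, which is the ordering the second stylesheet depends on when it picks rows by position.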
XSLT 2
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:key name="indkey" match="bigcompany/industry" use="."/>
<xsl:template match="data">
<data>
<xsl:for-each select="bigcompany/industry[generate-id() = generate-id(key('indkey', .)[1])]">
<xsl:sort select="." order="ascending"/>
<aggdata>
<xsl:copy-of select="."/>
<SumOfRevenue><xsl:copy-of select="sum(key('indkey', .)/../revenue)"/></SumOfRevenue>
<AvgOfAssets><xsl:copy-of select="sum(key('indkey', .)/../assets) div count(key('indkey', .)/../assets)"/></AvgOfAssets>
<AvgOfEquity><xsl:copy-of select="sum(key('indkey', .)/../equity) div count(key('indkey', .)/../equity)"/></AvgOfEquity>
<MaxOfIncome><xsl:value-of select="key('indkey', .)[1]/../netincome"/></MaxOfIncome>
<MinOfIncome><xsl:value-of select="key('indkey', .)[5]/../netincome"/></MinOfIncome>
<AvgOfStockPrice><xsl:copy-of select="sum(key('indkey', .)/../stockprice) div count(key('indkey', .)/../stockprice)"/></AvgOfStockPrice>
<SumOfEmployees><xsl:copy-of select="sum(key('indkey', .)/../employees)"/></SumOfEmployees>
</aggdata>
</xsl:for-each>
</data>
</xsl:template>
</xsl:stylesheet>
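The grouping in this stylesheet (one aggdata element per distinct industry, with sums and averages over the group) can be sketched in Python with the standard library. The sample holds the five Media rows from the XML above, trimmed to four fields; the function and variable names are hypothetical. One deliberate difference: the sketch takes max() and min() of netincome directly, whereas the stylesheet picks positions [1] and [5], which relies on the sort order produced by the first stylesheet and on each industry having exactly five companies.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The five Media rows from the sample XML, trimmed to four fields each.
SAMPLE = """<data>
  <bigcompany><industry>Media</industry><revenue>48813000000</revenue><assets>84186000000</assets><netincome>8004000000</netincome></bigcompany>
  <bigcompany><industry>Media</industry><revenue>64657000000</revenue><assets>158813000000</assets><netincome>7135000000</netincome></bigcompany>
  <bigcompany><industry>Media</industry><revenue>31867000000</revenue><assets>54793000000</assets><netincome>4514000000</netincome></bigcompany>
  <bigcompany><industry>Media</industry><revenue>29795000000</revenue><assets>67994000000</assets><netincome>3691000000</netincome></bigcompany>
  <bigcompany><industry>Media</industry><revenue>15284000000</revenue><assets>26387000000</assets><netincome>1879000000</netincome></bigcompany>
</data>"""


def column(rows, tag):
    """Return the integer values of child element `tag` for every row."""
    return [int(row.findtext(tag)) for row in rows]


def aggregate(xml_text):
    """Group <bigcompany> rows by <industry> and compute per-group aggregates."""
    groups = defaultdict(list)
    for company in ET.fromstring(xml_text).iter("bigcompany"):
        groups[company.findtext("industry")].append(company)

    summary = {}
    for industry in sorted(groups):
        rows = groups[industry]
        assets = column(rows, "assets")
        income = column(rows, "netincome")
        summary[industry] = {
            "SumOfRevenue": sum(column(rows, "revenue")),
            "AvgOfAssets": sum(assets) / len(assets),
            "MaxOfIncome": max(income),
            "MinOfIncome": min(income),
        }
    return summary


summary = aggregate(SAMPLE)
print(summary["Media"])
```

For the Media rows the revenue sum is 190416000000 and the asset average is 78434600000, which agrees with the 1.90416e+11 and 7.84346e+10 shown in the R output above.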
Though, I do wonder if there is a non-API-breaking way to implicitly attempt this XML tree conversion when the XSLT targets the output method xml rather than text and the result is well-formed XML, as my first transformation renders. Otherwise, fall back to a character or text type? But embedded CDATA XML and/or HTML may be edge cases.
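To make the "attempt the tree conversion, otherwise fall back to text" idea concrete, here is a minimal sketch in Python using only the standard library. It is not how xslt or lxml behave today; the function name as_tree_or_text is hypothetical, and a real implementation would presumably also consult the stylesheet's xsl:output method before trying to parse.

```python
import xml.etree.ElementTree as ET


def as_tree_or_text(serialized):
    """Return a parsed Element if `serialized` is well-formed XML,
    otherwise return the string unchanged."""
    try:
        return ET.fromstring(serialized)
    except ET.ParseError:
        return serialized


# Well-formed XML output becomes a tree that a second transformation could use.
tree = as_tree_or_text("<kml><DATA><ROUTE>1</ROUTE></DATA></kml>")
print(type(tree).__name__, tree.findtext("DATA/ROUTE"))

# Text output (for example CSV produced with method="text") stays a string.
text = as_tree_or_text("ROUTE,NAME\n1,BRONZEVILLE/UNION STATION")
print(type(text).__name__)
```

The fallback keeps existing callers that expect text working, which is the non-breaking property asked about; whether silently switching return types is desirable is a separate design question.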
And this may go beyond the package level, as Python's lxml behaves very similarly to R's xslt, requiring the same conversion of string to XML tree. I believe both use similar underlying XSLT engines.
import lxml.etree as lx
# READ XML AND XSLT
doc = lx.parse("doc.kml")
style1 = lx.parse("style1.xsl")
style2 = lx.parse("style2.xsl")
# RUN TRANSFORMATIONS
transformer1 = lx.XSLT(style1)
new_doc = lx.fromstring(str(transformer1(doc)))
transformer2 = lx.XSLT(style2)
final_doc = transformer2(new_doc)
print(final_doc)
# <?xml version="1.0"?>
# <kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
# <DATA>
# <ROUTE>1</ROUTE>
# <ROUTE0>001</ROUTE0>
# <NAME>BRONZEVILLE/UNION STATION</NAME>
# <WKDAY>1</WKDAY>
# <SAT>0</SAT>
# <SUN>0</SUN>
# <SHAPE.LEN>34690.953676</SHAPE.LEN>
# </DATA>
# <DATA>
# <ROUTE>2</ROUTE>
# <ROUTE0>002</ROUTE0>
# <NAME>HYDE PARK EXPRESS</NAME>
# <WKDAY>1</WKDAY>
# <SAT>0</SAT>
# <SUN>0</SUN>
# <SHAPE.LEN>110607.498776</SHAPE.LEN>
# </DATA>
# <DATA>
# <ROUTE>3</ROUTE>
# <ROUTE0>003</ROUTE0>
# <NAME>KING DRIVE</NAME>
# <WKDAY>1</WKDAY>
# <SAT>1</SAT>
# <SUN>1</SUN>
# <SHAPE.LEN>88297.447622</SHAPE.LEN>
# </DATA>
# ...
To transform a KML file for a data frame build, I attempted back-to-back calls of xml_xslt(), which does not yield the correct result and does not raise any error. Oddly, only the root node is output.
However, saving the first transformation to disk with xml2::write_xml followed by xml2::read_xml, and then running a second xml_xslt, does output the correct, desired result. See the reproducible example below with source files.
Can the issue involve the default KML namespace handling? Outputs of read_xml and xml_xslt both return xml_document types. My XSLT 1.0 scripts are fully compliant, validated with Linux's xsltproc and with an online fiddle.
R (see differences in final_doc)
Sources
KML
Chicago Transit Authority: CTA - Bus Routes KML
XSLT 1
XSLT 2
Session