omegahat / XML

The XML package for R

Add XSLT 1.0 transformation method #14

Closed ParfaitG closed 4 years ago

ParfaitG commented 4 years ago

As a feature request, please consider adding XSLT 1.0 support to the omegahat/XML project. The package already supports XPath 1.0; to be a fully robust DOM library, omegahat/XML would benefit from XSLT 1.0 support as well, allowing users to transform XML files with external XSL stylesheets into new XML, HTML, or even text output.

Currently, in R one must use the xslt package, a sister extension of xml2, which is a different XML package adopted into the tidyverse ecosystem. However, an integrated XSLT processor in XML would keep to the base R and S flavor and let users leverage existing code and libraries to bring complex, nested, and dense XML files into the R environment.

One need that comes up constantly in R questions on StackOverflow is how to import nested XML into a data frame. For flat (few nests), element-centric (no attributes) XML, xmlToDataFrame is a very convenient function. But for more complex XML files with attribute values, various looped calls to lapply and xpathSApply are required to bind the returned vectors into data frames.
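
To make the looped approach concrete, here is a minimal sketch, assuming a small hypothetical document in which each race element carries attributes as well as child elements (the element and attribute names are invented for illustration). It issues one xpathSApply call per column and binds the resulting vectors:

```r
library(XML)

# Hypothetical sample data: attributes on <race>, values in child elements
xml_text <- '<races>
  <race id="1" track="Oval"><winner>Ann</winner><laps>50</laps></race>
  <race id="2" track="Road"><winner>Bob</winner><laps>42</laps></race>
</races>'

doc <- xmlParse(xml_text, asText = TRUE)

# One XPath query per column; each call returns a vector with one entry per race
df <- data.frame(
  id     = xpathSApply(doc, "//race", xmlGetAttr, "id"),
  track  = xpathSApply(doc, "//race", xmlGetAttr, "track"),
  winner = xpathSApply(doc, "//race/winner", xmlValue),
  laps   = as.integer(xpathSApply(doc, "//race/laps", xmlValue)),
  stringsAsFactors = FALSE
)
```

Every additional attribute or nesting level means another query, and the columns only line up if each race has exactly one match per query, which is what makes this approach tedious for dense files.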

One solution I have advocated in many SO answers is to use XSLT: a W3C standards-compliant, well-known, special-purpose, declarative language used regularly in the industry. With its functional, recursive nature, it can transform any original XML to any XML, HTML, or text output for end-use needs such as flattening to a 2D row-by-column structure for R data frames. However, running XSLT in R currently requires a mix of tools, as illustrated below.

library(XML)
library(xslt)

# LOAD XML AND XSL
input <- read_xml("/path/to/input.xml")
style <- read_xml("/path/to/xslt_script.xsl")

# TRANSFORM INPUT INTO OUTPUT
new_xml <- xml_xslt(input, style)
output <- as.character(new_xml)

# PARSE OUTPUT FROM STRING
doc <- xmlParse(output, asText=TRUE)

# BUILD DATAFRAME (getNodeSet REQUIRES AN XPATH; '//race' IS THIS EXAMPLE'S ROW ELEMENT)
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, '//race'))
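
For a sense of what a flattening stylesheet might look like, here is a self-contained sketch using inline strings instead of files. The sample XML and the stylesheet are hypothetical illustrations, not taken from any real dataset; the stylesheet turns every attribute of a race element into a child element so that xmlToDataFrame can read the result directly:

```r
library(xml2)
library(xslt)
library(XML)

# Hypothetical input: attribute-centric XML that xmlToDataFrame cannot flatten alone
input <- read_xml('<races>
  <race id="1" track="Oval"><winner>Ann</winner></race>
  <race id="2" track="Road"><winner>Bob</winner></race>
</races>')

# Illustrative XSLT 1.0 stylesheet: attributes become same-named child elements
style <- read_xml('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/races">
    <races><xsl:apply-templates select="race"/></races>
  </xsl:template>
  <xsl:template match="race">
    <race>
      <xsl:for-each select="@*">
        <xsl:element name="{name()}"><xsl:value-of select="."/></xsl:element>
      </xsl:for-each>
      <xsl:copy-of select="*"/>
    </race>
  </xsl:template>
</xsl:stylesheet>')

# Transform with xslt, then hand the flat result to the XML package
flat <- xml_xslt(input, style)
doc  <- xmlParse(as.character(flat), asText = TRUE)
df   <- xmlToDataFrame(doc, nodes = getNodeSet(doc, "//race"),
                       stringsAsFactors = FALSE)
```

The round trip through as.character is the cost of mixing the two packages: xslt works on xml2 documents, while xmlToDataFrame expects a document parsed by XML.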

Alternatively, using a command-line tool such as xsltproc on Unix-like machines:

library(XML)

# COMMAND LINE CALL TO UNIX'S XSLTPROC (ALTERNATIVE TO xslt PACKAGE)
# USAGE: xsltproc -o <output> <stylesheet> <input>
system("xsltproc -o /path/to/output.xml /path/to/xslt_script.xsl /path/to/input.xml")
doc <- xmlParse("/path/to/output.xml")

# BUILD DATAFRAME
df <- xmlToDataFrame(doc, nodes=getNodeSet(doc, '//race'))
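
Because xsltproc expects the output file after -o, then the stylesheet, then the input, the argument order is easy to get wrong. One way to guard against that is a small wrapper, sketched below; xsltproc_args and run_xsltproc are hypothetical helper names, not part of any package:

```r
# Hypothetical helper: build the argument vector in the order xsltproc expects,
# i.e. -o <output> <stylesheet> <input>
xsltproc_args <- function(xsl, input, output) {
  c("-o", output, xsl, input)
}

# Hypothetical helper: run xsltproc and stop on failure instead of silently
# parsing a stale or missing output file afterwards
run_xsltproc <- function(xsl, input, output) {
  exe <- unname(Sys.which("xsltproc"))
  if (!nzchar(exe)) stop("xsltproc not found on PATH")
  status <- system2(exe, args = shQuote(xsltproc_args(xsl, input, output)))
  if (status != 0) stop("xsltproc failed with exit status ", status)
  invisible(output)
}
```

A call such as run_xsltproc("/path/to/xslt_script.xsl", "/path/to/input.xml", "/path/to/output.xml") could then replace the bare system() call, with xmlParse reading the output file only if the transformation succeeded.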

And on Windows, via a PowerShell script that interfaces with the built-in .NET System.Xml.Xsl.XslCompiledTransform class:

PowerShell

param ($xml, $xsl, $output)

if (-not $xml -or -not $xsl -or -not $output) {
    Write-Host "& .\xslt.ps1 [-xml] xml-input [-xsl] xsl-input [-output] transform-output"
    exit;
}

trap [Exception]{
    Write-Host $_.Exception;
}

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;
$xslt.Load($xsl);
$xslt.Transform($xml, $output);

Write-Host "generated" $output;

R

library(XML)

system(paste0('Powershell.exe -File',
              ' "C:\\Path\\To\\PowerShell\\Script.ps1"',
              ' "C:\\Path\\To\\Input.xml"',
              ' "C:\\Path\\To\\XSLT\\Script.xsl"', 
              ' "C:\\Path\\To\\Output.xml"'))

df <- xmlToDataFrame("C:\\Path\\To\\Output.xml")

Many open-source DOM libraries, including Python's lxml, PHP's xsl extension, Perl's XML::LibXSLT module, and even R's xslt package, use the libxslt C library, whose API provides the following functions:

doc = xmlParseFile(...);
style = xsltParseStylesheetFile(...);
res = xsltApplyStylesheet(style, doc, params);

As a good project to galvanize activity on this awesome omegahat/XML package, please consider adding XSLT 1.0 support in the near future.

duncantl commented 4 years ago

The Sxslt package has been around for about 18 years.

ParfaitG commented 4 years ago

Ahhh, yes! I remember the Sxslt package, but I recall having trouble running it on Windows; at the time, it appeared to be a Linux-only package. It was such a challenge to run that I stopped attempting to use it and apparently even forgot all about it. I also see that, unlike XML, it is not available on CRAN. I will re-attempt using devtools::install_github("omegahat/Sxslt") and report any issues on that Git page.