sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

JRuby nokogiri incorrectly includes XML declaration for HTML transformation #1430

Open jvshahid opened 8 years ago

jvshahid commented 8 years ago

Using the following code:

input_xml = <<-EOS
<?xml version="1.0" encoding="utf-8"?>
<report>
  <title>My Report</title>
</report>
EOS

input_xsl = <<-EOS
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head>
        <title><xsl:value-of select="report/title"/></title>
      </head>
      <body>
        <h1><xsl:value-of select="report/title"/></h1>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
EOS

require 'nokogiri'

xml = ::Nokogiri::XML(input_xml)
xsl = ::Nokogiri::XSLT(input_xsl)

puts xsl.apply_to(xml)

expected behavior:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>My Report</title>
</head>
<body><h1>My Report</h1></body>
</html>

actual behavior:

<?xml version="1.0" encoding="UTF-8"?><html><head><title>My Report</title></head><body><h1>My Report</h1></body></html>

cbasguti commented 1 year ago

Hey everyone! Just wanted to give you a heads up that I'm actively working on this issue and putting in my best effort to find a solution.

stevecheckoway commented 1 year ago

I'm not an expert on XSLT, but shouldn't the output method be set by an xsl:output element?

stevecheckoway commented 1 year ago

To follow up on this, I think our JRuby XSLT processor is incorrectly determining the default output method. Here's what the XSLT 1.0 standard has to say:

The default for the method attribute is chosen as follows. If

  • the root node of the result tree has an element child,

  • the expanded-name of the first element child of the root node (i.e. the document element) of the result tree has local part html (in any combination of upper and lower case) and a null namespace URI, and

  • any text nodes preceding the first element child of the root node of the result tree contain only whitespace characters,

then the default output method is html; otherwise, the default output method is xml. The default output method should be used if there are no xsl:output elements or if none of the xsl:output elements specifies a value for the method attribute.

It's not clear to me why this is failing. The UNKNOWN method's description doesn't match what the standard says, but it seems like it should still be serializing this as html.
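
For reference, the rule quoted above can be sketched in plain Ruby. This is only an illustration of the three conditions, not what Xalan or Nokogiri actually run: `ResultNode` and `default_output_method` are hypothetical names, and the children of the result tree's root node are modelled as simple structs rather than real document nodes.

```ruby
# Hypothetical stand-in for one child of the result tree's root node.
# kind:      :element, :text, or anything else (comment, processing instruction)
# name:      local part of the element name (elements only)
# namespace: namespace URI, or nil for the null namespace (elements only)
# text:      character content (text nodes only)
ResultNode = Struct.new(:kind, :name, :namespace, :text, keyword_init: true)

# Returns "html" or "xml" according to the default-method rule quoted above.
def default_output_method(root_children)
  first_element_index = root_children.index { |node| node.kind == :element }

  # Condition 1: the root node must have an element child.
  return "xml" if first_element_index.nil?

  first_element = root_children[first_element_index]

  # Condition 2: local part is "html" in any mix of case, with a null namespace URI.
  html_name = first_element.name.to_s.casecmp?("html")
  null_namespace = first_element.namespace.nil? || first_element.namespace.empty?

  # Condition 3: text nodes before the first element contain only whitespace.
  preceding_text_blank = root_children.first(first_element_index)
    .select { |node| node.kind == :text }
    .all? { |node| node.text.to_s.match?(/\A[ \t\r\n]*\z/) }

  html_name && null_namespace && preceding_text_blank ? "html" : "xml"
end

# The result tree from the report example: an <html> element in no namespace.
report_result = [ResultNode.new(kind: :element, name: "html")]
default_output_method(report_result) # => "html"

# The same element in the XHTML namespace falls back to xml.
xhtml_result = [ResultNode.new(kind: :element, name: "html",
                               namespace: "http://www.w3.org/1999/xhtml")]
default_output_method(xhtml_result) # => "xml"
```

By this reading, the stylesheet in the original report produces an `html` document element with no namespace and no preceding text, so the default method should be html.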

To get the output you want, I think you just need to use an xsl:output element. I've added a line to your example:

input_xml = <<-EOS
<?xml version="1.0" encoding="utf-8"?>
<report>
  <title>My Report</title>
</report>
EOS

input_xsl = <<-EOS
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="utf-8" />
  <xsl:template match="/">
    <html>
      <head>
        <title><xsl:value-of select="report/title"/></title>
      </head>
      <body>
        <h1><xsl:value-of select="report/title"/></h1>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
EOS

require 'nokogiri'

xml = ::Nokogiri::XML(input_xml)
xsl = ::Nokogiri::XSLT(input_xsl)

puts xsl.apply_to(xml)

The output I get is

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>My Report</title>
</head>
<body>
<h1>My Report</h1>
</body>
</html>

root@e350b53df4bf:/usr/src/myapp# jruby --version
jruby 9.4.3.0 (3.1.4) 2023-06-07 3086960792 OpenJDK 64-Bit Server VM 25.372-b07 on 1.8.0_372-b07 +jit [aarch64-linux]
root@e350b53df4bf:/usr/src/myapp# nokogiri --version
/usr/local/bundle/gems/nokogiri-1.15.2-java/lib/nokogiri/xml/node.rb:1007: warning: method redefined; discarding old attr
# Nokogiri (1.15.2)
    ---
    warnings: []
    nokogiri:
      version: 1.15.2
    ruby:
      version: 3.1.4
      platform: java
      gem_platform: universal-java-1.8
      description: jruby 9.4.3.0 (3.1.4) 2023-06-07 3086960792 OpenJDK 64-Bit Server VM
        25.372-b07 on 1.8.0_372-b07 +jit [aarch64-linux]
      engine: jruby
      jruby: 9.4.3.0
    other_libraries:
      isorelax:isorelax: '20030108'
      net.sf.saxon:Saxon-HE: 9.6.0-4
      net.sourceforge.htmlunit:neko-htmlunit: 2.63.0
      nu.validator:jing: 20200702VNU
      org.nokogiri:nekodtd: 0.1.11.noko2
      xalan:serializer: 2.7.3
      xalan:xalan: 2.7.3
      xerces:xercesImpl: 2.12.2
      xml-apis:xml-apis: 1.4.01

flavorjones commented 1 year ago

I agree with @stevecheckoway's take that Xalan should be handling this case correctly, looking at the code for ToUnknownStream.java.

If someone could throw this into a java debugger and tell us what's going on in Xalan that would be extremely helpful. I just spent an hour trying to get jdb to work on my system and couldn't figure it out.