Jruby nokogiri 1.9.0+, rehomed elements don't keep namespaces, plus regression from 1.8

sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

https://nokogiri.org/

MIT License

6.15k stars 901 forks source link

Jruby nokogiri 1.9.0+, rehomed elements don't keep namespaces, plus regression from 1.8 #1875

Closed jrochkind closed 3 years ago

jrochkind commented 5 years ago

This is a followup to #1774. I'm sorry it took me a while to get back around to it.

While a fix to #1774 was provided in #1778, the fix only works for a default namespace, not one with a prefix. (It hadn't occured to me in my repro for #1774 that I should provide both, as they may be have different codepaths in java nokogiri I guess? )

What is worse, there is a regression in 1.9.0+, possibly introduced by #1778, but I haven't verified this.

I used to be able to work around #1774, by explicitly calling add_namespace to add the namespaces I wanted on the rehomed element.

However, my workaround that worked in JRuby 1.8.5 stopped working in JRuby 1.9.0+, so I'm kind of stymied and maybe even worse off now. :(

Reproductions are all in MRI 2.5 and JRuby 9.2.0.0, but I do not believe specific MRI/JRuby versions matter.

xml_str = <<~EOS
<?xml version="1.0" encoding="UTF-8"?>
<top xmlns="http://example.org/top" xmlns:a="http://example.org/a">
<record>
<a:something>a:something</a:something>
</record>
</top>
EOS

noko_doc = Nokogiri::XML.parse(xml_str)

record_element = noko_doc.xpath("//top:record", {top: "http://example.org/top"}).first

new_doc = Nokogiri::XML::Document.new
new_doc.root = record_element

puts new_doc.namespaces

# JRuby nokogiri 1.8.5:
#    {}

# JRuby nokogiri 1.9.0+: default ns is there, but the `a` namespace did not transfer
#    {"xmlns"=>"http://example.org/top"}

# MRI 2.5, both nokogiri versions
# (and I consider this the correct and most useful result):
#    {"xmlns"=>"http://example.org/top", "xmlns:a"=>"http://example.org/a", "xmlns:b"=>"http://example.org/b"}

# Workaround I used to use in Jruby nokogiri 1.8.0, which now fails in Jruby nokogiri 1.9.0
new_doc.root.add_namespace("a", "http://example.org/a")

puts "\nafter add_namespace:\n"
puts new_doc.namespaces.inspect
puts new_doc.to_xml

# JRuby nokogiri 1.8.5:
# the one we added was there:
# {"xmlns:a"=>"http://example.org/a"}
#
# And is included in the output:
# <?xml version="1.0"?>
# <record xmlns:a="http://example.org/a">
#   <a:something>a:something</a:something>
# </record>

# JRuby nokogiri 1.9.0+:
# The one we added isn't there at all, and is not included in the output either:
# {"xmlns"=>"http://example.org/top"}  # new namespace we added had no effect on root element
#
# And the "a" namespace is magically on the <a:something> tag, but we wanted to add it on
# the _root_
# <?xml version="1.0"?>
# <record xmlns="http://example.org/top">
#   <a:something xmlns:a="http://example.org/a">a:something</a:something>
# </record>
# 
# 
# MRI does not require the "add_namespace" workaround, and still outputs what
# I consider correct and desirable XML:
# <?xml version="1.0"?>
# <record xmlns="http://example.org/top" xmlns:a="http://example.org/a">
#    <a:something>a:something</a:something>
# </record>

jvshahid commented 5 years ago

This is a followup to #1774. I'm sorry it took me a while to get back around to it.

While a fix to #1774 was provided in #1778, the fix only works for a default namespace, not one with a prefix. (It hadn't occured to me in my repro for #1774 that I should provide both, as they may be have different codepaths in java nokogiri I guess? )

AFAIU the JRuby implementation is conforming to the standard. See my explanation inlined below.

What is worse, there is a regression in 1.9.0+, possibly introduced by #1778, but I haven't verified this.

I used to be able to work around #1774, by explicitly calling add_namespace to add the namespaces I wanted on the rehomed element.

I am not sure what is going on here. In particular, I don't remember what add_namespace does. I will have to look into this later.

Reproductions are all in MRI 2.5 and JRuby 9.2.0.0, but I do not believe specific MRI/JRuby versions matter.

xml_str = <<~EOS
<?xml version="1.0" encoding="UTF-8"?>
<top xmlns="http://example.org/top" xmlns:a="http://example.org/a">
<record>
<a:something>a:something</a:something>
</record>
</top>
EOS

noko_doc = Nokogiri::XML.parse(xml_str)

record_element = noko_doc.xpath("//top:record", {top: "http://example.org/top"}).first

new_doc = Nokogiri::XML::Document.new
new_doc.root = record_element

puts new_doc.namespaces

# JRuby nokogiri 1.8.5:
#    {}

# JRuby nokogiri 1.9.0+: default ns is there, but the `a` namespace did not transfer
#    {"xmlns"=>"http://example.org/top"}

This is expected. XML::Node::namespaces returns the namespaces that are in scope of the current node. As pointed out later, JRuby puts the namespace declaration on the something element. AFAIU both MRI and JRuby implementation are correct. I am also curious about what @flavorjones think regarding this difference.

# MRI 2.5, both nokogiri versions
# (and I consider this the correct and most useful result):
#    {"xmlns"=>"http://example.org/top", "xmlns:a"=>"http://example.org/a", "xmlns:b"=>"http://example.org/b"}

I don't understand where http://example.org/b is coming from. Is that a copy paste error ?

# And the "a" namespace is magically on the <a:something> tag, but we wanted to add it on
# the _root_
# <?xml version="1.0"?>
# <record xmlns="http://example.org/top">
#   <a:something xmlns:a="http://example.org/a">a:something</a:something>
# </record>

It isn't magical. The namespaces that are in scope of something are the default namespace and the xmlns:a namespace. For example puts new_doc.root.children[1].namespaces.inspect (children[0] is a text node) prints both namespaces:

{"xmlns:a"=>"http://example.org/a", "xmlns"=>"http://example.org/top"}

jrochkind commented 5 years ago

Thanks for response!

Are you suggesting that MRI and Jruby are both correct, even though they don't match? I don't understand how this can be, I thought MRI and JRuby nokogiri implementations could be expected to match. To me, the MRI implementation is the most useful, as well as seeming correct. The JRuby implementation has changed between nokogiri 1.8 and 1.9, and neither of them match MRI. (Are you suggesting that MRI ought to match... JRuby 1.9, even though it never has?)

What I need to do is remove a sub-tree (a node and it's children) from a document, and make it into it's own document -- with all namespaces semantically intact. In MRI nokogiri, I can do this. In Java nokogiri 1.8, I could do it via a workaround (I considered it a workaround to a bug). In Java nokogiri 1.9+, I can find no way to do this at all. (Short of serializing to a string and re-parsing). That doesn't seem right to me.

I don't understand where http://example.org/b is coming from. Is that a copy paste error ?

It was, copy-pasted from the wrong example, as I was messing with different examples to try and isolate the error. (Not sure if it's more or less confusing for me to go edit my original text at this point?)

jvshahid commented 5 years ago

Are you suggesting that MRI and Jruby are both correct, even though they don't match ?

Yes. They are equivalent, AFAIK.

I don't understand how this can be, I thought MRI and JRuby nokogiri implementations could be expected to match.

Not necessarily, specially if the output XML is equivalent, such as this case.

To me, the MRI implementation is the most useful, as well as seeming correct.

I don't think I understand how the MRI output is more useful. I think this is worth clarifying.

What I need to do is remove a sub-tree (a node and it's children) from a document, and make it into it's own document -- with all namespaces semantically intact.

Isn't that exactly what the current version of Nokogiri does ? JRuby is adding the namespace declaration to the node itself, instead of adding the namesapce declaration on the parent node. Both implementations are correct 1.

[...]

In Java nokogiri 1.9+, I can find no way to do this at all. (Short of serializing to a string and re-parsing). That doesn't seem right to me.

Can you provide an example code to demonstrate how the JRuby's output is problematic. Is the consumer of the output assuming that doc.namespaces will return all namespaces in the document ?

jrochkind commented 5 years ago

Basically I want to write code that uses namespaces that will work under MRI or JRuby, and write tests that will pass under either. I want equivalent behavior. Which I thought was the contract of nokogiri.

This includes add_namespace having the same effect in either, namespaces returning the same thing in either, and produced serialization from the same operations on the same input being equivalent in both.

But I see your point about the XML trees being semantically equivalent in both. (Note: semantically equivalent to a namespace-aware processor; probably not to a non-namespace-aware-processor that thinks "a:something" is just a tag name and "xmlns:a" just an attribute name.) I will spend some more with it and satisfy myself that this is true; it still wouldn't totally satisfy me, but I see your point.

It seems likely to me that there are cases where the JRuby noko 1.9 version would produce a serialization that is a lot more bytes than the MRI version. If there are a lot of a:something elements, which in the JRuby version each get an xmlns declaration serialized on that element.

Which suggests another point: Let's say I have an in-memory representation that matches:

<record>
    <a:something xmlns:a="http://example.org/a">a:something</a:something>
    <a:something xmlns:a="http://example.org/a">a:something</a:something>
</record>

If I want to "hoist" that xmlns:a= decleration to be on the record element instead, so that when it is serialized it's on the record element, and not present on the a:something elements -- should nokogiri give me API to do that?

I don't think nokogiri gives me API to do that. The nokogiri API doesn't let you alter namespace declarations much, if at all. At least in Jruby nokogiri, not totally sure about MRI nokogiri (if they differ, that goes back to the point about MRI/nokogiri divergence).

jrochkind commented 5 years ago

Thanks for your analysis @jvshahid !

I see how the documents are semantically identical. It could be (and was!) a lot worse.

I still think ideally:

Jruby behavior should match MRI. So test or other code written under MRI can pass in JRuby without changes, and JRuby nokogiri can be considered entirely compat with MRI.
Failing that (or for other reasons), there ought to be a way to use Nokogiri API on an in-memory representation to change what elements xlmns declarations appear on. I don't believe there is any such API (def on Jruby, I tried lots of things, but I believe in MRI too). xmlns namespace declerations (and on what element they will be serialized when writing out) are essentially immutable.

However, in the current implementation, I've changed my code to test for "namespace-aware semantic equivalence", where it was failing because serializations were not identical.

(In particular, if you take an existing document, and do node.to_xml, vs re-homing that node to new_document.root = node and doing a to_xml, you may see xmlns attributes on different nodes in those two different serializations, in Jruby).

Here is the pretty hacky method I wrote to test "namespace-aware semantic equivalence" of two different nokogiri elements, in case it's useful to anyone.

  def ns_semantic_equivalent_xml?(noko_a, noko_b)
    noko_a = noko_a.root if noko_a.kind_of?(Nokogiri::XML::Document)
    noko_b = noko_b.root if noko_b.kind_of?(Nokogiri::XML::Document)

    noko_a.name == noko_b.name &&
      noko_a.namespace&.prefix == noko_b.namespace&.prefix &&
      noko_a.namespace&.href   == noko_b.namespace&.href &&
      noko_a.attributes        == noko_b.attributes &&
      noko_a.children.length   == noko_b.children.length &&
      noko_a.children.each_with_index.all? do |a_child, index|
        ns_semantic_equivalent_xml?(a_child, noko_b.children[index])
      end
  end

flavorjones commented 3 years ago

Hi, sorry for not responding before now on this thread.

Jruby behavior should match MRI

Maybe in an ideal world; but the JRuby implementation is a completely separate implementation from the CRuby backend and the benefit in this case is low relative to the cost. The "correct" implementation could be argued either way, and I'm unaware of any XML spec that covers reparenting nodes into new documents.

I would welcome a pull request that cleaned this up, but otherwise I will not personally prioritize it. I'm sorry your experience isn't seamless across these implementations.

flavorjones commented 3 years ago

For posterity, the CRuby behavior is to create a doc that looks like:

<?xml version="1.0"?>
<record xmlns="http://example.org/top" xmlns:a="http://example.org/a">
<a:something>a:something</a:something>
</record>

and the JRuby behavior is:

<?xml version="1.0"?>
<record xmlns="http://example.org/top">
<a:something xmlns:a="http://example.org/a">a:something</a:something>
</record>

jrochkind commented 3 years ago

Thanks. If anyone can figure out any invocation to get the JRuby serialized doc to look like the CRuby one, like any methods I can call to make that happen, please do share! I would love to be able to create a doc by mutating existing documents, in Jruby, that would serialize the way the Cruby one does -- I actually used to have a workaround to do it prior to 1.9.0, but now can't figure out any way to do it at all, which is a problem. :(