Closed jrochkind closed 3 years ago
This is a followup to #1774. I'm sorry it took me a while to get back around to it.
While a fix to #1774 was provided in #1778, the fix only works for a default namespace, not one with a prefix. (It hadn't occured to me in my repro for #1774 that I should provide both, as they may be have different codepaths in java nokogiri I guess? )
AFAIU the JRuby implementation is conforming to the standard. See my explanation inlined below.
What is worse, there is a regression in 1.9.0+, possibly introduced by #1778, but I haven't verified this.
I used to be able to work around #1774, by explicitly calling
add_namespace
to add the namespaces I wanted on the rehomed element.
I am not sure what is going on here. In particular, I don't remember what add_namespace
does. I will have to look into this later.
Reproductions are all in MRI 2.5 and JRuby 9.2.0.0, but I do not believe specific MRI/JRuby versions matter.
xml_str = <<~EOS <?xml version="1.0" encoding="UTF-8"?> <top xmlns="http://example.org/top" xmlns:a="http://example.org/a"> <record> <a:something>a:something</a:something> </record> </top> EOS noko_doc = Nokogiri::XML.parse(xml_str) record_element = noko_doc.xpath("//top:record", {top: "http://example.org/top"}).first new_doc = Nokogiri::XML::Document.new new_doc.root = record_element puts new_doc.namespaces # JRuby nokogiri 1.8.5: # {} # JRuby nokogiri 1.9.0+: default ns is there, but the `a` namespace did not transfer # {"xmlns"=>"http://example.org/top"}
This is expected. XML::Node::namespaces
returns the namespaces that are in scope of the current node. As pointed out later, JRuby puts the namespace declaration on the something
element. AFAIU both MRI and JRuby implementation are correct. I am also curious about what @flavorjones think regarding this difference.
# MRI 2.5, both nokogiri versions # (and I consider this the correct and most useful result): # {"xmlns"=>"http://example.org/top", "xmlns:a"=>"http://example.org/a", "xmlns:b"=>"http://example.org/b"}
I don't understand where http://example.org/b
is coming from. Is that a copy paste error ?
# And the "a" namespace is magically on the <a:something> tag, but we wanted to add it on # the _root_ # <?xml version="1.0"?> # <record xmlns="http://example.org/top"> # <a:something xmlns:a="http://example.org/a">a:something</a:something> # </record>
It isn't magical. The namespaces that are in scope of something
are the default namespace and the xmlns:a
namespace. For example puts new_doc.root.children[1].namespaces.inspect
(children[0]
is a text node) prints both namespaces:
{"xmlns:a"=>"http://example.org/a", "xmlns"=>"http://example.org/top"}
Thanks for response!
Are you suggesting that MRI and Jruby are both correct, even though they don't match? I don't understand how this can be, I thought MRI and JRuby nokogiri implementations could be expected to match. To me, the MRI implementation is the most useful, as well as seeming correct. The JRuby implementation has changed between nokogiri 1.8 and 1.9, and neither of them match MRI. (Are you suggesting that MRI ought to match... JRuby 1.9, even though it never has?)
What I need to do is remove a sub-tree (a node and it's children) from a document, and make it into it's own document -- with all namespaces semantically intact. In MRI nokogiri, I can do this. In Java nokogiri 1.8, I could do it via a workaround (I considered it a workaround to a bug). In Java nokogiri 1.9+, I can find no way to do this at all. (Short of serializing to a string and re-parsing). That doesn't seem right to me.
I don't understand where
http://example.org/b
is coming from. Is that a copy paste error ?
It was, copy-pasted from the wrong example, as I was messing with different examples to try and isolate the error. (Not sure if it's more or less confusing for me to go edit my original text at this point?)
Are you suggesting that MRI and Jruby are both correct, even though they don't match ?
Yes. They are equivalent, AFAIK.
I don't understand how this can be, I thought MRI and JRuby nokogiri implementations could be expected to match.
Not necessarily, specially if the output XML is equivalent, such as this case.
To me, the MRI implementation is the most useful, as well as seeming correct.
I don't think I understand how the MRI output is more useful. I think this is worth clarifying.
What I need to do is remove a sub-tree (a node and it's children) from a document, and make it into it's own document -- with all namespaces semantically intact.
Isn't that exactly what the current version of Nokogiri does ? JRuby is adding the namespace declaration to the node itself, instead of adding the namesapce declaration on the parent node. Both implementations are correct 1.
[...]
In Java nokogiri 1.9+, I can find no way to do this at all. (Short of serializing to a string and re-parsing). That doesn't seem right to me.
Can you provide an example code to demonstrate how the JRuby's output is problematic. Is the consumer of the output assuming that doc.namespaces
will return all namespaces in the document ?
Basically I want to write code that uses namespaces that will work under MRI or JRuby, and write tests that will pass under either. I want equivalent behavior. Which I thought was the contract of nokogiri.
This includes add_namespace
having the same effect in either, namespaces
returning the same thing in either, and produced serialization from the same operations on the same input being equivalent in both.
But I see your point about the XML trees being semantically equivalent in both. (Note: semantically equivalent to a namespace-aware processor; probably not to a non-namespace-aware-processor that thinks "a:something" is just a tag name and "xmlns:a" just an attribute name.) I will spend some more with it and satisfy myself that this is true; it still wouldn't totally satisfy me, but I see your point.
It seems likely to me that there are cases where the JRuby noko 1.9 version would produce a serialization that is a lot more bytes than the MRI version. If there are a lot of a:something
elements, which in the JRuby version each get an xmlns
declaration serialized on that element.
Which suggests another point: Let's say I have an in-memory representation that matches:
<record>
<a:something xmlns:a="http://example.org/a">a:something</a:something>
<a:something xmlns:a="http://example.org/a">a:something</a:something>
</record>
If I want to "hoist" that xmlns:a=
decleration to be on the record
element instead, so that when it is serialized it's on the record
element, and not present on the a:something
elements -- should nokogiri give me API to do that?
I don't think nokogiri gives me API to do that. The nokogiri API doesn't let you alter namespace declarations much, if at all. At least in Jruby nokogiri, not totally sure about MRI nokogiri (if they differ, that goes back to the point about MRI/nokogiri divergence).
Thanks for your analysis @jvshahid !
I see how the documents are semantically identical. It could be (and was!) a lot worse.
I still think ideally:
Jruby behavior should match MRI. So test or other code written under MRI can pass in JRuby without changes, and JRuby nokogiri can be considered entirely compat with MRI.
Failing that (or for other reasons), there ought to be a way to use Nokogiri API on an in-memory representation to change what elements xlmns
declarations appear on. I don't believe there is any such API (def on Jruby, I tried lots of things, but I believe in MRI too). xmlns namespace declerations (and on what element they will be serialized when writing out) are essentially immutable.
However, in the current implementation, I've changed my code to test for "namespace-aware semantic equivalence", where it was failing because serializations were not identical.
(In particular, if you take an existing document, and do node.to_xml
, vs re-homing that node to new_document.root = node
and doing a to_xml
, you may see xmlns
attributes on different nodes in those two different serializations, in Jruby).
Here is the pretty hacky method I wrote to test "namespace-aware semantic equivalence" of two different nokogiri elements, in case it's useful to anyone.
def ns_semantic_equivalent_xml?(noko_a, noko_b)
noko_a = noko_a.root if noko_a.kind_of?(Nokogiri::XML::Document)
noko_b = noko_b.root if noko_b.kind_of?(Nokogiri::XML::Document)
noko_a.name == noko_b.name &&
noko_a.namespace&.prefix == noko_b.namespace&.prefix &&
noko_a.namespace&.href == noko_b.namespace&.href &&
noko_a.attributes == noko_b.attributes &&
noko_a.children.length == noko_b.children.length &&
noko_a.children.each_with_index.all? do |a_child, index|
ns_semantic_equivalent_xml?(a_child, noko_b.children[index])
end
end
Hi, sorry for not responding before now on this thread.
Jruby behavior should match MRI
Maybe in an ideal world; but the JRuby implementation is a completely separate implementation from the CRuby backend and the benefit in this case is low relative to the cost. The "correct" implementation could be argued either way, and I'm unaware of any XML spec that covers reparenting nodes into new documents.
I would welcome a pull request that cleaned this up, but otherwise I will not personally prioritize it. I'm sorry your experience isn't seamless across these implementations.
For posterity, the CRuby behavior is to create a doc that looks like:
<?xml version="1.0"?>
<record xmlns="http://example.org/top" xmlns:a="http://example.org/a">
<a:something>a:something</a:something>
</record>
and the JRuby behavior is:
<?xml version="1.0"?>
<record xmlns="http://example.org/top">
<a:something xmlns:a="http://example.org/a">a:something</a:something>
</record>
Thanks. If anyone can figure out any invocation to get the JRuby serialized doc to look like the CRuby one, like any methods I can call to make that happen, please do share! I would love to be able to create a doc by mutating existing documents, in Jruby, that would serialize the way the Cruby one does -- I actually used to have a workaround to do it prior to 1.9.0, but now can't figure out any way to do it at all, which is a problem. :(
This is a followup to #1774. I'm sorry it took me a while to get back around to it.
While a fix to #1774 was provided in #1778, the fix only works for a default namespace, not one with a prefix. (It hadn't occured to me in my repro for #1774 that I should provide both, as they may be have different codepaths in java nokogiri I guess? )
What is worse, there is a regression in 1.9.0+, possibly introduced by #1778, but I haven't verified this.
I used to be able to work around #1774, by explicitly calling
add_namespace
to add the namespaces I wanted on the rehomed element.However, my workaround that worked in JRuby 1.8.5 stopped working in JRuby 1.9.0+, so I'm kind of stymied and maybe even worse off now. :(
Reproductions are all in MRI 2.5 and JRuby 9.2.0.0, but I do not believe specific MRI/JRuby versions matter.