sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.14k stars 899 forks source link

Attribute inconsistency caused by `remove_namespaces!` #1157

Open DavidEGrayson opened 10 years ago

DavidEGrayson commented 10 years ago

Hello. I'd like to report a bug in Nokogiri that is exhibited by the following code:

require 'nokogiri'
doc = Nokogiri::XML('<a b:c="" b:d=""/>')
doc.remove_namespaces!
node = doc.xpath('//a').first
p node
p node['d']
p node.attributes['d'].value

I would expect the last two lines to both print "" to the console, because I would think that they are equivalent ways to access the "d" attribute.

When I run the code in JRuby ( jruby 1.7.13 (1.9.3p392) 2014-06-24 43f133c on Java HotSpot(TM) Client VM 1.7.0_55-b14 [Windows 8-x86] ), I get the following output:

#<Nokogiri::XML::Element:0x832 name="a" attributes=[#<Nokogiri::XML::Attr:0x82e name="d">, #<Nokogiri::XML::Attr:0x830 name="c">]>
nil
""

When I run the code in MRI ( ruby 2.1.2p95 (2014-05-08 revision 45877) [i686-linux] ) I get the following output:

#<Nokogiri::XML::Element:0x..fdc44715c name="a" attributes=[#<Nokogiri::XML::Attr:0x..fdc4470d0 name="b:c">, #<Nokogiri::XML::Attr:0x..fdc4470c6 name="b:d">]>
nil
/c/Users/David/Documents/rpicstatic/test2.rb:7:in `<main>': undefined method `value' for nil:NilClass (NoMethodError)

I think that the remove_namespaces! method is buggy because the problem seems to go away if I stop calling that and refer to everything by its full name.

In both cases, I was using Nokogiri version 1.6.3.1.

flavorjones commented 9 years ago

Hi,

Thanks for reporting this. I agree that this is probably a bug in how #remove_namespaces! is hacking away at the DOM. Thanks for providing a test case; we'll look into fixing for 1.6.6.

flavorjones commented 9 years ago

OK, this is a weird one, and there are a couple of things being conflated. Let me try to break them out one by one, and then we'll address the inconsistent behavior.

But first, let me point out that the only bug here is inconsistent behavior, as we invented #remove_namespaces!, e.g. it's not defined in the XML specification.

First, let's examine the MRI behavior for your example, but let me assert on the setup:

doc = Nokogiri::XML <<EOXML
  <a b:c="thing-c" b:d="thing-d"/>
EOXML
puts "errors: #{doc.errors}"

node = doc.xpath('//a').first
puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

doc.remove_namespaces!
puts "---"

puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

MRI outputs:

errors: [#<Nokogiri::XML::SyntaxError: Namespace prefix b for c on a is not defined>, #<Nokogiri::XML::SyntaxError: Namespace prefix b for d on a is not defined>]
#<Nokogiri::XML::Element:0xff2ee0 name="a" attributes=[#<Nokogiri::XML::Attr:0xff2ddc name="b:c" value="thing-c">, #<Nokogiri::XML::Attr:0xff2dc8 name="b:d" value="thing-d">]>
nil
nil
---
#<Nokogiri::XML::Element:0xff2ee0 name="a" attributes=[#<Nokogiri::XML::Attr:0xff2ddc name="b:c" value="thing-c">, #<Nokogiri::XML::Attr:0xff2dc8 name="b:d" value="thing-d">]>
nil
nil

So you should notice two things:

  1. there are namespace errors when parsing the document, and
  2. the state of the document is not changed by calling #remove_namespaces!

Now let's see the JRuby output:

errors: [#<Nokogiri::XML::SyntaxError: The prefix "b" for attribute "b:c" associated with an element type "a" is not bound.>, #<Nokogiri::XML::SyntaxError: The prefix "b" for attribute "b:d" associated with an element type "a" is not bound.>]
#<Nokogiri::XML::Element:0x7d4 name="a" attributes=[#<Nokogiri::XML::Attr:0x7d0 name="c" value="thing-c">, #<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">]>
nil
#<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">
---
#<Nokogiri::XML::Element:0x7d4 name="a" attributes=[#<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">, #<Nokogiri::XML::Attr:0x7d0 name="c" value="thing-c">]>
nil
#<Nokogiri::XML::Attr:0x7d2 name="d" value="thing-d">

You'll notice that JRuby is inconsistent with MRI when parsing a document with incorrectly-namespaced entities. MRI/libxml2 will name the attribute "b:d", while JRuby/xerces will name it "d". There are other differences in how the parsers "fix up" broken markup, and this is just one example.

The historical convention on this project, which you may or may not agree with, is to treat inconsistencies in how broken markup is fixed as not-a-bug.

But let's fix up the markup to correctly declare namespaces, and examine the output from each port:

doc = Nokogiri::XML <<EOXML
<root xmlns:a="http://a.flavorjon.es/" xmlns:b="http://b.flavorjon.es/">
  <a b:c="thing-c" b:d="thing-d"/>
</root>
EOXML
puts "errors: #{doc.errors}"

node = doc.xpath('//a').first
puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

doc.remove_namespaces!
puts "---"

puts node.inspect
puts node['d'].inspect
puts node.attributes['d'].inspect

MRI:

errors: []
#<Nokogiri::XML::Element:0x131d200 name="a" attributes=[#<Nokogiri::XML::Attr:0x131d19c name="c" namespace=#<Nokogiri::XML::Namespace:0x131d534 prefix="b" href="http://b.flavorjon.es/"> value="thing-c">, #<Nokogiri::XML::Attr:0x131d188 name="d" namespace=#<Nokogiri::XML::Namespace:0x131d534 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">]>
nil
#<Nokogiri::XML::Attr:0x131d188 name="d" namespace=#<Nokogiri::XML::Namespace:0x131d534 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">
---
#<Nokogiri::XML::Element:0x131d200 name="a" attributes=[#<Nokogiri::XML::Attr:0x131d19c name="c" value="thing-c">, #<Nokogiri::XML::Attr:0x131d188 name="d" value="thing-d">]>
"thing-d"
#<Nokogiri::XML::Attr:0x131d188 name="d" value="thing-d">

JRuby:

errors: []
#<Nokogiri::XML::Element:0x7d6 name="a"attributes=[#<Nokogiri::XML::Attr:0x7d2 name="c" namespace=#<Nokogiri::XML::Namespace:0x7d0 prefix="b" href="http://b.flavorjon.es/"> value="thing-c">, #<Nokogiri::XML::Attr:0x7d4 name="d" namespace=#<Nokogiri::XML::Namespace:0x7d0 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">]>
nil
#<Nokogiri::XML::Attr:0x7d4 name="d" namespace=#<Nokogiri::XML::Namespace:0x7d0 prefix="b" href="http://b.flavorjon.es/"> value="thing-d">
---
#<Nokogiri::XML::Element:0x7d6 name="a" attributes=[#<Nokogiri::XML::Attr:0x7d4 name="d" value="thing-d">, #<Nokogiri::XML::Attr:0x7d2 name="c" value="thing-c">]>
nil
#<Nokogiri::XML::Attr:0x7d4 name="d" value="thing-d">

You'll notice that the behavior is somewhat inconsistent, in that after #remove_namespaces the structures match between MRI and JRuby. However, the output of node['d'] differs in that MRI returns "thing-d" and JRuby return nil. This is a bug.

DavidEGrayson commented 9 years ago

Thanks for looking into it and identifying what the bug really is.

flavorjones commented 3 months ago

I suspect this is related to the issue described in #1012