sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.14k stars 899 forks source link

`Nokogiri::XML::Node#add_child_node_and_reparent_attrs` behaves incorrectly if an attribute name has a colon #1790

Open stevecheckoway opened 6 years ago

stevecheckoway commented 6 years ago

What problems are you experiencing? Nokogiri::XML::Node#add_child_node_and_reparent_attrs uses an incorrect test to see if a node's namespaces need to be reparented (at least, I think that's the purpose of this code).

It's testing a.name =~ /:/ to decide to reparent. This breaks HTML elements which are allowed to have colons in their attribute names. (Essentially, HTML only allows foreign elements to have explicit namespaces.)

It's a little convoluted to demonstrate the problem using the Nokogiri API.

require 'nokogiri'

doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'

puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)

This prints

#<Nokogiri::XML::Attr:0x3ff75a84b6d8 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3ff75a84b340 name="lang" namespace=#<Nokogiri::XML::Namespace:0x3ff75a84b2dc prefix="xml" href="http://www.w3.org/XML/1998/namespace"> value="en">

I note that using doc.root = html doesn't do this reparenting.

Here's a backwards compatible fix.

require 'nokogiri'

module Nokogiri
  module XML
    class Node
      # HTML elements can have attributes that contain colons.
      # Nokogiri::XML::Node#[]= treats names with colons as a prefixed QName
      # and tries to create an attribute in a namespace. This is especially
      # annoying with attribute names like xml:lang since libxml2 will
      # actually create the xml namespace if it doesn't exist already.
      def add_child_node_and_reparent_attrs node
        add_child_node(node)
        node.attribute_nodes.find_all { |a| a.namespace }.each do |attr|
          attr.remove
          node[attr.name] = attr.value
        end
      end
    end
  end
end

doc = Nokogiri::HTML::Document.new
html = Nokogiri::XML::Element.new('html', doc)
html['a'] = 'en'
attr = html.attribute('a')
attr.name = 'xml:lang'

puts(html.attribute_nodes[0].inspect)
doc.add_child(html)
puts(html.attribute_nodes[0].inspect)

This prints

#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">
#<Nokogiri::XML::Attr:0x3fdb7a09b5b0 name="xml:lang" value="en">

Ideally, an API for manipulating attributes in a given namespace that doesn't do the parsing based on colon (e.g., uses xmlNewProp rather than xmlSetProp) would be fantastic.

What's the output from nokogiri -v?

# Nokogiri (1.8.4)
    ---
    warnings: []
    nokogiri: 1.8.4
    ruby:
      version: 2.4.4
      platform: x86_64-darwin17
      description: ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-darwin17]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxml2/2.9.8"
      libxslt_path: "/Users/steve/programming/nokogumbo/vendor/bundle/ruby/2.4.0/gems/nokogiri-1.8.4/ports/x86_64-apple-darwin17.4.0/libxslt/1.1.32"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      libxslt_patches: []
      compiled: 2.9.8
      loaded: 2.9.8

Can you provide a self-contained script that reproduces what you're seeing?

flavorjones commented 6 years ago

Thank you for submitting this! I'm just back from vacation and it will take a few days to catch up on everything. Thanks for your patience.

stevecheckoway commented 6 years ago

In retrospect, I'm not sure what this function is supposed to do at all. If I'm reading xml_node.c correctly (and I'm not at all sure that I am since I'm having a separate issue), all of the namespace reparenting should be happening in relink_namespace.

flavorjones commented 3 years ago

Hey, @stevecheckoway! Finally getting around to looking at this.

all of the namespace reparenting should be happening in relink_namespace

Yes, you're right. This method originally came from #870 which was a fix for #869; but the actual root cause was that relink_namespaces was bailing early -- skipping attributes -- if the element didn't have a namespace itself.

I just wrote #2310 which fixes the underlying implementation of relink_namespaces, but now we have a different issue, which is: there don't appear to be any/many tests of the HTML5::Node behavior implemented in 7b3c972. Hang on for one sec on that point ...

Side note: this refactor became much easier after #1712 was fixed by #2246 and #2228 was fixed by #2230, because the reparenting/namespacing behavior was much less buggy.

I'm not sure what this function is supposed to do at all

Again, this method came from #870 which was a fix for #869. Specifically, it's intended to handle cases like this:

Nokogiri::XML::Builder.new do |xml|
  xml.root('xmlns:foo' => 'bar') {
    xml.obj('foo:attr' => 'baz')
  }
end

We're trying to resolve this ambiguity:

If we were re-designing the Builder API, we could probably allow for explicitly setting the namespace for that attribute. However, there's an analogous use case that doesn't use Builder:

        doc1 = Nokogiri::XML('<root/>')
        doc2 = Nokogiri::XML('<root xmlns:foo="http://foo.io"/>')
        node = doc1.create_element('obj', 'foo:attr' => 'baz')
        # at this point, the attribute is named `foo:attr`
        # #<Nokogiri::XML::Element:0xba4 name="obj" attributes=[#<Nokogiri::XML::Attr:0xb90 name="foo:attr" value="baz">]>
        doc2.root.add_child(node)
        # but now the attribute is `attr` with a namespace
        # #<Nokogiri::XML::Element:0xba4 name="obj" attributes=[#<Nokogiri::XML::Attr:0xbcc name="attr" namespace=#<Nokogiri::XML::Namespace:0xbb8 prefix="foo" href="http://foo.io"> value="baz">]>

Your suggested fix breaks this "namespacing" process for the attribute in question.

Normally this is not an issue if the implied prefix doesn't correspond to a namespace in the document -- for example, using foo:lang instead of xml:lang correctly creates this attribute:

#<Nokogiri::XML::Attr:0x550 name="foo:lang" value="en">

but unfortunately in an XML document the xml prefix is reserved and will always resolve.

@stevecheckoway WDYT of an alternative approach: what if we skipped relink_namespace in HTML4+5 documents? I'm going to start a new branch where I try to write a test for 7b3c972

flavorjones commented 3 years ago

See #2310 for what I think is a good fix for all of this that in fewer lines of code.