rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0
186 stars 114 forks source link

coerce(str) results in Text node rather than html #160

Closed dan42 closed 3 years ago

dan42 commented 3 years ago

I ran into a surprise when trying to insert content into a document using element.before("<div/>") For regular elements it works, but if it's a script element, "<div/>" is escaped and inserted as text. Apparently due to coerce behaving weirdly for this case?

Nokogiri.HTML('<script>').at('//script').send(:coerce, '<div/>')
#=> [#<Nokogiri::XML::Element:0x2adb350529dc name="div">]

Nokogiri.HTML5('<script>').at('//script').send(:coerce, '<div/>')
#=> [#<Nokogiri::XML::Text:0x2adb3505bb2c "<div>">]
Nokogiri.HTML5('<span>').at('//span').send(:coerce, '<div/>')
#=> [#<Nokogiri::XML::Element:0x2adb3505b014 name="div">]
stevecheckoway commented 3 years ago

I think the source of this bug is in Nokogiri but not in a way that matters for Nokogiri.

What's happening is that when #before is called, it ends up calling #add_sibling which calls #coerce, as you've discovered. #coerce sees that the input is a string and then calls #fragment on it.

Here's the documentation for #fragment.

      ###
      # Create a DocumentFragment containing +tags+ that is relative to _this_
      # context node.

And this explains what's happening. self is a script element. When Nokogumbo parses an HTML fragment with a script element as the context element, it does what the standard says to do in this case, namely to start tokenizing in the script data state. Inside a script element, <div/> has no special meaning, so it returns a text node.

I think what should happen is the parent element's #coerce should be called rather than the current element. That is the #add_sibling should be changed from

      def add_sibling(next_or_previous, node_or_tags)
        impl = (next_or_previous == :next) ? :add_next_sibling_node : :add_previous_sibling_node
        iter = (next_or_previous == :next) ? :reverse_each : :each

        node_or_tags = coerce node_or_tags
        ...

to

      def add_sibling(next_or_previous, node_or_tags)
        impl = (next_or_previous == :next) ? :add_next_sibling_node : :add_previous_sibling_node
        iter = (next_or_previous == :next) ? :reverse_each : :each

        node_or_tags = parent.coerce node_or_tags
        ...

I verified that passing the parent element as the context element for the DocumentFragment works in this case.

irb(main):001:0> doc = Nokogiri.HTML5('<script>')
=> #<Nokogiri::HTML5::Document:0x4c4 name="document" children=[#<Nokogiri::XML::Element:0x4b0 name="html" children=[#<Nokogiri::XML::Element:0x488 name="he...
irb(main):002:0> doc.at('//script').before('<div/>')
=> #<Nokogiri::XML::Element:0x474 name="script">
irb(main):003:0> doc.to_s
=> "<html><head><div></div><script></script></head><body></body></html>"
irb(main):004:0> doc
=> #<Nokogiri::HTML5::Document:0x4c4 name="document" children=[#<Nokogiri::XML::Element:0x4b0 name="html" children=[#<Nokogiri::XML::Element:0x488 name="head" children=[#<Nokogiri::XML::Element:0x4d8 name="div">, #<Nokogiri::XML::Element:0x474 name="script">]>, #<Nokogiri::XML::Element:0x49c name="body">]>]>

but I did that by modifying Nokogiri::HTML5::Node#fragment which is going to be incorrect in general.

I'm not sure why this doesn't affect Nokogiri. Maybe the rules are different for earlier HTML or maybe there's some bug in xmlParseInNodeContext.

@flavorjones Do you have any thoughts on this? Should I create a PR with this change for Nokogiri? I'm not really sure how I would test it though.

flavorjones commented 3 years ago

Hey, @stevecheckoway. I agree with your take, that Nokogiri should be calling #coerce on the parent node when adding siblings.

I'm looking at how to test this by examining what's going wrong. You're right that xmlParseInNodeContext does not differentiate between a script tag and any other element when it parses in-context, which is why we haven't caught this bug before now.

I'm going to spike on a testing approach that will assert that the method is called on the parent, and see how I feel about it.

flavorjones commented 3 years ago

@stevecheckoway PR https://github.com/sparklemotion/nokogiri/pull/2116 is making its way through CI now. If it goes green, and I think it will, it'll be in v1.11.0.rc4, hopefully within the next week.

stevecheckoway commented 3 years ago

@flavorjones Great! Thanks so much for looking into this and getting it fixed so quickly.

flavorjones commented 3 years ago

@stevecheckoway I'm going to spend some time today getting nokogiri master in shape to integration-test this with nokogumbo. I'm going to reopen this until I can do that, if that's OK?

EDIT: the work I'm referring to is at https://github.com/sparklemotion/nokogiri/pull/1788

stevecheckoway commented 3 years ago

By all means!

dan42 commented 3 years ago

Thanks to everyone for the very fast fix!

flavorjones commented 3 years ago

OK, I'm going to close this since Nokogiri master is once again compatible with Nokogumbo, and I've also got a draft PR at #163 to further improve the integration. Thanks all for your patience!