sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.14k stars 899 forks source link

[bug] DocumentFragment parse behavior wrt `body` tag #3023

Closed flavorjones closed 5 months ago

flavorjones commented 11 months ago

Please describe the bug

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri"
end

doc = <<~HTML
 <html>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

puts Nokogiri::HTML4::DocumentFragment.parse(doc).to_html
puts "---"
puts Nokogiri::HTML5::DocumentFragment.parse(doc).to_html

emits:

<body>

    <div>hello</div>

</body>
---

    <div>hello</div>

Expected behavior

I would expect:

Additional context

@marcoroth reported at RubyConf! woot

flavorjones commented 6 months ago

@marcoroth I spent some time on this today, and I have a potential workaround for you that I'd like your feedback on, and if it works for Stimulus Reflex then I'll put some work in to make it easier to use.

first some questions about the use case

The "fragment" above is not a fragment -- it's a complete document. If possible, in these cases, it's better to parse it with Document.parse than DocumentFragment.parse.

I'd like to understand the use case of Stimulus Reflex a bit better to understand if there's a better API that Nokogiri could be offering here. Do you really not know if the content is a document or a fragment?

transform the code into an equivalent

First let me convert the code above to use DocumentFragment.new instead of .parse:

#!/usr/bin/env ruby

require "bundler/inline"

gemfile do
  source "https://rubygems.org"
  gem "nokogiri", path: "."
end

html = <<~HTML
 <html>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

puts Nokogiri::HTML4::DocumentFragment.new(Nokogiri::HTML4::Document.new, html).to_html
puts "---"
puts Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new, html).to_html

Note that the output is the same:

<body>

    <div>hello</div>

</body>
---

    <div>hello</div>

diagnosis

By default, both parsers will try to parse the fragment in the context of a body tag.

The HTML4 behavior that we see is due to a hack made a long time ago in c23eb88b: if a context node isn't specified, and there happens to be a line that matches /^\s*<body/i, then we return the tree including the body node.

If you change the script above to format the HTML slightly differently:

html = <<~HTML
 <html><body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

the output changes for the HTML4 fragment parser:


    <div>hello</div>

---

    <div>hello</div>

I really don't like this code very much, for a few reasons:

but it's been there for 14 years now and I can't really change it without breaking people's code.

We see what we see for the HTML5 fragment parser because it's working exactly as intended: we parse in the context of a body tag, and we return the children of that context node. IMHO this is the correct implementation and I'm hesitant to make it more complicated to try to match the hacky HTML4 implementation.

potential workaround

The HTML5::DocumentFragment.new method supports a "context node" argument, so we can tell the parser more exactly what we want to do. In the above case, because it's a complete document and not a fragment, I think what we really want to do is parse in the context of an html tag, so let's try that:

puts Nokogiri::HTML5::DocumentFragment.new(Nokogiri::HTML5::Document.new, html, "html").to_html

will output

<head></head><body class="foo">
    <div>hello</div>

</body>

A couple of things to call out:

html = <<~HTML
 <html>
   <head>
     <link rel="stylesheet" href="mystyle.css">
   </head>
   <body class="foo">
     <div>hello</div>
   </body>
 </html>
HTML

# ...

outputs

<body>

    <link rel="stylesheet" href="mystyle.css">

    <div>hello</div>

</body>
---
<head>
    <link rel="stylesheet" href="mystyle.css">
  </head>
  <body class="foo">
    <div>hello</div>

</body>

Perhaps naively, the HTML5 parser's output matches the input more exactly than the HTML4 parser's output. But I have no idea whether this is helpful for Stimulus Reflex's use case.

Does this work for you? If not, why not? If so, then I'll do some work to make the context node input available to callers of DocumentFragment.parse.

marcoroth commented 6 months ago

Hey @flavorjones, thanks for taking a look at this!

The "fragment" above is not a fragment -- it's a complete document. If possible, in these cases, it's better to parse it with Document.parse than DocumentFragment.parse.

Yeah this is true, ideally we could pick the right class to parse an HTML document. We have had efforts (like https://github.com/stimulusreflex/stimulus_reflex/pull/601, https://github.com/stimulusreflex/stimulus_reflex/pull/622) that tried to pick the right class depending on the context.

But sadly we haven't found one Nokogiri class/method combination that solves all the behaviors we saw/needed. It seems like that we would need a combination of the parts of behavior we saw in these classes:

To explain the use-case a bit more: StimulusReflex exposes a morph method that takes in two arguments. The first argument can be any CSS selector and the second argument is an HTML string.

Usually we see people do thing like:

morph "#my-element", %(<div id="my-element>My element is updated!</div>)

In this case we'd want to use HTML5::DocumentFragment.parse(...) and that works perfectly fine. But there are also valid use-cases where people do something like:

morph "body", %(<body class="body with updated classes">...</body>)

In this case HTML5::DocumentFragment.parse(...) wouldn't work anymore because it strips elements/attributes and transforms the document.

In the phlexing gem I solved it by doing something like this, but this feels somehow hacky:

https://github.com/marcoroth/phlexing/blob/90b3af31ea56b70c101d5775d09f6f2c43a241c7/gem/lib/phlexing/parser.rb#L15-L29

Ideally Nokogiri would expose an API that would parse HTML as-is. So that it parses the HTML without having any spec in mind. And so, that it also wouldn't try to make it spec-compliant. I guess you could say we are looking for something like an AST for HTML that returns a representation of whatever you give it, be it valid/spec-conform or not.

But I would totally understand if this is out-of-scope for a gem like Nokogiri, which is also why I mentioned in https://github.com/stimulusreflex/stimulus_reflex/issues/652 that it might make sense to move off of Nokogiri.

The problem is that Rails doesn't really care if you ship spec-compliant HTML to your browser through a Rails request/response cycle. But since StimulusReflex tries to dynamically update the rendered page by re-rendering the view and by passing it through Nokogiri it breaks the flow if you have non-spec-compliant HTML.

If people would write HTML5-compliant HTML views we wouldn't have this discussion, but since modern browsers got so relaxed with processing invalid HTML it now gets into the way when StimulusReflex tries to update the view using Nokogiri. Maybe we can also do something in Rails that warns people if their controller action returns invalid HTML?

Potential workaround: The HTML5::DocumentFragment.new method supports a "context node" argument

While I think that a "context node" argument would be useful I don't think it really solves our problem, since we'd still have to know which the right context node is given an arbitrary string of HTML.

Working on this parsing topic through different projects and requirements (besides StimulusReflex) I think there's a bigger need for a tool/parser that goes beyond Nokogiri and even Ruby.

I'm really unsure how we can and should proceed here. Maybe it would make sense to talk it through face-to-face so we don't talk past each other. Thank you for the support so far!

flavorjones commented 6 months ago

@marcoroth Let's chat in slack, I think we'd benefit from higher-fidelity conversation.

flavorjones commented 5 months ago

Circling back to close this off: after a good long chat, I think I understand the StimulusReflex use case much better, and I think Marco understands the challenges presented by the HTML5 spec a bit better.

In summary, I think we agreed on this set of problem statements, which Nokogiri doesn't help answer today:

  1. is the input a document or a fragment?
  2. if it's a fragment, what should the context node be?
  3. if additional parent nodes are created, how do we select the correct child nodes?

Figuring out how to do this as a user of Nokogiri's API is just too hard today. So I made a thing to try to ease this pain: https://github.com/flavorjones/nokogiri-html5-inference

I'm going to close this ticket, and hopefully we can continue the conversation about what's working or not in an issue on that repo!

Thanks again for bringing this up, @marcoroth!

marcoroth commented 5 months ago

Thanks for the summary and for talking this through @flavorjones, it's really appreciated! 🙏🏼

stevecheckoway commented 5 months ago

I spent a little time testing if this could be parsed as a fragment with a template node as the context but unfortunately a body start tag is ignored in that case.

Parsing using a template node as the context node may help the other cases like parsing td elements.