rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0
186 stars 114 forks source link

Would you consider accepting native sanitization? #164

Closed ghost closed 3 years ago

ghost commented 3 years ago

I have made available the beginnings of a gem which uses the nokogumbo parsing, but allows for HTML sanitization in native C: https://github.com/justacman/cleanse

The test suite is identical to that of sanitize, and the API is simple, taking in a hash of config options also similar to sanitize, for example:

sanitizer = Cleanse::Sanitizer.new({ allow_doctype: false, elements: ["html"] })
assert_equal("<html>foo</html>", Cleanse::Document.new("<!DOCTYPE html><html>foo</html>", sanitizer: sanitizer).to_html)
assert_equal("foo", Cleanse::DocumentFragment.new("<!DOCTYPE html>foo", sanitizer: sanitizer).to_html)

input = '<a href="hTTpS://foo.com/">Text</a>'

sanitizer = Cleanse::Sanitizer.new(
    elements: ["a"],
    attributes: { "a" => ["href"] },
    protocols: { "a" => { "href" => ["https"] } }
  )

assert_equal(input, Cleanse::DocumentFragment.new(input, sanitizer: sanitizer).to_html)

input = '<a href="mailto:someone@example.com?Subject=Hello">Text</a>
assert_equal("<a>Text</a>", Cleanse::DocumentFragment.new(input, sanitizer: sanitizer).to_html)

Benchmark wise it is much faster than current Ruby implementations. Also right now it only serializes back into a string in to_html, but I want to get back a Nokogumbo/Nokogiri interface. I do not think the world needs two nokogiri interface gems, hence, my suggestion to integration the sanitization directly into this library.

The API could remain similar:

sanitizer = Nokogumbo::Sanitizer.new(
    elements: ["a"],
    attributes: { "a" => ["href"] },
    protocols: { "a" => { "href" => ["https"] } }
  )

Nokogiri::HTML5::DocumentFragment.parse(fragment, encoding = nil, options = { sanitizer: sanitizer })

wdyt?

flavorjones commented 3 years ago

:wave: I'm not a Nokogumbo contributor, but I do work closely with the team as the primary Nokogiri maintainer. Full disclosure, we've been discussing plans to merge Nokogumbo into Nokogiri this year, and that's the lens through which I'm looking at this request.

While I understand there are some potential benefits to integrate sanitization into Nokogumbo, I do worry about a couple of things.

One is that the cleanse library is brand-new and has no users yet, and so Nokogumbo would be taking on additional complexity without really understanding whether the code has commensurate value. I'd prefer to see this code get released separately and see the library build up a user base before considering merging it in.

Another concern is that this extends the scope of Nokogumbo (and eventually Nokogiri) beyond simply parsing HTML5. Which might be fine? But I broke out Loofah as a separate concern because I predicted (accurately, as it turns out) that it would have a very different release cycle than Nokogiri; and also because the domain could be modeled efficiently as a dependent gem.

What I'd really find valuable is a discussion -- here or elsewhere -- about the C API that Nokogiri and Nokogumbo should make available to enable this library to exist separately, while still running quickly and keeping the codebase as simple as reasonable. Can you talk to us a little bit about the C-level dependencies, at both compile time and link time?

rubys commented 3 years ago

FWIW, if Nokogiri were to subsume Nokogumbo's functionality (and by that I mean either by incorporation of Nokogumbo's code base or via a different implementation of HTML5), then I personally don't see a need for Nokogumbo's continued existence for any reason other than as a transition aid.

If you accept that premise, this reduces the question above to whether sanitation should be included in Nokogiri.

I'm not aware of any C-level APIs made available by Nokogumbo. Given that Nokogumbo's implementation is quite different based on whether or not the libxml libraries are present, this would pose a challenge. (By the way, the code to handle not being able to find the libraries can be jettisoned if incorporated into Nokogiri).

rubys commented 3 years ago

Wow, I take it back. The cleanse gem does use gumbo.h. I would have thought it better to start with the libxml API.

ghost commented 3 years ago

One is that the cleanse library is brand-new and has no users yet, and so Nokogumbo would be taking on additional complexity without really understanding whether the code has commensurate value. I'd prefer to see this code get released separately and see the library build up a user base before considering merging it in.

understood, but this achievement would take years, no? I saw a missing need for safe and fast HTML sanitization. It has run against Valgrind and ASAN with no issues. Not sure what else to say to ask for a closer look. Ruby projects is already relying on slower Ruby-only sanitization so the likelihood of switching is not reasonable.

What I'd really find valuable is a discussion -- here or elsewhere -- about the C API that Nokogiri and Nokogumbo should make available to enable this library to exist separately, while still running quickly and keeping the codebase as simple as reasonable. Can you talk to us a little bit about the C-level dependencies, at both compile time and link time?

The only dependency is Gumbo parser at compile time. All the API methods used as @rubys note is straight from gumbo -- gumbo_element_remove_attribute_at, gumbo_destroy_node.

On Tue, Feb 16, 2021 at 5:52 PM Sam Ruby notifications@github.com wrote:

Wow, I take it back. The cleanse gem does use gumbo.h. I would have thought it better to start with the libxml API.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/rubys/nokogumbo/issues/164#issuecomment-780010045, or unsubscribe https://github.com/notifications/unsubscribe-auth/AS3VTUTAFWOJPHZPWDFIYFDS7KWERANCNFSM4XVWJSVA .

flavorjones commented 3 years ago

@justacman Just to be clear: I appreciate the thought and effort that you've put in so far; and I appreciate that you didn't just show up with an idea, you showed up with working code. Thank you!

But I think we're trying to have two different conversations:

These framings reflect very different perspectives and assumptions, and I'm not entirely sure a github issue thread is the best way for us to work through such a complex conversation.


Here's what I'd like to propose:

and then maybe we can pick up this conversation again later in the year if it still seems valuable to you. Does that seem fair?