RFC: Explore alternatives to libxml2 for HTML parsing

flavorjones commented 4 years ago

Background

libxml2 supports parsing of HTML4. But the most popular browsers have moved on to HTML5 and exhibit different behavior than libxml2 when parsing documents, particularly when fixing up broken markup and when interpreting comments, but also for a bunch of edge cases that the WHATWG makes non-normative recommendations on how to handle.

WHATWG recommendations: https://html.spec.whatwg.org/multipage/parsing.html
One example of a dangerous difference in behaviors: https://github.com/sparklemotion/nokogiri/pull/2058
Another example of a dangerous difference in behaviors: https://bugzilla.gnome.org/show_bug.cgi?id=769760

where "dangerous" means, basically, "vector for an XSS attack".

Why libxml2 isn't likely to get better at HTML5

The libxml2 code base is positively ancient by HTML5 time scales, and it would be challenging to update it to handle new tags, new allowed attributes, WHATWG parsing recommendations, and potentially other things I haven't even thought of. So it's very understandable that the libxml2 maintainers haven't really expressed an interest, or have expressed a disinterest, in updating the library.

Comment by libxml2 maintainer: "it's still a bad idea to use a 20+ years old, unmaintained HTML 4 parser to sanitize input for the modern web" (1)
- https://bugzilla.gnome.org/show_bug.cgi?id=769760#c4
Comment by the same libxml2 maintainer in a different issue: "some of my changes also made libxml2 more aligned with HTML5. But it's still a really bad idea to sanitize HTML with libxml2. Almost every divergence from the HTML5 parsing rules can be abused to hide malicious content. For example ..." (2)
- https://gitlab.gnome.org/GNOME/libxml2/-/merge_requests/82#note_886120

This is totally fine! libxml2 does what it does well, which is parsing XML 1.0 and HTML 4.0, and we shouldn't throw any shade their way for steadfastly maintaining a successful and stable OSS project. Thank you, libxml2 maintainers, for doing an amazing job.

Users are asking for HTML5 support

But just because libxml2 can't do the job doesn't mean the job isn't important. Look at this search for HTML5-related issues reported on Nokogiri! This is evidence that Nokogiri is losing relevance as more of the world moves to HTML5.

Alternatives

Nokogumbo

The awesome Nokogumbo project was created to help: it uses the Gumbo parser to parse HTML5 documents, then transmogrifies the Gumbo document structure into the libxml2 document structure so that the rest of Nokogiri can be used to manipulate and serialize the doc. This is amazing!

Unfortunately, the Gumbo parser isn't well-maintained outside of the Nokogumbo project, which is continuing to make updates to their vendored fork of the codebase. But does that matter if HTML5 is relatively static? We should have a conversation about it.

Another interesting thing is that Nokogumbo is compiling the Gumbo codebase (just like Nokogiri is compiling the libxml2/libxslt codebases) at installation time. If we made Nokogumbo a dependency of Nokogiri, that would make some of our users Even More Unhappy with installation times of their tech stack.

We could share Nokogiri's approach for building precompiled windows libraries (and soon-to-ship precompiled linux libraries and hopefully someday precompiled Darwin/OSX libraries) with the Nokogumbo team, but that still wouldn't help our JRuby friends, who depend on Nokogiri bundling NekoHTML (btw also not-well-maintained). Or maybe we could discuss merging the two projects together? (Nokogiri and Nokogumbo). I'm going to tag a few Nokogumbo folks to see what they think.

Another parsing library

Other options I've briefly looked at:

But we'd be starting from scratch in trying to wrap these up, and that feels like potentially a huge time sink.

Timelines

I'm tagging this for the v1.12.0 milestone for a few reasons:

v1.11.0 is already really dragging and I don't want to add scope to it
v1.11.0 should have precompiled binaries for all the major platforms (including OSX if we're lucky) which would be very useful when tackling adding another parser codebase for HTML5

So, to sum up and rephrase the broad strokes of the roadmap:

v1.11.0 is focused on shipping precompiled libraries to improve installation time and decrease installation issues for users
v1.12.0 would be focused on expanding the breadth of parsing functionality to include HTML5

Comments welcome.

flavorjones commented 4 years ago

Tagging @stevecheckoway, @craigbarnes, and @rubys for awareness that I've been thinking about this, and to potentially discuss how Nokogiri and Nokogumbo can work closer and better together in the future!

Tagging @tenderlove and @ptoomey3 because we were just on the twitters talking about this.

stevecheckoway commented 4 years ago

I'm glad you're thinking about this!

One problem is that HTML 5 isn't static. It's a living standard and the parsing steps have undergone changes during the time I've been working on Nokogumbo. I spent a while updating Gumbo to match the standard and added extensive unit tests. I tried to test every case in the spec. That was a nightmare; I'm sure I missed some. Since that point, the standard has changed again (I think). Some HTML elements have been deprecated, for example.

I've given thought to trying to parse the standard itself and generate code for the tokenization and tree-building state machines, but that's a pretty big project and I definitely don't have time for it.

flavorjones commented 4 years ago

@stevecheckoway Well, today I learned how much the HTML5 spec is changing over time. Thank you for the effort and for making Nokogumbo such great software!

I guess one advantage of moving to a fully-maintained parser like Gecko's or Blink's is that we wouldn't have to go through that work. Maybe that's worth looking into as an alternative?

I had an email conversation with @rubys in Nov 2010 about potentially using validator.nu for HTML5 parsing ... it's Java (which we could use to replace NekoHTML in the JRuby implementation) and can generate C++ code (which is what Gecko's HTML parser is based on) meaning the JRuby and CRuby behaviors would be identical.

CCing @hsivonen who I'm sure has thought deeply about ideas like this over the years, in case he's interested in following along or dropping some knowledge.

hsivonen commented 4 years ago

Back in 2010 my plan was to generate a non-Gecko C++ version of the Validator.nu HTML Parser. That ended up not happening due to tasks that were of higher priority to Mozilla.

Instead @rubys added a to use the Validator.nu HTML Parser from Ruby via gcj. gcj itself is now gone. Also, I believe that @rubys migrated his app to use Gumbo.

One alternative to adding an non-Gecko target for the C++ translation of the Validator.nu HTML Parser and to using Gumbo after updating it to spec would be to create an FFI wrapper for html5ever.

rubys commented 4 years ago

Looks like there is a new library, which presumably would require no compilation and would work on all Ruby platforms including JRuby: https://github.com/namusyaka/gammo

stevecheckoway commented 4 years ago

Oh neat! One possible concern for Nokogiri is that we (Nokogumbo) have already gotten push back for implementing the HTML 5 serialization algorithm in Ruby as being too slow. I imagine there would be similar concerns about switching to a pure Ruby parsing library. (My personal use case would be completely unaffected by such a change.)

flavorjones commented 4 years ago

I don't think Nokogiri should choose a pure-Ruby implementation, because of performance concerns. Nokogiri is in the hot path for lots of Rails applications at this point (via the dependency chain of rail-html-sanitizer → loofah → nokogiri) and so I don't think we'd be meeting or exceeding expectations unless we have roughly comparable performance (speed and memory) when parsing.

I'll also note that we implemented an FFI version of Nokogiri but ripped it out in 1.5.0 for a variety of reasons including difficulty debugging memory issues and performance (both of which may now, nine years later, be addressed). I explained this more at length in this GoGaRuCo 2013 talk starting at 21:49.

ffi-add-and-delete

:rofl: :sob:

ptoomey3 commented 4 years ago

I spent a while updating Gumbo to match the standard and added extensive unit tests.

Is that to say that upstream Gumbo isn't keeping pace with the rate of change? Is it abandoned, casually maintained, etc? Given nokogiri has gotten by with a more or less an unmaintained library for html, it might still be a net win to move to gumbo so long as it is kept at least semi-current.

rubys commented 4 years ago

A lot of Rails users care about JRuby support, which has yet to be implemented for nokogumbo: https://github.com/sparklemotion/nokogiri/issues/2064.

stevecheckoway commented 4 years ago

@ptoomey3, my understanding is that it's completely unmaintained. I think what happened is a Google engineer built it in their 20% time and then either stopped or is no longer at Google. @craigbarnes put a lot of work into updating it and making it faster/more maintainable as part of lua-gumbo.

I took a bunch of those changes for our fork in Nokogumbo. I believe the lua-gumbo and nokogumbo implementations have diverged (particularly the test harness). Craig and I have been notifying each other of important updates we've made, however.

The official gumbo repo hasn't been touched in 5 years and contains several security vulnerabilities¹ that have been fixed in lua-gumbo and nokogumbo.

¹ I didn't check if they were exploitable.

rubys commented 4 years ago

What would be the next step? How can I help?

Putting aside the JRuby problem for the moment, it seems like nokogumbo is the logical next step; and it wouldn't be one that would preclude a Gecko/Servo/Chromium based replacement in the future.

flavorjones commented 4 years ago

@rubys - I agree that leveraging Nokogumbo is the easiest next step (and potentially the only step, at least for another 10 years :laughing: :sob:).

To be clear: a consequence of this decision is for Loofah and Nokogiri to start using Nokogumbo for HTML parsing, meaning that Nokogumbo will become a dependency of Rails, and almost certainly will have an increased support burden as a result.

None of us is paid for our work. How can we preserve maintainer happiness on both projects while maintaining an acceptable level of support for the Rails community?

Proposed next steps

I want to be very respectful of boundaries here. The Nokogumbo contributors have filled this functionality gap very well and I don't want to presume that consolidation or assimilation of either project is necessary or desirable.

If it would be easier to shift this discussion into a realtime face-to-face conversation, I will gladly do that.

1. 2020:

Status quo for a few more months. Ship Nokogiri v1.11.0 towards the end of 2020 with precompiled native gems for windows, linux, and osx/darwin.

Ensure that Nokogumbo installation doesn't break (see https://github.com/rubys/nokogumbo/pull/148 and https://github.com/sparklemotion/nokogiri/pull/1788).

2. Decision point:

After the precompiled native gem release has baked for a bit and we've learned more about whether it was a good decision or not and how hard it is to maintain this approach -- maybe in early 2021? -- make a decision, involving the key stakeholders from both projects, about whether to merge the projects.

3.a. No, don't merge:

If we decide not to merge, then we firm up the contract between the two gems -- both at installation and at runtime -- and invest in integration testing. [1]

Nokogiri and Loofah use Nokogumbo for HTML5 parsing, and update their tests to match the new parser behavior. These gems cut a release, adding Nokogumbo to Rails's dependency graph.

3.b. Yes, please merge:

If we decide to merge, then I'd like to suggest Nokogumbo merge into Nokogiri, rather than both merging into a net-new gem; but I am open to a conversation about other approaches. Please let me know if this is a contentious idea, and I'm happy to jump on a realtime call to discuss it.

The core contributors for the merged gem would be the set-union of the Nokogumbo and Nokogiri core contributors, and the commit history for both projects would be preserved. We would make a big deal out of it, and graciously name and thank the people who've made the projects a success. Yay us!

Immediate work would be to functionally integrate the code, and then get precompilation (native gems) working and tested for Nokogiri::HTML5 and Gumbo for a Nokogiri v1.12.0 release -- or potentially a v2.0 release? -- in 2021.

Brief thoughts on our options

My mental model is such that I think it makes most sense to merge (3.b.), but would really like the key Nokogumbo creators and contributors to talk about their own mental models, opinions, fears, etc.

I suspect that keeping the gems separate is going to lead to unhappiness among the maintainers. The Nokogumbo maintainers will inevitably shift focus from "making HTML5 parsing work in Ruby" to "making this C codebase compile and install on an infinite variety of ill-configured systems" as the Rails community comes to depend on it. (Ask me how I know! :sob:) The Nokogiri and Loofah maintainers will experience burnout when their sphere of responsibility doesn't match their sphere of control -- specifically when users loudly complain about something not working perfectly across the integration boundaries.

One reason to not merge might be the size and complexity of the codebase increases for both teams in a merger. However, I think we'd more than offset or mitigate that complexity when we plan to reduce the total support burden by: a) improving the installation experience, b) improving the developer experience, c) removing a vector (libxml2's HTML parser) for security vulnerabilities, d) replacing the informal brittle install-time contract with a robust compile-time contract, e) having a unified team large enough for folks to still specialize within the codebase.

[1]: Nokogumbo's installation process is still pretty tightly-coupled to Nokogiri's installation process, and there's an implied contract between the two that involves the structure of Nokogiri::VersionInfo and the presence of header files on disk. So we'd need to tighten up that contract, and ensure we're integration-testing across the two projects.

Loofah and Nokogiri will be updated to use Nokogumbo (only on CRuby), making Nokogumbo a dependency of Rails. Nokogumbo support burden would likely increase as a thundering herd of Rails developers are now installing Nokogumbo.

I expect some demand for a faster/better installation experience for Nokogumbo, similar to what Nokogiri has seen over the past few years. Work to keep the support burden low by investing in a single installer toolchain, and a unified memory testing approach (today Nokogiri uses valgrind, but I'm looking into ASan). We'd likely spend some time re-implementing the toolchain and keeping it in sync across the two projects. This doesn't sound like fun to me, but we could do it.

rubys commented 4 years ago

One gem would also mean that nokogumbo can drop the "no headers available" fallback.

You don't need my permission to pursue the "one gem" approach, but you have it.

tenderlove commented 4 years ago

I personally like 3b. I think it would make things easier to maintain.

rubys commented 4 years ago

One possible change to consider for 2020: make Nokogiri::HTML4 an alias for Nokogiri::HTML, and keep the HTML4 logic in place for a transitional period once HTML5 logic is introduced.

stevecheckoway commented 4 years ago

I'm definitely on board with having a single gem that works with HTML.

One thing that I hadn't fully appreciated earlier was that the HTML living standard isn't designed to be a separate standard. Browsers don't have multiple parsers, they all just parse HTML and this specifies how to do it.

Currently, Nokogumbo will refuse to parse any html that doesn't start with <!DOCTYPE html>. This includes parsing any HTML that has a DOCTYPE specifying HTML 4. I suspect we could (and should) lift that restriction, either in Nokogumbo or in Nokogiri.

The major concern I have with doing that comes with serialization. For example, it should be no problem to parse HTML with

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

(and indeed gumbo supports it). But when it comes to serializing it, if we follow the Serializing HTML fragments algorithm, the DOCTYPE will become <!DOCTYPE html>.¹. This might surprise users and may break their use case. If we don't follow the algorithm and instead write out the full DOCTYPE, will the result actually be valid according to the DTD? I have no idea.

One additional concern, libxml2 will insert nodes into the DOM that aren't specified by the standard. I forget exactly what it is off-hand, but I think it might insert a meta specifying the character set. Would we want to maintain such quirks for backwards compatibility?

¹ The algorithm for serializing a DocumentType only specifies writing out the name and not the publicId or systemId.

rubys commented 4 years ago

Currently, Nokogumbo will refuse to parse any html that doesn't start with <!DOCTYPE html>.

Please explain. Here is what I'm seeing:

$ irb
irb(main):001:0> require 'nokogumbo'
=> true
irb(main):002:0> Nokogiri::HTML5('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org
TR/html4/loose.dtd">foo')
=> #<Nokogiri::HTML5::Document:0x118 name="document" children=[#<Nokogiri::XML::DTD:0xb4 name="html">, #<Nokogiri::XML::Element:0x104 name="html" children=[#<Nokogiri::XML::Element:0xc8 name="head">, #<Nokogiri::XML::Element:0xf0 name="body" children=[#<Nokogiri::XML::Text:0xdc "foo">]>]>]>
irb(main):003:0>

stevecheckoway commented 4 years ago

Oh! I was confusing that with this.

flavorjones commented 3 years ago

Closing, please follow #2204 for the exciting conclusion.

ilyazub commented 1 year ago

Even though this issue is closed, I still want to mention the faster libxml2 and gumbo alternative^[1] that we at SerpApi have been using in production for a year — the Lexbor^[2] library via the Nokolexbor gem^[3].

The Nokogumbo merge into Nokogiri was an epic work.

@flavorjones What do you think about using Nokolexbor for the HTML processing in Nokogiri?

flavorjones commented 12 months ago

@ilyazub please start a new issue.

sparklemotion / nokogiri