Nokogiri should use Gumbo/HTML5 by default on supported platforms

flavorjones commented 3 years ago

This issue is a placeholder for work to be done to use the HTML5 parsing engine by default on platforms where it's supported (meaning, not-JRuby).

Specifically this probably means that when the HTML5 module exists ...

Nokogiri::HTML() should proxy to Nokogiri::HTML5()
Nokogiri.parse() should proxy to Nokogiri::HTML5.parse()
bin/nokogiri should support html4 and html5 options, and the existing html option should proxy to html5

There may be other behaviors we want to switch to the HTML5 parser as well.

Let's also please make sure to do some benchmarks before changing the default behavior. In particular this would be document and fragment parsing, as well as any CSS selectors that are conditionally translated (see #2376).

Pre-work:

flavorjones commented 2 years ago

Moving this out one release to make sure the CSS query hacking is stable.

flavorjones commented 2 years ago

A baby step we'll do first is: support subclassing in v1.14.0 to enable Loofah and Rails::Html::Sanitizer to default to HTML5:

So I'm pushing this out to v1.15.0

flavorjones commented 2 years ago

See https://github.com/sparklemotion/nokogiri/issues/2569 for one performance concern that we should benchmark and address.

Update: it's been addressed.

flavorjones commented 1 year ago

It may be worth testing upstream Capybara before releasing this change, as Capybara has some logic for toggling between HTML4 and HTML5 parsers already.

zyc9012 commented 1 year ago

Nokogiri::HTML5 is slow on initialization as well.

[29] pry(main)> html = File.read('big_shopping.html')
[30] pry(main)> Nokogiri::VERSION
=> "1.13.9"
[31] pry(main)> Benchmark.ms { Nokogiri::HTML4(html) }
=> 54.85699977725744
[32] pry(main)> Benchmark.ms { Nokogiri::HTML5(html) }
=> 227.6209993287921

Tested HTML: big_shopping.html.zip

stevecheckoway commented 1 year ago

We should profile and figure out where it is spending the majority of its time.

One thing that could probably be improved is turning the part of the state machine that reads characters into a buffer into a loop rather than needing to do the state machine dispatch per character.

This potentially includes reading tag names and just reading text.

The state machine is complicated though and we’d need to look at the tests closely to make sure they adequately cover such a change.

flavorjones commented 1 year ago

@stevecheckoway here's some profiling info on what the parser is doing when parsing big_shopping.html (generated with gperftools cpu profiler):

      93  11.7%  11.7%       93  11.7% decode (inline)
      81  10.2%  21.9%      640  80.4% gumbo_parse_with_options
      74   9.3%  31.2%       75   9.4% pthread_attr_setschedparam
      56   7.0%  38.2%       56   7.0% _init@3e000
      41   5.2%  43.3%      331  41.6% gumbo_lex
      40   5.0%  48.4%      135  17.0% read_char
      35   4.4%  52.8%       52   6.5% gumbo_string_buffer_append_codepoint
      33   4.1%  56.9%      181  22.7% handle_token (inline)
      28   3.5%  60.4%       28   3.5% get_current_node.isra.0
      20   2.5%  62.9%      149  18.7% finish_token.isra.0
      17   2.1%  65.1%       65   8.2% insert_text_token.isra.0
      16   2.0%  67.1%       16   2.0% get_adjusted_current_node
      15   1.9%  69.0%      170  21.4% emit_char
      14   1.8%  70.7%       14   1.8% atomic_sub_nounderflow (inline)
      14   1.8%  72.5%       14   1.8% xmlStrdup (inline)
      12   1.5%  74.0%       12   1.5% gumbo_tokenizer_set_is_adjusted_current_node_foreign
      11   1.4%  75.4%       14   1.8% handle_text
      10   1.3%  76.6%       17   2.1% maybe_resize_string_buffer
       9   1.1%  77.8%        9   1.1% update_position (inline)
       8   1.0%  78.8%       26   3.3% __libc_malloc
...

html5

I won't attempt to analyze that today, but the amount of time spent in pthread_attr_setschedparam makes me scratch my head, as does the _init entry. Anything in particular you want me to dig into?

stevecheckoway commented 1 year ago

Some of that is quite odd. The fact that gumbo_tokenizer_set_is_adjusted_current_node_foreign is showing up at all is surprising.

I just looked at the assembly expecting that the code would be essentially be a few loads and a store. This is the only line of the function that does anything:

  parser->_tokenizer_state->_is_adjusted_current_node_foreign = is_foreign;

The assembly that is being emitted contains calls to the empty gumbo_debug function. I've got a simple fix for that:

diff --git a/gumbo-parser/src/util.c b/gumbo-parser/src/util.c
index d1ab2d7a..6238c296 100644
--- a/gumbo-parser/src/util.c
+++ b/gumbo-parser/src/util.c
@@ -63,6 +63,4 @@ void gumbo_debug(const char* format, ...) {
   va_end(args);
   fflush(stdout);
 }
-#else
-void gumbo_debug(const char* UNUSED_ARG(format), ...) {}
 #endif
diff --git a/gumbo-parser/src/util.h b/gumbo-parser/src/util.h
index dfdf465b..5c6ddd8c 100644
--- a/gumbo-parser/src/util.h
+++ b/gumbo-parser/src/util.h
@@ -21,7 +21,11 @@ void* gumbo_realloc(void* ptr, size_t size) RETURNS_NONNULL;
 void gumbo_free(void* ptr);

 // Debug wrapper for printf
+#ifdef GUMBO_DEBUG
 void gumbo_debug(const char* format, ...) PRINTF(1);
+#else
+static inline void gumbo_debug(const char* UNUSED_ARG(format), ...) PRINTF(1) {}
+#endif

 #ifdef __cplusplus
 }

After that change, the emitted code is

                     _gumbo_tokenizer_set_is_adjusted_current_node_foreign:
0000000000137474         ldr        x8, [x0, #0x10]                             ; CODE XREF=_gumbo_parse_with_options+908
0000000000137478         strb       w1, [x8, #0x5]
000000000013747c         ret

But even that could be inlined with link-time optimization. (In fact, doing a link time optimization and also explicitly setting the symbols that are exported from the nokogiri.{bundle,so,dll} is likely to have a measurable performance win. Setting which symbols are exported should speed up program startup times in particular.

Those two seem like easy wins although setting which symbols are exported may impact downstream projects that rely on the Nokogiri C extension's symbols (like Nokogumbo did). Unfortunately, the 2-pass procedure where we first build a DOM tree using Gumbo's data structures and then build a DOM tree using libxml2's data structures does mean we have some essentially unavoidable overhead. Maybe gumbo could be modified to build a libxml2 DOM itself. That's likely a significant undertaking.

flavorjones commented 1 year ago

I'm going to open up a new issue to dive into some these optimizations, since this issue is related but not specific to performance. Let's move the performance conversation to #2722.

flavorjones commented 3 months ago

Setting this to the v2.0 milestone since it's likely to be a breaking change for some folks.

sparklemotion / nokogiri

Nokogiri should use Gumbo/HTML5 by default on supported platforms #2331

2722