rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0
186 stars 114 forks source link

Performance regression in 2.0.0 #145

Closed dan42 closed 2 years ago

dan42 commented 4 years ago

document.to_s is 10x slower compared to 1.5.0

Some data:

nokogiri nokogumbo parse        dup             to_s            line (using libxml2)
1.10.9   2.0.2     0.02272594   0.004979629 0.11419759  65
1.8.5    2.0.2     0.022512514  0.004959403 0.113409996 65
1.10.9   2.0.1     0.023012862  0.004949328 0.114104255 65
1.8.5    2.0.1     0.02287916   0.00503443  0.11565286  65
1.10.9   2.0.0     0.022793292  0.004009529 0.11148991  0
1.8.5    2.0.0     0.022858597  0.005029097 0.114100392 0
1.10.9   1.5.0     0.0256769    0.005022382 0.011504857 0
1.8.5    1.5.0     0.025525698  0.005028092 0.01120582  0
1.8.3    1.5.0     0.025516003  0.005081101 0.011363533 0
stevecheckoway commented 4 years ago

I'm not terribly surprised by that. In 2.0.0, I added support for proper HTML serialization rather than relying on libxml2 which does it incorrectly. (That's not surprising, libxml2 doesn't handle modern HTML which is the whole reason for using Nokogumbo.)

If you can live with the incorrect but faster serialization, you can probably call the Nokogiri::HTML::Document#to_s method instead. (But my Ruby-fu is weak, so I don't know how to do that off-hand.)

The serialization code could be rewritten in C, I suppose.

dan42 commented 4 years ago

Thanks for the tip. Looks like I can revert to the fast serialization with something along the lines of Nokogiri::HTML5::Node.remove_method(:write_to)

There's already code in write_to that picks Nokogiri's native_write_to in the case of xml/xhtml serialization; it might be nice to add some option to force that mode even for html serialization.

stevecheckoway commented 4 years ago

As far as I know, its xml serialization is correct so it didn't make sense to duplicate that for xhtml output.

I'm hesitant to add functionality that has incorrect behavior. Maybe an option would be acceptable here. I don't think it'd work with #to_s though as that doesn't take options.

dan42 commented 4 years ago

You're right, it wouldn't work with #to_s, you'd need to use the option with #to_html which does pass its options all the way down to #write_to.

I understand your hesitance but I think a 10x difference in performance is enough to warrant a workaround. At least for me there's no way I can use the "correct" behavior as the increase in load is enough to kill my servers. Although I have to ask what is so different about the correct serialization. The only incorrectness I ever noticed in libxml serialization is that it will close certain tags that should be void, like <wbr></wbr>. A harmless closing tag that that will be ignored by the browser. And almost nobody uses that tag anyway.

stevecheckoway commented 4 years ago

From a simple test, it screws up embed, source, track, and wbr elements by inserting an end tag. It also adds a meta tag when it definitely should not.

It also mishandles pre by stripping newlines (actually, this behavior comes from Gumbo because that's what the spec says to do; libxml2 is just not adding an additional newline when required. In fact, Nokogumbo's implementation does the same by default. This is probably a bad default, but it matches the HTML fragment serialization algorithm and can be overridden by passing preserve_newline: true to #serialize).

And finally, iirc, it doesn't handle serializing HTML fragments correctly at all; although I forget the details.

Since you have a work around, my preference is to not expose libxml2's serialization and instead, write a more performant but correct serialization function in C. But before that, I guess I should profile the ruby code and see if there's an obvious inefficiency I can correct.

dan42 commented 4 years ago

Thank you for the informative answer. I agree it's better here to aim for a correct fix than a quick hacky fix. The remove_method hack is a good enough workaround in the meantime.

stevecheckoway commented 4 years ago

I'm going to leave this open, if it's alright with you, so that I don't forget about it.

dan42 commented 4 years ago

Ok, so it turns out that my workaround has a fatal flaw. <svg> elements are created with a 'svg' xmlns. So this <svg role="img" class="icon"><use xlink:href="/url"></use></svg> is serialized as <svg:svg role="img" class="icon"><svg:use xlink:href="/url"></svg:use></svg:svg> which unsurprisingly doesn't work in browsers.

This wasn't the case in 1.5.0; I wonder what's the point of creating nodes with a namespace since it's all discarded later in serialize_node_internal?

stevecheckoway commented 4 years ago

The point is that nokogumbo does more than read in text and spit out text. It builds the DOM which is accessible via the XPath which should have foreign nodes with namespaces.

Furthermore, having proper support for namespaces means nokogumbo gets to use html5lib-tests to test its DOM construction works correctly. See, for example, https://github.com/html5lib/html5lib-tests/blob/master/tree-construction/webkit02.dat#L270.

As I've said before, using the HTML 4 serializer simply does the wrong thing in several places. I'm not surprised serializing foreign elements is one of them.

dan42 commented 4 years ago

Ah, interesting. So you're saying that doc.xpath('//svg:svg') is the correct way of finding svg elements.

I fully understand that Nokogumbo is doing everything right as per the html5 spec. I just wanted to point out to anyone reading this ticket that the previously described workaround/hack doesn't work.

flavorjones commented 2 years ago

@stevecheckoway Are you still concerned about performance of serialization of HTML5 documents? If so, I'll move this issue over to Nokogiri so we can track it as we consider whether HTML5 should be the default in a future release.

stevecheckoway commented 2 years ago

I am. I think we should probably write the serialization code in C as it is pretty slow in Ruby. It's not particularly difficult to write. I just never got around to it.

flavorjones commented 2 years ago

I'm moving this issue over to https://github.com/sparklemotion/nokogiri/issues/2569