joelmoss closed this issue 11 months ago.
Hi @joelmoss
I really like benchmarks. Especially when they compare a house to a car.
Let's get the definitions straight right away. For a long time now there has been the HTML5 specification, which is a living standard; strictly speaking there is no "HTML5" anymore, only the modern living standard. Crudely put, it "understands" HTML4.
Next, you use Nokogiri::HTML for comparison.
Claiming it represents the HTML4 standard is fundamentally wrong; HTML4 has not existed for a long time. Moreover, Nokogiri::HTML uses libxml, which does not conform to any standard.
If I were you, I would compare it to Nokogiri::HTML5 (Gumbo).
In other words, you are comparing things that are incomparable.
lexbor follows the living standard.
I took the benchmark code straight from https://github.com/serpapi/nokolexbor/blob/master/bench/bench.rb#L23-L28 🤔
So is that wrong too then?
@joelmoss
So is that wrong too then?
Yes, this is not the right benchmark; you should use Nokogiri::HTML5.
You will see a significant drop. Nokogiri::HTML5 is Gumbo, which is designed according to the HTML5 specification (though it seems it hasn't been updated in a very long time).
Nokogiri::HTML is libxml, which is simply not worth comparing against. libxml does not conform to the standard.
I tried your test in C. lexbor processes 2,100,000 iterations in 20 seconds.
Nokogiri::HTML5:
Warming up --------------------------------------
Nokogiri parse 8.150k i/100ms
Calculating -------------------------------------
Nokogiri parse 79.390k (± 6.5%) i/s - 1.581M in 20.004949s
I cannot be responsible for the implementation of nokolexbor.
Well now I'm just confused... 🤷♂️
ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin22]
Warming up --------------------------------------
Nokolexbor::HTML parse     1.938k i/100ms
  Nokogiri::HTML parse    13.643k i/100ms
 Nokogiri::HTML5 parse    12.865k i/100ms
Calculating -------------------------------------
Nokolexbor::HTML parse     19.498k (± 5.9%) i/s -    389.538k in 20.048820s
  Nokogiri::HTML parse    137.940k (± 4.3%) i/s -      2.756M in 20.015707s
 Nokogiri::HTML5 parse    127.211k (± 2.9%) i/s -      2.547M in 20.040877s
Comparison:
Nokogiri::HTML parse: 137940.5 i/s
Nokogiri::HTML5 parse: 127211.2 i/s - 1.08x slower
Nokolexbor::HTML parse: 19497.9 i/s - 7.07x slower
require 'benchmark/ips'
require 'nokogiri'
require 'nokolexbor'

content = %(<h1 class="hello">Hello World</h1>)

Benchmark.ips do |x|
  x.warmup = 5
  x.time = 20

  x.report('Nokolexbor::HTML parse') do
    Nokolexbor::HTML(content)
  end

  x.report('Nokogiri::HTML parse') do
    Nokogiri::HTML(content)
  end

  x.report('Nokogiri::HTML5 parse') do
    Nokogiri::HTML5(content)
  end

  x.compare!
end
@joelmoss
It is a question of how the Nokolexbor wrapper is implemented on top of lexbor.
On small pages (one tag), lexbor is either equal to the others or slightly faster.
There's just nothing to compare here. This is not a comparison of parsing, but of parser initialization, to see which is faster.
On regular pages there is nothing to discuss: lexbor is much faster than the others.
I think you're mainly talking about lexbor, but I'm talking about nokolexbor, and comparing that to nokogiri. I'm using Ruby, and that means I need to use nokolexbor. If I can use just lexbor in Ruby, then great, but I'm pretty sure the only way to do that is to use nokolexbor.
So I can only compare nokolexbor to nokogiri, and the former appears to be slower with smaller documents. My initial reason for creating this issue was to try to find out why nokolexbor is slower than nokogiri for small docs, but faster with bigger docs.
Do you have any ideas why nokolexbor would be slower than nokogiri on small docs?
@joelmoss
I have looked at the code, and the author of this wrapper creates a parser for parsing each HTML document. This is not correct.
Lexbor has two approaches:
- Create a document and call parsing (the parser and tokenizer will be created for the document).
- Create a parser that will live forever (or you can kill it and keep working with the documents), and it will return documents when parsing.
I didn't bother digging further into the code.
ok, thx.
@joelmoss
You're welcome. And lastly, that's a very synthetic test you have. Regular HTML has at least 100+ tags.
It is, but most of these will be HTML fragments, and not always full HTML pages. Some may be as small as a couple of elements.
@joelmoss
There is a special function for parsing an HTML fragment for this purpose. It is described in the specification, and it has specific parsing conditions. In lexbor this is currently implemented relatively slowly; we'll have to rewrite it. In general, fragment parsing is a separate function, and every parser should have it.
Thanks @joelmoss for the feedback. I'll definitely look into it.
I have looked at the code, and the author of this wrapper creates a parser for parsing each HTML document. This is not correct.
Lexbor has two approaches:
- Create a document and call parsing (the parser and tokenizer will be created for the document).
- Create a parser that will live forever (or you can kill it and keep working with the documents), and it will return documents when parsing.
@lexborisov I think I did exactly what approach 1 describes. Do you mean approach 2 is faster than approach 1?
Hi @zyc9012
Each call to lxb_html_document_parse() will create a parser and tokenizer. That's pretty wasteful. You can, perhaps even should, create a parser and use the lxb_html_parse() function - it returns the document. That is, the parser will be in memory all the time.
Although, in hindsight, it might not be worth doing anything. For @joelmoss's needs we should use fragment parsing.
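For illustration, here is a minimal C sketch of the two approaches described above, assuming the lexbor API as used in its examples (lxb_html_document_parse for the per-document path; a long-lived lxb_html_parser_t with lxb_html_parse for the reusable path). Exact signatures should be checked against the lexbor headers.

#include <lexbor/html/html.h>

static const lxb_char_t html[] = "<h1 class=\"hello\">Hello World</h1>";
static const size_t html_len = sizeof(html) - 1;

/* Approach 1: each document creates (and tears down) its own parser and tokenizer. */
static void parse_per_document(void)
{
    lxb_html_document_t *document = lxb_html_document_create();
    if (document == NULL) return;

    if (lxb_html_document_parse(document, html, html_len) != LXB_STATUS_OK) {
        /* handle error */
    }

    lxb_html_document_destroy(document);
}

/* Approach 2: one long-lived parser; each call returns a freshly parsed document. */
static void parse_with_persistent_parser(void)
{
    lxb_html_parser_t *parser = lxb_html_parser_create();
    if (lxb_html_parser_init(parser) != LXB_STATUS_OK) return;

    for (int i = 0; i < 3; i++) {
        lxb_html_document_t *document = lxb_html_parse(parser, html, html_len);
        if (document != NULL) {
            lxb_html_document_destroy(document);
        }
    }

    lxb_html_parser_destroy(parser);
}

int main(void)
{
    parse_per_document();
    parse_with_persistent_parser();
    return 0;
}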
Thanks @lexborisov
I've tested 2 but the result was pretty much the same.
From what I've experimented with so far, it seems lxb_html_document_create takes a large part of the total time for parsing a small document such as a single tag. It takes longer than the whole parsing process of libxml2 (haven't tested Gumbo). Maybe the multiple allocation calls are expensive?
For @joelmoss's needs we should use fragment parsing.
If you are referring to lxb_html_document_parse_fragment, it still needs an lxb_html_document_t, right? So we can't avoid creating a document.
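For reference, a rough sketch of what fragment parsing against a single reusable document might look like. The calls to lxb_html_document_create_element and lxb_html_document_parse_fragment are my assumption of the lexbor fragment API based on its examples, and should be verified against the headers.

#include <lexbor/html/html.h>

int main(void)
{
    static const lxb_char_t frag[] = "<h1 class=\"hello\">Hello World</h1>";

    /* One document kept alive and reused for every fragment parse
       (assumed pattern; verify against the lexbor examples). */
    lxb_html_document_t *document = lxb_html_document_create();
    if (document == NULL) return 1;

    /* A context element that the fragment is parsed against, as the
       HTML spec's fragment parsing algorithm requires. */
    lxb_dom_element_t *context = lxb_html_document_create_element(
        document, (const lxb_char_t *) "div", 3, NULL);
    if (context == NULL) return 1;

    lxb_dom_node_t *root = lxb_html_document_parse_fragment(
        document, context, frag, sizeof(frag) - 1);
    if (root == NULL) return 1;

    /* ... walk the returned fragment nodes here ... */

    lxb_html_document_destroy(document);
    return 0;
}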
@zyc9012
I've tested 2 but the result was pretty much the same.
In my tests, the speed almost doubles, which makes sense. But none of that matters; I would just ignore it.
Lexbor is now slower to process very small HTML because of the heavy initialization of the Document object. This can be fixed relatively easily, but I wouldn't pursue it right now. It seems that HTML with less than 10 tags is very rare. In other words, I don't see it as a problem.
This can be fixed relatively easily, but I wouldn't pursue it right now. It seems that HTML with less than 10 tags is very rare. In other words, I don't see it as a problem.
@lexborisov I agree. We are really happy with the speed improvement from Lexbor. But since @joelmoss has brought up the question, I want to give it a try; please share your insights on possible ways to fix it, and I'll see if I can do something to patch it.
Create a parser that will live forever (or you can kill it and keep working with the documents), and it will return documents when parsing.
@lexborisov Is it thread-safe to use a global lxb_html_parser_t for all the parsing jobs?
Hi @joelmoss.
I've pushed a fix to master according to https://github.com/serpapi/nokolexbor/issues/10#issuecomment-1675223014; will you be able to check the performance?
On my local test, Nokolexbor is only slower when the HTML has less than 5 tags.
1 tag: Nokolexbor is 1.80x slower.
2 tags: Nokolexbor is 1.46x slower.
3 tags: Nokolexbor is 1.29x slower.
4 tags: Nokolexbor is 1.13x slower.
5 tags: Nokolexbor is 1.02x faster.
6 tags: Nokolexbor is 1.07x faster.
7 tags: Nokolexbor is 1.12x faster.
8 tags: Nokolexbor is 1.27x faster.
9 tags: Nokolexbor is 1.33x faster.
10 tags: Nokolexbor is 1.41x faster.
50 tags: Nokolexbor is 2.20x faster.
100 tags: Nokolexbor is 2.76x faster.
@nwellnhof recently improved the libxml2 parser performance.
@zyc9012 Do you think it may speed up Nokolexbor when XPath is used?
The changes to libxml2 mentioned above only apply to the XML parser. They don't affect HTML parsing or XPath processing.
Released 0.5.2 and closing this based on https://github.com/serpapi/nokolexbor/issues/10#issuecomment-1701939446.
Not sure if I've missed something here, but even though Nokolexbor is faster with large documents, when you give it something simple, it's actually quite a lot slower. In the below case, it's over 7x slower!
Is there a reason for this? And is there anything we can do to speed it up? I would hate to have to use both libs: one for small docs and the other for larger ones.
thx