serpapi / nokolexbor

High-performance HTML5 parser for Ruby based on Lexbor, with support for both CSS selectors and XPath.

7x slower than Nokogiri with small documents #10

Closed joelmoss closed 11 months ago

joelmoss commented 1 year ago

Not sure if I've missed something here, but even though Nokolexbor is faster with large documents, it's actually quite a lot slower when you give it something simple. In the case below, it's over 7x slower!

ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin22]
Warming up --------------------------------------
    Nokolexbor parse     2.064k i/100ms
      Nokogiri parse    13.850k i/100ms
Calculating -------------------------------------
    Nokolexbor parse     18.963k (± 5.8%) i/s -    379.776k in  20.097018s
      Nokogiri parse    139.865k (± 3.9%) i/s -      2.798M in  20.032824s

Comparison:
      Nokogiri parse:   139864.8 i/s
    Nokolexbor parse:    18963.3 i/s - 7.38x  slower

require 'benchmark/ips'
require 'nokogiri'
require 'nokolexbor'

content = %(<h1 class="hello">Hello World</h1>)

Benchmark.ips do |x|
  x.warmup = 5
  x.time = 20

  x.report('Nokolexbor parse') do
    Nokolexbor::HTML(content)
  end
  x.report('Nokogiri parse') do
    Nokogiri::HTML(content)
  end
  x.compare!
end

Is there a reason for this? And is there anything we can do to speed it up? I would hate to have to use both libs - one for small docs and the other for larger ones.

thx

lexborisov commented 1 year ago

Hi @joelmoss

I really like benchmarks. Especially when they compare a house to a car.

Let's get the definitions straight right away. The HTML5 specification has existed for a long time as a living standard; strictly speaking there is no "HTML5" as such, only the modern living standard. Put crudely, it also "understands" HTML4.

Next, you use Nokogiri::HTML for the comparison. The claim that it implements the HTML4 standard is fundamentally wrong; there has been no HTML4 for a long time. Moreover, Nokogiri::HTML uses libxml, which does not conform to any standard.

If I were you, I would compare it to Nokogiri::HTML5 (Gumbo). In other words, you are comparing things that are incomparable.

Lexbor follows the living standard.

joelmoss commented 1 year ago

I took the benchmark code straight from https://github.com/serpapi/nokolexbor/blob/master/bench/bench.rb#L23-L28 🤔

So is that wrong too then?

lexborisov commented 1 year ago

@joelmoss

So is that wrong too then?

Yes, this is not the right benchmark; you should use Nokogiri::HTML5. You will see a significant slowdown. Nokogiri::HTML5 is Gumbo, which is designed according to the HTML5 specification (though it seems it hasn't been updated in a very long time). Nokogiri::HTML is libxml, which is just not worth comparing against: libxml does not conform to the standard.

I tried your test in C. Lexbor processes 2,100,000 iterations in 20 seconds. Nokogiri::HTML5:

Warming up --------------------------------------
      Nokogiri parse     8.150k i/100ms
Calculating -------------------------------------
      Nokogiri parse     79.390k (± 6.5%) i/s -      1.581M in  20.004949s

I cannot be responsible for the implementation of nokolexbor.

joelmoss commented 1 year ago

Well now I'm just confused... 🤷‍♂️

ruby 3.2.2 (2023-03-30 revision e51014f9c0) +YJIT [arm64-darwin22]
Warming up --------------------------------------
Nokolexbor::HTML parse
                         1.938k i/100ms
Nokogiri::HTML parse    13.643k i/100ms
Nokogiri::HTML5 parse
                        12.865k i/100ms
Calculating -------------------------------------
Nokolexbor::HTML parse
                         19.498k (± 5.9%) i/s -    389.538k in  20.048820s
Nokogiri::HTML parse    137.940k (± 4.3%) i/s -      2.756M in  20.015707s
Nokogiri::HTML5 parse
                        127.211k (± 2.9%) i/s -      2.547M in  20.040877s

Comparison:
Nokogiri::HTML parse:   137940.5 i/s
Nokogiri::HTML5 parse:   127211.2 i/s - 1.08x  slower
Nokolexbor::HTML parse:    19497.9 i/s - 7.07x  slower

require 'benchmark/ips'
require 'nokogiri'
require 'nokolexbor'

content = %(<h1 class="hello">Hello World</h1>)
Benchmark.ips do |x|
  x.warmup = 5
  x.time = 20
  x.report('Nokolexbor::HTML parse') do
    Nokolexbor::HTML(content)
  end
  x.report('Nokogiri::HTML parse') do
    Nokogiri::HTML(content)
  end
  x.report('Nokogiri::HTML5 parse') do
    Nokogiri::HTML5(content)
  end
  x.compare!
end

lexborisov commented 1 year ago

@joelmoss

This is a question of how the Nokolexbor wrapper over lexbor is implemented. On small pages (one tag) lexbor is either equal to the others or slightly faster. There's just nothing to compare: this benchmark measures which parser initializes faster, not parsing.

There's nothing to say about regular pages. Lexbor is much faster than the others.
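
To put numbers on the initialization point, the throughput figures from the three-way benchmark above can be converted into a cost per call. This is a minimal sketch using only the rates reported in this thread; the helper names `microseconds_per_call` and `slowdown` are made up for illustration. It shows roughly 51 microseconds per call for Nokolexbor against roughly 7 for Nokogiri, which is consistent with a fixed per-call cost dominating a one-tag document, though the numbers alone do not prove where that time goes.

```ruby
# Iterations per second, copied from the three-way benchmark output in this thread.
rates = {
  'Nokogiri::HTML'   => 137_940.5,
  'Nokogiri::HTML5'  => 127_211.2,
  'Nokolexbor::HTML' => 19_497.9
}

# Convert a throughput (iterations per second) into a per-call cost in microseconds.
def microseconds_per_call(iterations_per_second)
  1_000_000.0 / iterations_per_second
end

# Ratio of two per-call costs, which is how benchmark-ips phrases "Nx slower".
def slowdown(costs, slower, faster)
  costs.fetch(slower) / costs.fetch(faster)
end

costs = rates.transform_values { |rate| microseconds_per_call(rate) }

costs.each do |name, cost|
  puts format('%-18s %6.2f us per call', name, cost)
end

puts format('Nokolexbor::HTML vs Nokogiri::HTML: %.2fx slower',
            slowdown(costs, 'Nokolexbor::HTML', 'Nokogiri::HTML'))
```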

joelmoss commented 1 year ago

I think you're mainly talking about lexbor, but I'm talking about nokolexbor, and comparing that to nokogiri. I'm using Ruby, and that means I need to use nokolexbor. If I can use just lexbor in Ruby, then great, but I'm pretty sure the only way to do that is to use nokolexbor.

So I can only compare nokolexbor to nokogiri, and the former appears to be slower with smaller documents. My initial reason for creating this issue was to try and find out why nokolexbor is slower than nokogiri for small docs, but faster with bigger docs.

Do you have any ideas why nokolexbor would be slower than nokogiri on small docs?

lexborisov commented 1 year ago

@joelmoss

I have looked at the code, and the author of this wrapper creates a new parser for each HTML document it parses. This is not correct.

Lexbor has two approaches:

  1. Create a document and call parsing (parser and tokenizer will be created for the document)
  2. Create a parser that will live forever (or you can kill it and work on the documents), and it will give documents when parsing.

I didn't bother digging further into the code.

joelmoss commented 1 year ago

ok, thx.

lexborisov commented 1 year ago

@joelmoss

You're welcome. And lastly, that's a very synthetic test you have. Regular HTML has 100+ tags at least.

joelmoss commented 1 year ago

It is, but most of these will be HTML fragments, and not always full HTML pages. Some may be as small as a couple of elements.

lexborisov commented 1 year ago

@joelmoss

There is a special function for parsing an HTML fragment for this purpose. It is described in the specification, and it has specific parsing conditions. Lexbor's implementation of it is relatively slow right now; we'll have to rewrite it. In general, parsing a fragment is a separate function, and every parser should have it.

zyc9012 commented 1 year ago

Thanks @joelmoss for the feedback. I'll definitely look into it.

I have looked at the code, and the author of this wrapper creates a new parser for each HTML document it parses. This is not correct.

Lexbor has two approaches:

  1. Create a document and call parsing (parser and tokenizer will be created for the document)
  2. Create a parser that will live forever (or you can kill it and work on the documents), and it will give documents when parsing.

@lexborisov I think I did exactly the same as 1. Do you mean 2 is faster than 1?

https://github.com/serpapi/nokolexbor/blob/18f1c26fab54de8d6f850de827246e1c2c3fbb71/ext/nokolexbor/nl_document.c#L35-L62

lexborisov commented 1 year ago

Hi @zyc9012

Each call to lxb_html_document_parse() will create a parser and tokenizer. That's pretty wasteful. You can, perhaps even should, create a parser and use the lxb_html_parse() function - it returns the document. That is, the parser will be in memory all the time.
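
The difference between the two strategies can be pictured with a toy model in Ruby. This is only a sketch: `ToyParser` is a hypothetical stand-in and not lexbor's or Nokolexbor's API. It counts how often the expensive setup runs when a parser is built per document (the pattern attributed to `lxb_html_document_parse()` above) versus when one long-lived parser is reused (the `lxb_html_parse()` pattern).

```ruby
# Toy model only: ToyParser is a hypothetical stand-in, not a real parser API.
class ToyParser
  # Counts constructions across all instances so the two strategies can be compared.
  @constructions = 0
  class << self
    attr_accessor :constructions
  end

  def initialize
    self.class.constructions += 1
    # Stand-in for expensive one-time setup (tokenizer, tree builder, buffers).
    @scratch = Array.new(4096) { 0 }
  end

  # Returns the opening tag names found in `html`; a real parser would build a tree.
  def parse(html)
    html.scan(/<([a-zA-Z][a-zA-Z0-9]*)/).flatten
  end
end

# Strategy 1: a new parser for every document.
def parse_each_with_new_parser(documents)
  documents.map { |html| ToyParser.new.parse(html) }
end

# Strategy 2: one long-lived parser shared by all documents.
def parse_all_with_shared_parser(documents, parser = ToyParser.new)
  documents.map { |html| parser.parse(html) }
end

documents = Array.new(1_000) { %(<h1 class="hello">Hello World</h1>) }

ToyParser.constructions = 0
parse_each_with_new_parser(documents)
per_document_constructions = ToyParser.constructions

ToyParser.constructions = 0
parse_all_with_shared_parser(documents)
shared_constructions = ToyParser.constructions

puts "new parser per document: #{per_document_constructions} constructions"
puts "shared parser:           #{shared_constructions} construction"
```

The smaller each document is, the larger the share of total time that the per-document setup takes, which is why the effect shows up most clearly with a single tag.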

Although, in hindsight, it might not be worth doing anything. For @joelmoss's needs we should use fragment parsing.

zyc9012 commented 1 year ago

Thanks @lexborisov

I've tested 2 but the result was pretty much the same.

From what I've experimented with so far, it seems lxb_html_document_create takes a large part of the total time for parsing a small document such as a single tag. It takes longer than the whole parsing process of libxml2 (haven't tested Gumbo). Maybe the multiple allocation calls are expensive?

For @joelmoss's needs we should use fragment parsing.

If you are referring to lxb_html_document_parse_fragment, it still needs a lxb_html_document_t right? So we can't avoid creating a document.

lexborisov commented 1 year ago

@zyc9012

I've tested 2 but the result was pretty much the same.

In my tests, the speed almost doubles, which makes sense. But none of that matters. I would just ignore it.

Lexbor is now slower to process very small HTML because of the heavy initialization of the Document object. This can be fixed relatively easily, but I wouldn't pursue it right now. It seems that HTML with less than 10 tags is very rare. In other words, I don't see it as a problem.

zyc9012 commented 1 year ago

This can be fixed relatively easily, but I wouldn't pursue it right now. It seems that HTML with less than 10 tags is very rare. In other words, I don't see it as a problem.

@lexborisov I agree. We are really happy with the speed improvement from Lexbor. But since @joelmoss has raised the question, I want to give it a try. Could you share some insight on a possible way to fix it? I'll see if I can do something to patch it.

zyc9012 commented 1 year ago

Create a parser that will live forever (or you can kill it and work on the documents), and it will give documents when parsing.

@lexborisov Is it thread-safe to use a global lxb_html_parser_t for all the parsing jobs?

zyc9012 commented 1 year ago

Hi @joelmoss.

I've pushed a fix to master according to https://github.com/serpapi/nokolexbor/issues/10#issuecomment-1675223014. Will you be able to check the performance?

On my local test, Nokolexbor is only slower when the HTML has less than 5 tags.

  1 tag: Nokolexbor is 1.80x slower.
  2 tags: Nokolexbor is 1.46x slower.
  3 tags: Nokolexbor is 1.29x slower.
  4 tags: Nokolexbor is 1.13x slower.
  5 tags: Nokolexbor is 1.02x faster.
  6 tags: Nokolexbor is 1.07x faster.
  7 tags: Nokolexbor is 1.12x faster.
  8 tags: Nokolexbor is 1.27x faster.
  9 tags: Nokolexbor is 1.33x faster.
  10 tags: Nokolexbor is 1.41x faster.
  50 tags: Nokolexbor is 2.20x faster.
  100 tags: Nokolexbor is 2.76x faster.
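
For anyone who wants to check a size sweep like this locally, one possible shape is sketched below. It is not the script used for the numbers above: `html_with_tags` and `sweep` are hypothetical helpers, and the generated markup (sibling `<p>` elements) is an assumption about what "N tags" means. The sketch uses only the standard library and takes the parsers as callables, so the gem calls appear as a comment and a dependency-free stand-in keeps it runnable on its own.

```ruby
require 'benchmark'

# Hypothetical helper: an HTML snippet made of `count` sibling elements.
def html_with_tags(count)
  Array.new(count) { |i| %(<p class="item">Item #{i}</p>) }.join
end

# Times `iterations` calls of every parser for every size.
# `parsers` maps a label to any callable that accepts an HTML string.
# Returns { size => { label => elapsed seconds } }.
def sweep(parsers, sizes, iterations: 1_000)
  sizes.to_h do |size|
    html = html_with_tags(size)
    timings = parsers.transform_values do |parser|
      Benchmark.realtime { iterations.times { parser.call(html) } }
    end
    [size, timings]
  end
end

# With both gems installed and required, the parsers could be wired in like this:
#   parsers = {
#     'Nokolexbor' => ->(html) { Nokolexbor::HTML(html) },
#     'Nokogiri'   => ->(html) { Nokogiri::HTML(html) }
#   }
# A stand-in that only counts opening tags is used here instead.
parsers = { 'tag counter' => ->(html) { html.scan(/<p\b/).size } }

sweep(parsers, [1, 2, 5, 10, 50, 100], iterations: 200).each do |size, timings|
  timings.each do |label, seconds|
    puts format('%3d tags  %-12s %.6f s', size, label, seconds)
  end
end
```

Wall-clock timing with `Benchmark.realtime` is cruder than benchmark-ips, so for small differences the benchmark-ips setup from the original report is the better tool; the point here is only the loop over document sizes.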

ilyazub commented 11 months ago

@nwellnhof recently improved the libxml2 parser performance

@zyc9012 Do you think it may speed up Nokolexbor when XPath is used?

nwellnhof commented 11 months ago

The changes to libxml2 mentioned above only apply to the XML parser. They don't affect HTML parsing or XPath processing.

zyc9012 commented 11 months ago

Released 0.5.2 and closing this based on https://github.com/serpapi/nokolexbor/issues/10#issuecomment-1701939446.