serpapi / nokolexbor

High-performance HTML5 parser for Ruby based on Lexbor, with support for both CSS selectors and XPath.

Low-level API for creating a document/fragment without parsing an HTML string #9

Closed joeldrapper closed 1 year ago

joeldrapper commented 1 year ago

I maintain a Ruby-based view component library called Phlex, which takes a Ruby structure and turns it into an HTML String. I’m wondering if it might be possible (and reasonable) to alternatively turn it into a Nokolexbor syntax tree directly, skipping the HTML rendering and parsing steps for performance.

Basically, instead of returning an HTML string, a Phlex component could optionally return a Nokolexbor::DocumentFragment, which could then be used for testing or further DOM manipulation.

It may not work since we’d have to spend a lot of time in Ruby land calling lots of Ruby methods to build this document (they’d have to be faster than String#<< for this to make sense). And if it does work, it may not be worth it since it sounds like Nokolexbor is already really fast at parsing HTML. I thought it might be an interesting idea to explore anyway.

What are your thoughts?

zyc9012 commented 1 year ago

@joeldrapper If you mean building a Nokolexbor::DocumentFragment by manually creating elements, that's already supported. Example:

doc = Nokolexbor::Document.new
frag = Nokolexbor::DocumentFragment.new(doc)
div = doc.create_element('div', "Content", { class: 'a b c', style: 'e f g' })
div << doc.create_element('a', "Link text", { href: 'https://www.google.com' })
div << doc.create_element('span', "xxx", { class: 'some_class' })
frag << div

I also benchmarked creating a Nokolexbor::DocumentFragment by hand against creating it by parsing, and parsing is the clear winner.

require 'benchmark/ips'
require 'nokolexbor'

def create_fragment_by_hand
  doc = Nokolexbor::Document.new
  frag = Nokolexbor::DocumentFragment.new(doc)
  (1..50).each do |i|
    div = doc.create_element('div', "Content #{i}", { class: 'a b c', style: 'e f g' })
    div << doc.create_element('a', "Link text", { href: 'https://www.google.com' })
    (1..50).each do |j|
      div << doc.create_element('span', j, { class: 'some_class' })
    end
    frag << div
  end
  frag
end

@html = create_fragment_by_hand.to_html

def create_fragment_by_parsing
  doc = Nokolexbor::Document.new
  frag = doc.fragment(@html)
end

raise "HTML output not equal" if create_fragment_by_parsing.to_html != create_fragment_by_hand.to_html

Benchmark.ips do |x|
  x.warmup = 2
  x.time = 10

  x.report("Create fragment by hand") do
    create_fragment_by_hand
  end
  x.report("Create fragment by parsing") do
    create_fragment_by_parsing
  end
  x.compare!
end

Output

Warming up --------------------------------------
Create fragment by hand
                        20.000  i/100ms
Create fragment by parsing
                        93.000  i/100ms
Calculating -------------------------------------
Create fragment by hand
                        293.214  (± 5.8%) i/s -      2.940k in  10.063928s
Create fragment by parsing
                          1.257k (±18.0%) i/s -     12.183k in  10.087883s

Comparison:
Create fragment by parsing:     1257.4 i/s
Create fragment by hand:      293.2 i/s - 4.29x  slower

However, this benchmark may not be representative, because in your case you would first have to build the HTML with String#<<, which also takes time. It would be better to benchmark your specific scenario (create by hand vs. build HTML + parse).
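To put a number on the String#<< step, here is a minimal sketch that builds the same 50 x 50 structure as a plain string and times it with Ruby's standard Benchmark module. The method name build_html_string and the iteration count are made up for illustration, and the content is assumed to need no HTML escaping. Adding this cost to the parsing time gives a figure that can be compared with creating the fragment by hand.

```ruby
require 'benchmark'

# Hypothetical stand-in for a renderer: builds the same 50 x 50 structure
# as the benchmark above, using only String#<<.
# Assumes the content needs no HTML escaping.
def build_html_string
  html = +''
  (1..50).each do |i|
    html << '<div class="a b c" style="e f g">Content ' << i.to_s
    html << '<a href="https://www.google.com">Link text</a>'
    (1..50).each do |j|
      html << '<span class="some_class">' << j.to_s << '</span>'
    end
    html << '</div>'
  end
  html
end

iterations = 200 # arbitrary; raise it for steadier numbers

# Wall-clock seconds spent on `iterations` builds
elapsed = Benchmark.realtime { iterations.times { build_html_string } }
puts format('String#<< build: %.1f builds/s', iterations / elapsed)
```

Passing the result to doc.fragment, as create_fragment_by_parsing does, then covers the full "build HTML + parse" path.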

lexborisov commented 1 year ago

@zyc9012

In C, the situation is the opposite: it is much faster to create a tree manually than to parse it. A lot of the time here is spent processing the bindings for the C functions.

joeldrapper commented 1 year ago

I wondered if there might be a way for Phlex to create the primitive data structures for the tree directly, without going through the normal DOM manipulation API. Phlex templates are interpreted like a tree already, so there’s no need for tokenising or parsing steps.

zyc9012 commented 1 year ago

> I wondered if there might be a way for Phlex to create the primitive data structures for the tree directly, without going through the normal DOM manipulation API. Phlex templates are interpreted like a tree already, so there’s no need for tokenising or parsing steps.

Ideally, you would make only one call to get the desired Nokolexbor::DocumentFragment. For that, I'm afraid you would need to customize the C part: pass the Phlex structure into the C extension, iterate over it in C, and create the corresponding Lexbor structures. But the iteration would still call the Ruby C API, so I'm not sure how it would perform.
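For comparison, this is roughly what the same walk looks like when done from Ruby. It is only a sketch: the [tag, text, attributes, children] tuple is a hypothetical shape chosen for illustration, not how Phlex actually represents templates, and build_node is a made-up helper name. It uses only the create_element and << calls shown earlier in this thread, so every node costs at least two calls across the binding. Those are the calls a C-side iteration would try to avoid.

```ruby
# Hypothetical node shape: [tag, text, attributes, children], where
# children is an array of nodes of the same shape.
# `doc` is expected to be a Nokolexbor::Document (require 'nokolexbor'
# before passing a real one); only create_element and << are called.
def build_node(doc, node)
  tag, text, attrs, children = node
  element = doc.create_element(tag, text, attrs)
  # Recurse depth-first, appending each child element to its parent
  children.each { |child| element << build_node(doc, child) }
  element
end

# Same structure as the first example in this thread
tree = ['div', 'Content', { class: 'a b c', style: 'e f g' }, [
  ['a', 'Link text', { href: 'https://www.google.com' }, []],
  ['span', 'xxx', { class: 'some_class' }, []]
]]
```

With a real document, the result would be appended as in the earlier example: frag << build_node(doc, tree).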

joeldrapper commented 1 year ago

That makes sense. It sounds like this is not the low-hanging fruit I thought it could be.

Thanks for your help. Nokolexbor is awesome.