serpapi / nokolexbor

High-performance HTML5 parser for Ruby based on Lexbor, with support for both CSS selectors and XPath.

Nokolexbor

Nokolexbor is a drop-in replacement for Nokogiri. In the benchmarks below, it is 5.2x faster at parsing HTML and up to 997x faster at running CSS selectors.

It's a performance-focused HTML5 parser for Ruby based on Lexbor. It supports both CSS selectors and XPath. Nokolexbor's API is designed to be as close to 1:1 compatible with Nokogiri's API as possible.

Requirements

Nokolexbor ships with pre-compiled gems for most common platforms.

If you are on a supported platform, jump straight to the Installation section. Otherwise, you need to install CMake to compile the C extension:

macOS

brew install cmake

Linux (Debian, Ubuntu, etc.)

sudo apt-get install cmake

Installation

Add to your Gemfile:

gem 'nokolexbor'

Then, run bundle install.

Or, install the gem directly:

gem install nokolexbor

Quick start

require 'nokolexbor'
require 'open-uri'

# Parse HTML document
doc = Nokolexbor::HTML(URI.open('https://github.com/serpapi/nokolexbor'))

# Search for nodes by css
doc.css('#readme h1', 'article h2', 'p[dir=auto]').each do |node|
  puts node.content
end

# Search for text nodes by css
doc.css('#readme p > ::text').each do |text|
  puts text.content
end

# Search for nodes by xpath
doc.xpath('//div[@id="readme"]//h1', '//article//h2').each do |node|
  puts node.content
end

Features

Searching methods overview

Different behaviors from Nokogiri

Benchmarks

The benchmarks parse a Google result page (368 KB) and select nodes using CSS and XPath. They were run on a MacBook Pro (2019) with a 2.3 GHz 8-core Intel Core i9.

Run with: ruby bench/bench.rb

| | Nokolexbor (iters/s) | Nokogiri (iters/s) | Diff |
|---|---|---|---|
| parsing | 487.6 | 93.5 | 5.22x faster |
| at_css | 50798.8 | 50.9 | 997.87x faster |
| css | 7437.6 | 52.3 | 142.11x faster |
| at_xpath | 57.077 | 53.176 | same-ish |
| xpath | 51.523 | 58.438 | same-ish |

Raw data

```
Warming up --------------------------------------
    Nokolexbor parse    56.000  i/100ms
      Nokogiri parse     8.000  i/100ms
Calculating -------------------------------------
    Nokolexbor parse    487.564  (±10.9%) i/s -      9.688k in  20.117173s
      Nokogiri parse     93.470  (±21.4%) i/s -      1.736k in  20.024163s

Comparison:
    Nokolexbor parse:      487.6 i/s
      Nokogiri parse:       93.5 i/s - 5.22x  (± 0.00) slower

Warming up --------------------------------------
   Nokolexbor at_css     5.548k i/100ms
     Nokogiri at_css     6.000  i/100ms
Calculating -------------------------------------
   Nokolexbor at_css     50.799k (±13.8%) i/s -    987.544k in  20.018481s
     Nokogiri at_css     50.907  (±35.4%) i/s -    828.000  in  20.666258s

Comparison:
   Nokolexbor at_css:    50798.8 i/s
     Nokogiri at_css:       50.9 i/s - 997.87x  (± 0.00) slower

Warming up --------------------------------------
      Nokolexbor css   709.000  i/100ms
        Nokogiri css     4.000  i/100ms
Calculating -------------------------------------
      Nokolexbor css      7.438k (±14.7%) i/s -    145.345k in  20.083833s
        Nokogiri css     52.338  (±36.3%) i/s -    816.000  in  20.042053s

Comparison:
      Nokolexbor css:     7437.6 i/s
        Nokogiri css:       52.3 i/s - 142.11x  (± 0.00) slower

Warming up --------------------------------------
 Nokolexbor at_xpath     2.000  i/100ms
   Nokogiri at_xpath     4.000  i/100ms
Calculating -------------------------------------
 Nokolexbor at_xpath     57.077  (±31.5%) i/s -    920.000  in  20.156393s
   Nokogiri at_xpath     53.176  (±35.7%) i/s -    876.000  in  20.036717s

Comparison:
 Nokolexbor at_xpath:       57.1 i/s
   Nokogiri at_xpath:       53.2 i/s - same-ish: difference falls within error

Warming up --------------------------------------
    Nokolexbor xpath     3.000  i/100ms
      Nokogiri xpath     3.000  i/100ms
Calculating -------------------------------------
    Nokolexbor xpath     51.523  (±31.1%) i/s -    903.000  in  20.102568s
      Nokogiri xpath     58.438  (±35.9%) i/s -    852.000  in  20.001408s

Comparison:
      Nokogiri xpath:       58.4 i/s
    Nokolexbor xpath:       51.5 i/s - same-ish: difference falls within error
```
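
The Diff figures are simply the ratio of the two iterations-per-second values. As a small sketch of that arithmetic, the snippet below recomputes the three reported speedups from the raw numbers above. The `RESULTS` hash and the `speedup` helper are illustrative names, not part of `bench/bench.rb`. The XPath rows are left out because the raw output reports their difference as falling within error.

```ruby
# Iterations per second, copied from the raw benchmark output above.
RESULTS = {
  'parsing' => { nokolexbor: 487.564,  nokogiri: 93.470 },
  'at_css'  => { nokolexbor: 50_798.8, nokogiri: 50.907 },
  'css'     => { nokolexbor: 7_437.6,  nokogiri: 52.338 }
}.freeze

# Speedup is the ratio of the faster throughput to the slower one.
def speedup(fast_ips, slow_ips)
  fast_ips / slow_ips
end

RESULTS.each do |name, ips|
  ratio = speedup(ips[:nokolexbor], ips[:nokogiri])
  puts format('%-8s %.2fx faster', name, ratio)
end
```

Note that the error margins in the raw output are wide (up to about ±36% for Nokogiri), so the ratios are best read as orders of magnitude rather than precise constants.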