msva / lua-htmlparser

An HTML parser for lua.

Performance is not that great... ? #66

Closed: ehsanghorbani190 closed this issue 7 months ago

ehsanghorbani190 commented 7 months ago

Hi! We want to write a script that will crawl at least 1000 HTML pages, run some selectors on their nodes, and check some conditions. I wanted to write this script in Lua, but it seems this package is much, much slower at HTML parsing than node-html-parser in JS!

JS example:

const HTMLParser = require('node-html-parser');
const url = 'https://neshan.org/maps/bazaar/ahwaz-bazar-imam-khomeini-clothing';
fetch(url)
    .then(data => data.text())
    .then(data => {
        const start = performance.now();
        for (let i = 0; i < 1000; i++) {
            const body = HTMLParser.parse(data);
            const wrapper = body.querySelector('.poi-box-wrapper');
            const boxes = wrapper.querySelectorAll('.poi-box');
        }
        const total_time = performance.now() - start;
        console.log(`it took ${total_time.toFixed(5)}ms in total, making it ${(total_time/1000).toFixed(5)}ms for average`);
    });

results using node v21.6.1:

it took 2650.08149ms in total, making it 10.60033ms for average

Lua example:

local http = require('ssl.https')
local htmlparser = require("htmlparser")
local url = 'https://neshan.org/maps/bazaar/ahwaz-bazar-imam-khomeini-clothing'
local body = http.request(url)
local start = os.clock()
for i=1,1000 do
    local root = htmlparser.parse(body)
    local wrapper = root('.poi-box-wrapper')
    local subs = wrapper[1]('.poi-box')
end
local elapsed_time = (os.clock() - start)*1000
print('it took '..elapsed_time..'ms in total, making it '.. elapsed_time/1000 .. 'ms for average')

results using Lua v5.1.5:

it took 41647.347ms in total, making it 41.647347ms for average
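One way to narrow down where the Lua time goes is to time parsing and selecting separately. Here is a sketch against the same lua-htmlparser calls used in the benchmark above (it assumes `body` holds the fetched page, as in the example):

    -- Hypothetical diagnostic: time parse and select phases separately
    -- to see which one dominates the ~41ms per iteration.
    local htmlparser = require("htmlparser")

    local parse_start = os.clock()
    local root
    for i = 1, 1000 do
        root = htmlparser.parse(body)
    end
    local parse_ms = (os.clock() - parse_start) * 1000

    local select_start = os.clock()
    for i = 1, 1000 do
        local wrapper = root('.poi-box-wrapper')
        local subs = wrapper[1]('.poi-box')
    end
    local select_ms = (os.clock() - select_start) * 1000

    print(('parse: %.3fms/page, select: %.3fms/page'):format(parse_ms / 1000, select_ms / 1000))

If parsing dominates, a real crawl (which parses each page only once) would pay that cost once per page regardless of how many selectors run against it.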

That's about 16 times slower! Am I doing something wrong? Is my example OK?

UPDATE: I tested it with LuaJIT v2.1.1706185428, and the results are better than with plain Lua, but still not better than JS:

it took 23525.743ms in total, making it 23.525743ms for average
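One caveat about the methodology: Lua's os.clock() returns CPU time, while performance.now() in Node returns wall-clock time. For a CPU-bound loop like this the two should be close, but measuring wall time on both sides makes the numbers directly comparable. A sketch using socket.gettime() from LuaSocket, which ssl.https already pulls in:

    -- Sketch: wall-clock timing helper (assumes LuaSocket is available,
    -- which it is here since ssl.https depends on it).
    local socket = require("socket")

    local function time_ms(fn, n)
        local start = socket.gettime()   -- wall-clock seconds, sub-ms precision
        for i = 1, n do fn() end
        return (socket.gettime() - start) * 1000
    end

Usage would be something like `time_ms(function() htmlparser.parse(body) end, 1000)`, mirroring the loop in the original benchmark.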
msva commented 7 months ago

The purpose of https://github.com/taoqf/node-html-parser is to be the "fastest parser at any cost" (including at the cost of resource consumption and correctness, by the way). The purpose of this project is to be a dependency-free, pure-Lua parser implementation.

There are other projects that are not bound by those restrictions and run much faster (with other trade-offs). For example, there is a library called Gumbo. It is worth mentioning that Gumbo does not focus on speed either (it focuses on correctness, IIRC), but since it is a C library with Lua bindings rather than pure Lua, it is without doubt faster than this project.
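For a rough sense of what switching would look like, here is a sketch of the same selection using the craigbarnes/lua-gumbo binding. This is an assumption to be checked against its documentation: lua-gumbo exposes a DOM-style API (getElementsByClassName and friends) rather than CSS selector calls, so the class queries from the benchmark are expressed that way:

    -- Hypothetical sketch using lua-gumbo; API names are from its
    -- DOM-style interface and should be verified against its docs.
    local gumbo = require("gumbo")

    local function select_boxes(body)
        local document = gumbo.parse(body)
        -- DOM-style lookups instead of CSS selectors:
        local wrapper = document:getElementsByClassName("poi-box-wrapper")[1]
        if not wrapper then return {} end
        return wrapper:getElementsByClassName("poi-box")
    end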

So, TL;DR: different use cases have different requirements, and there are different tools to meet those requirements.