Open OEvgeny opened 8 months ago
Also look at this project: https://github.com/GerHobbelt/gumbo-parser
Thanks. For future travelers looking into this: there is the gumbo-parser fork from the comment above, and there is also an updated Lua-specific version of the original google/gumbo-parser repo, developed independently on GitLab: lua-gumbo/-/tree/master
From another issue discussion, it seems the only way to use the Lua-specific version would be to compile it into the redbean binary.
I've used gumbo for years now via Lua successfully. If you aren't aware, there's already a luarocks package https://github.com/craigbarnes/lua-gumbo
luarocks install gumbo
It's pretty battle-tested by Google and quick to compile, and while it might not be the fastest, it's fairly reliable and full-featured. It lets you modify the HTML nodes, which is useful for rewriting URLs in mirrored webpage archives.
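To make the URL rewriting concrete, here is a minimal sketch rather than anything taken from a real project. `rewrite_links` and the example URLs are made up for illustration, and it assumes `document.links`, `getAttribute` and `setAttribute` behave as described in the lua-gumbo documentation. The part that calls into gumbo only runs when the rock is installed.

```lua
-- Rewrite every link whose href starts with `from` so that it starts with
-- `to` instead. Returns the number of links changed.
-- `rewrite_links` is a hypothetical helper name, not part of lua-gumbo.
local function rewrite_links(document, from, to)
    local count = 0
    for _, element in ipairs(document.links) do
        local href = element:getAttribute("href")
        -- Plain prefix comparison, so no pattern escaping is needed.
        if href and href:sub(1, #from) == from then
            element:setAttribute("href", to .. href:sub(#from + 1))
            count = count + 1
        end
    end
    return count
end

-- Demo: skipped silently when lua-gumbo is not installed.
local ok, gumbo = pcall(require, "gumbo")
if ok then
    local document = assert(gumbo.parse(
        '<a href="https://example.com/page">page</a>'))
    rewrite_links(document, "https://example.com/", "/mirror/")
    for _, element in ipairs(document.links) do
        print(element:getAttribute("href"))
    end
end
```

The prefix check uses `string.sub` instead of `string.find` or `gsub` so that characters like `.` and `-` in the URL are not treated as Lua pattern magic.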
Here are some basic utilities that get what I want done 90% of the time.
local t_insert = table.insert

-- Collect a node's attributes into a name -> value table.
-- A name that appears more than once maps to a list of its values.
local function attrs_to_dict(node)
    if not node.attributes then
        return {}
    end
    local attrs = {}
    for _, x in ipairs(node.attributes) do
        if attrs[x.name] then
            if type(attrs[x.name]) ~= 'table' then
                attrs[x.name] = {attrs[x.name]}
            end
            t_insert(attrs[x.name], x.value)
        else
            attrs[x.name] = x.value
        end
    end
    return attrs
end

-- Visit node and its descendants in pre-order. Stops early and returns
-- the first truthy value returned by fn.
-- fn: function(tag, attrs, data, children, node)
local function preorder_html(node, fn)
    assert(node, "preorder called on nil node")
    local attrs = attrs_to_dict(node)
    local result = fn(node.tagName, attrs, node.data, node.childNodes, node)
    if result then return result end
    for _, c in ipairs(node.childNodes or {}) do
        result = preorder_html(c, fn)
        if result then
            return result
        end
    end
end

-- Collect the data of every node that has any, in document order.
local function html_text(node)
    local texts = {}
    preorder_html(node, function(_, _, text)
        texts[#texts+1] = text
    end)
    return texts
end
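Since html_text returns a list of fragments rather than one string, a small follow-up step is usually needed before the text can be used. The sketch below shows one way to do that; `join_texts` is a hypothetical helper name, and collapsing all whitespace runs to single spaces is an assumption about what is wanted, not something the snippets above require.

```lua
-- Join text fragments into one string and normalize whitespace.
-- `join_texts` is a made-up name for this sketch.
local function join_texts(texts)
    local joined = table.concat(texts, " ")
    -- gsub returns (string, count); the assignment keeps only the string.
    joined = joined:gsub("%s+", " ")
    -- After collapsing, at most one space can remain at either end.
    joined = joined:gsub("^ ", ""):gsub(" $", "")
    return joined
end

print(join_texts({ "Hello", "\n  world ", "!" }))
```

It would typically be called as join_texts(html_text(node)).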
Hello community,
I needed an HTML parser. I tried the one available for Lua, msva/lua-htmlparser, but unfortunately it did not work for me: it got stuck in a circular loop.
After exploring alternatives, I found the lexbor HTML parser, which is written in C and has no external dependencies. I have managed to build a static version of it using the Cosmopolitan C Library.
For building, the following commands were used:
This builds the static library and the examples successfully:
For now, I plan to build a simple utility that parses the HTML files I need, and to run it from redbean with unix.execve. I wonder whether there are any other ways to use binary dependencies without rebuilding redbean to include the library.
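For reference, the unix.execve approach described above could look roughly like the sketch below. It is untested and only meaningful inside redbean, where the global `unix` module exists. `run_tool` and the tool path in the usage comment are hypothetical, and the calls (`unix.pipe`, `unix.fork`, `unix.dup`, `unix.execve`, `unix.read`, `unix.wait`) are used the way I understand them from the redbean documentation.

```lua
-- Run an external program and return everything it writes to stdout.
-- Assumes redbean's `unix` module; `run_tool` is a made-up helper name.
local function run_tool(prog, args)
    local reader, writer = assert(unix.pipe())
    local pid = assert(unix.fork())
    if pid == 0 then
        -- Child: point stdout at the pipe, then replace the process image.
        unix.close(reader)
        unix.dup(writer, 1)
        unix.close(writer)
        unix.execve(prog, args, unix.environ())
        unix.exit(127) -- only reached if execve failed
    end
    -- Parent: read until end of file, which unix.read reports as "".
    unix.close(writer)
    local chunks = {}
    while true do
        local data, err = unix.read(reader)
        if data == "" then
            break
        elseif data then
            chunks[#chunks + 1] = data
        elseif err:errno() ~= unix.EINTR then
            break -- give up on any error other than an interrupted read
        end
    end
    unix.close(reader)
    unix.wait(pid) -- reap the child so it does not linger as a zombie
    return table.concat(chunks)
end

-- Hypothetical usage inside a redbean handler (args[1] is argv[0]):
-- local out = run_tool("/path/to/html-tool", { "html-tool", "page.html" })
```

Passing the file name as an argument keeps the sketch short; feeding HTML to the tool over stdin would need a second pipe.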
Thanks and best regards!