jart / cosmopolitan

build-once run-anywhere c library
ISC License

HTML parser to be used with redbean #1055

Open OEvgeny opened 8 months ago

OEvgeny commented 8 months ago

Hello community,

I need an HTML parser. I tried the Lua HTML parser msva/lua-htmlparser, but unfortunately it did not work for me: it got stuck in a circular loop.

After exploring alternatives, I found the lexbor HTML parser which is written in C and doesn't have external dependencies. I have successfully managed to build a static version of it using the Cosmopolitan C Library.

I used the following commands to build it:

cmake . -DLEXBOR_BUILD_TESTS=ON -DLEXBOR_BUILD_EXAMPLES=ON -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -D CMAKE_C_COMPILER=x86_64-unknown-cosmo-cc -DLEXBOR_C_FLAGS="-Wall -pedantic -std=c99" -D CMAKE_FIND_LIBRARY_SUFFIXES=".a" -D BUILD_SHARED_LIBS=OFF -D CMAKE_EXE_LINKER_FLAGS="-static" -DLEXBOR_BUILD_SHARED=OFF
make

This builds the static library and the examples successfully:

> du -h ./lexbor/liblexbor_static.a 
1.7M    ./lexbor/liblexbor_static.a

> du -h ./lexbor/examples/lexbor/html/document_parse
785K    ./lexbor/examples/lexbor/html/document_parse

For now, I plan to build a simple utility that parses the HTML files I need, and to run it from redbean with unix.execve.
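As a rough sketch of that plan (not a tested integration), the helper below forks, execs an external program, and collects its standard output through a pipe. It assumes it runs inside redbean, where the `unix` module is available as a global. The name `run_and_capture` and the binary path in the usage comment are hypothetical, and error handling is kept minimal.

```lua
-- Sketch: run an external parser binary from redbean and capture its stdout.
-- Assumes redbean's global `unix` module. `prog` must be a path to the
-- executable; by convention `args[1]` repeats that path as argv[0].
local function run_and_capture(prog, args)
  local reader, writer = assert(unix.pipe())
  local pid = assert(unix.fork())
  if pid == 0 then
    -- Child: point stdout (fd 1) at the pipe's write end, then exec.
    unix.dup(writer, 1)
    unix.close(writer)
    unix.close(reader)
    unix.execve(prog, args)
    unix.exit(127) -- only reached if execve failed
  end
  -- Parent: read until EOF (an empty string), then reap the child.
  unix.close(writer)
  local chunks = {}
  while true do
    local data = unix.read(reader)
    -- A nil result means an error; a sturdier version would retry on EINTR.
    if not data or data == '' then break end
    chunks[#chunks + 1] = data
  end
  unix.close(reader)
  unix.wait(pid)
  return table.concat(chunks)
end

-- Hypothetical usage inside a redbean handler:
--   local out = run_and_capture('/path/to/html-tool',
--                               {'/path/to/html-tool', 'page.html'})
```

The trade-off is one process spawn per parse, but redbean itself stays unmodified.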

I wonder whether there are other ways to use binary dependencies without rebuilding redbean to include the library.

Thanks and best regards!

mingodad commented 8 months ago

Also look at this project https://github.com/GerHobbelt/gumbo-parser .

OEvgeny commented 8 months ago

Thanks. For future travelers looking into this: there is the gumbo-parser fork from the comment above, and there is also an updated Lua-specific version of the original google/gumbo-parser repo, developed independently on GitLab: lua-gumbo/-/tree/master

Judging by the discussion in another issue, the only way to use the Lua-specific version seems to be compiling it into the redbean binary.

norcalli commented 6 days ago

I've used gumbo successfully via Lua for years now. If you aren't aware, there's already a LuaRocks package, https://github.com/craigbarnes/lua-gumbo, which you can install with: luarocks install gumbo

It's pretty battle-tested by Google, quick to compile, and while it might not be the fastest, it's fairly reliable and full-featured. It lets you modify the HTML nodes, which is useful for rewriting URLs in mirrored webpage archives.

Here are some basic utilities that get what I want done 90% of the time.

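-- Convert a node's attribute list into a name -> value table.
-- A name that occurs more than once maps to a list of its values.
local function attrs_to_dict(node)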
  if not node.attributes then
    return {}
  end
  local attrs = {}
  for _, x in ipairs(node.attributes) do
    if attrs[x.name] then
      if type(attrs[x.name]) ~= 'table' then
        attrs[x.name] = {attrs[x.name]}
      end
      table.insert(attrs[x.name], x.value)
    else
      attrs[x.name] = x.value
    end
  end
  return attrs
end

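-- Pre-order traversal. fn is called as fn(tag, attrs, data, children, node)
-- for each node; the first truthy value it returns stops the walk and is returned.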
local function preorder_html(node, fn)
  assert(node, "preorder called on nil node")
  local attrs = attrs_to_dict(node)
  local result = fn(node.tagName, attrs, node.data, node.childNodes, node)
  if result then return result end
  for _, c in ipairs(node.childNodes or {}) do
    result = preorder_html(c, fn)
    if result then
      return result
    end
  end
end

-- Collect the data of every node that has one (e.g. text nodes), in document order.
local function html_text(node)
  local texts = {}
  preorder_html(node, function(_, _, text)
    texts[#texts+1] = text -- no-op when text is nil
  end)
  return texts
end
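
To illustrate the URL-rewriting use case mentioned above, here is a minimal sketch of a tree walk that rewrites `href` attributes in place. It assumes element nodes expose `childNodes`, `getAttribute`, and `setAttribute`, as lua-gumbo elements do, and that nodes without attributes (text, comments, the document) simply lack those methods. The name `rewrite_hrefs` and the rewrite rule in the usage comment are hypothetical.

```lua
-- Walk a DOM-like tree and pass every href through `rewrite(url)`.
-- Nodes without a getAttribute method are left alone, but their
-- children (if any) are still visited.
local function rewrite_hrefs(node, rewrite)
  if node.getAttribute then
    local href = node:getAttribute("href")
    if href then
      node:setAttribute("href", rewrite(href))
    end
  end
  for _, child in ipairs(node.childNodes or {}) do
    rewrite_hrefs(child, rewrite)
  end
end

-- Hypothetical usage with lua-gumbo:
--   local gumbo = require "gumbo"
--   local document = assert(gumbo.parse(html))
--   rewrite_hrefs(document, function(url) return "/mirror/" .. url end)
```

Because the walk only relies on those three fields, it can be tried on plain tables that mimic the node shape before pointing it at a parsed document.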