Hext is a domain-specific language for extracting structured data from HTML documents.
Hext is written in C++ but language bindings are available for Python, Node, JavaScript, Ruby and PHP.
See https://hext.thomastrapp.com for documentation, installation instructions and a live demo.
The Hext project is released under the terms of the Apache License v2.0.
Suppose you want to extract all hyperlinks from a web page. Hyperlinks have an
anchor tag <a>, an attribute called href and a text that visitors can
click. The following Hext template will produce a dictionary for every matched
element. Each dictionary will contain the keys link and title, which refer
to the href attribute and the text content of the matched <a>.
# Extract links and their text
<a href:link @text:title />
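To make the shape of that result concrete, here is a minimal sketch in Python that mimics what the template describes. It is a stand-in built on the standard library's html.parser, not Hext itself and not the Hext Python binding. The names LinkCollector and extract_links, the sample HTML, and the whitespace handling are assumptions made for illustration only; Hext's real matching rules are richer than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {"link": href, "title": text} for every <a> that has an href.

    Hypothetical stand-in for the template `<a href:link @text:title />`.
    """

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # anchors without href do not match
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Assumption: collapse runs of whitespace in the text content.
            title = " ".join("".join(self._text).split())
            self.results.append({"link": self._href, "title": title})
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.results


sample = '<p><a href="/docs">Read the <b>docs</b></a> or <a name="x">skip</a></p>'
links = extract_links(sample)
print(links)
```

The second anchor in the sample has no href attribute, so it yields no dictionary, mirroring the template's requirement that a matched <a> carries an href.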
Visit Hext's project page to learn more about Hext. For examples that use the
libhext C++ library, check out /libhext/examples and libhext's C++ library
overview.
htmlext
: Command line utility that applies Hext templates to an HTML document
and produces JSON.

libhext
: C++ library that contains a Hext parser but also allows for
customization.

libhext-test
: Unit tests for libhext.

Hext bindings
: Bindings for scripting languages. There are extensions for
Node.js, Python, Ruby and PHP that are able to parse Hext and extract values
from HTML.

├── build           Build directory for htmlext
├── cmake           CMake modules used by the project
├── htmlext         Source for the htmlext command line tool
├── libhext         The libhext project
│   ├── bindings    Hext bindings for scripting languages
│   ├── build       Build directory for libhext
│   ├── doc         Doxygen documentation for libhext
│   ├── examples    Examples making use of libhext
│   ├── include     Public libhext API
│   ├── ragel       Ragel input files
│   ├── scripts     Helper scripts for libhext
│   ├── src         libhext implementation files
│   └── test        The libhext-test project
│       ├── build   Build directory for libhext-test
│       └── src     Source for libhext-test
├── man             Htmlext man page
├── scripts         Scripts for building and testing releases
├── syntaxhl        Syntax highlighters for Vim and ACE
└── test            Blackbox tests for htmlext

There are unit tests for libhext and blackbox tests for Hext as a language,
whose main purpose is to detect unwanted changes in syntax or behavior.
The libhext-test project is located in /libhext/test and depends on Google
Test. Nothing fancy, just build the project and run the executable
libhext-test. How to write test cases with Google Test is described in the
Google Test documentation.
The blackbox tests are located in /test. There you'll find a shell script
called blackbox.sh. This script applies Hext templates to HTML documents and
compares the result to a third file that contains the expected output. For
example, there is a test case icase-quoted-regex that consists of three files:
icase-quoted-regex.hext, icase-quoted-regex.html, and
icase-quoted-regex.expected. To run this test case you would do the following:
$ ./blackbox.sh case/icase-quoted-regex.hext
blackbox.sh will then look for the corresponding .html and .expected files
of the same name in the directory of icase-quoted-regex.hext. Then it will
invoke htmlext with the given Hext template and HTML document and compare the
result to icase-quoted-regex.expected. To run all blackbox tests in
succession:
$ ./blackbox.sh case/*.hext
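The per-case procedure described above can be sketched in Python. This is not the code of blackbox.sh, only an illustration of the file-naming convention and the comparison step. The function run_case, the injected run_htmlext callable, and the demo files are hypothetical; how htmlext is actually invoked is deliberately left to the caller, since the command line is not described here.

```python
import tempfile
from pathlib import Path


def run_case(hext_path, run_htmlext):
    """Return True if the output for one test case equals its .expected file.

    run_htmlext is a caller-supplied function (hext_file, html_file) -> str.
    It stands in for the real htmlext invocation, which this sketch omits.
    """
    hext_file = Path(hext_path)
    # The .html and .expected files share the template's name and directory.
    html_file = hext_file.with_suffix(".html")
    expected_file = hext_file.with_suffix(".expected")
    actual = run_htmlext(hext_file, html_file)
    return actual == expected_file.read_text()


# Demo with made-up files and fake runners, so no htmlext binary is needed.
with tempfile.TemporaryDirectory() as tmp:
    case = Path(tmp) / "demo.hext"
    case.write_text("<a href:link />\n")
    case.with_suffix(".html").write_text('<a href="/x">x</a>\n')
    case.with_suffix(".expected").write_text('{"link":"/x"}\n')

    passed_good = run_case(case, lambda hext, html: '{"link":"/x"}\n')
    passed_bad = run_case(case, lambda hext, html: "{}\n")

print(passed_good, passed_bad)
```

Passing the runner in as a parameter keeps the comparison logic independent of the binary, which is the same separation the HTMLEXT variable provides for the shell script.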
By default blackbox.sh will look for the htmlext binary in $PATH. Failing
that, it looks for the binary in the default build directory. You can tell
blackbox.sh which command to use by setting HTMLEXT. For example, to run all
tests through valgrind you'd run the following:
$ HTMLEXT="valgrind -q ../build/htmlext" ./blackbox.sh case/*.hext
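The lookup order can be illustrated with a short Python sketch. It is an interpretation of the description above, not a transcription of blackbox.sh: the function name resolve_htmlext is hypothetical, the fallback path ../build/htmlext is assumed from the valgrind example, and giving HTMLEXT precedence over $PATH is likewise an assumption.

```python
import shlex
import shutil


def resolve_htmlext(env, which=shutil.which, fallback="../build/htmlext"):
    """Return the htmlext command as an argument list.

    Assumed order: HTMLEXT override, then $PATH, then the build directory.
    """
    override = env.get("HTMLEXT")
    if override:
        # The override may carry a wrapper and flags, e.g. "valgrind -q ...".
        return shlex.split(override)
    found = which("htmlext")
    if found:
        return [found]
    return [fallback]  # assumed default build location


with_override = resolve_htmlext({"HTMLEXT": "valgrind -q ../build/htmlext"})
without_path = resolve_htmlext({}, which=lambda name: None)
print(with_override)
print(without_path)
```

Splitting the override with shlex keeps a wrapper such as valgrind and its flags as separate arguments, so the result can be passed directly to a process launcher.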
hext::Html. It's fast, easy to integrate and even fixes invalid HTML.
/libhext/ragel/hext-machine.rl.
htmlext command line utility.
htmlext into jq lets you do all sorts of crazy things.
/syntaxhl/ace. Also, there's a script in /libhext/scripts/syntax-hl-ace that
uses Ace to transform a code template into highlighted HTML.