YAEP -- standalone Earley parser library
- YAEP is an abbreviation of Yet Another Earley Parser.
- This standalone library is created for convenience.
- The parser development is actually done as a part of the Dino language
project.
- YAEP is licensed under the MIT license.
YAEP features:
- It is sufficiently fast and does not require much memory.
This is the fastest implementation of the Earley parser which I
know of. If you know a faster one, please send me a message. It can parse
300K lines of C program per second on modern computers
and allocates about 5MB memory for 10K line C program.
- YAEP does simple syntax directed translation, producing an abstract
syntax tree as its output.
- It can parse input described by an ambiguous grammar. In
this case the parse result can be a single abstract tree or all
possible abstract trees. YAEP produces a compact
representation of all possible parse trees by using DAG instead
of real trees.
- YAEP can parse input described by an ambiguous grammar
according to abstract node costs. In this case the parse
result can be a minimal cost abstract tree or all possible
minimal cost abstract trees. This feature can be used to code
selection task in compilers.
- It can perform syntax error recovery. Moreover its error
recovery algorithm finds error recovery with a minimal number of
ignored tokens. This permits implementing parsers with very good
error recovery and reporting.
- It has fast startup. There is only a tiny and insignificant delay
between processing grammar and the start of parsing.
- A grammar for YAEP can be constructed through function calls or using
a YACC-like description syntax.
Usage example:
- The following is a small example of how to use YAEP to parse expressions.
We have omitted the functions
read_token
, syntax_error_func
,
and parse_alloc_func
which are needed to provide tokens, print syntax
error messages, and allocate memory for the parser.
static const char *description =
"\n"
"TERM NUMBER;\n"
"E : T # 0\n"
" | E '+' T # plus (0 2)\n"
" ;\n"
"T : F # 0\n"
" | T '*' F # mult (0 2)\n"
" ;\n"
"F : NUMBER # 0\n"
" | '(' E ')' # 1\n"
" ;\n"
;
static void parse (void)
{
struct grammar *g;
struct earley_tree_node *root;
int ambiguous_p;
if ((g = earley_create_grammar ()) == NULL) {
fprintf (stderr, "earley_create_grammar: No memory\n");
exit (1);
}
if (earley_parse_grammar (g, TRUE, description) != 0) {
fprintf (stderr, "%s\n", earley_error_message (g));
exit (1);
}
if (earley_parse (g, read_token_func, syntax_error_func, parse_alloc_func,
NULL, &root, &ambiguous_p))
fprintf (stderr, "earley_parse: %s\n", earley_error_message (g));
earley_free_grammar (g);
}
Installing:
mkdir build
cd build
<srcdir>/configure --srcdir=<srcdir> --prefix=<prefix for install dirs>
or cmake -DCMAKE_BUILD_TYPE=Release
(make sure you have CMake installed)
make
make test
(optional)
make install
Speed comparison of YACC, MARPA, YAEP, and GCC parsers:
-
Tested parsers:
- YACC 1.9 from Linux Fedora Core 21.
- MARPA C Library, version 8.3.0. A popular Earley parser implementation
using the Practical Earley Parser algorithm and Leo Joop's approach.
- The C parser in GCC-4.9.2.
- YAEP as of Oct. 2015.
-
Grammar:
- The base test grammar is the ANSI C grammar which is mostly
a left recursion grammar.
- For MARPA and YAEP, the grammar is slightly ambiguous as typenames
are represented with the same kind of token as identifiers.
- For the YACC description, typename is a separate token type distinct from
other identifiers. The YACC description does not contain any actions except
for a small number needed to give feedback to the scanner on how to treat
the next identifier (as a typename or regular identifier).
-
Scanning test files for YACC, MARPA, and YAEP:
- We prepare all tokens beforehand in order to exclude scanning time from our benchmark.
- For YACC, at the scanning stage we do not yet distinguish identifiers and typenames.
-
Tests:
- The first test is based on the file
gen.c
from parser-generator MSTA. The file
was concatenated 10 times and the resulting file size was 67K C lines.
- The second test is a pre-release version of gcc-4.0 for i686 with all the source
code combined into one file
(source).
The file size was 635K C lines.
- The C pre-processor was applied to the files.
- Additional preparations were made for YACC, MARPA, and YAEP:
- GCC extensions (mostly attributes and asm) were removed from the
pre-processed files. The removed code is a tiny and insignificant
fraction of the entire code.
- A very small number of identifiers were renamed to avoid confusing the simple
YACC actions to distinguish typenames and identifiers. So the resulting code
is not correct as C code but it is correct from the syntactic point of view.
-
Measurements:
- The result times are elapsed (wall) times.
- Memory requirements are measured by comparing the output of Linux
sbrk
before and
after parsing.
- For GCC, memory was instead measured as max resident memory reported by
/usr/bin/time
.
-
How to reproduce: please use the shell script compare-parsers.tst
from directory src
.
-
Results:
- First file (67K lines). Test machine is i7-2600 (4 x 3.4GHz)
with 8GB memory under FC21.
|
Parse time only |
Overall |
Memory (parse only) MB |
YACC |
0.07 |
0.17 |
20 |
MARPA |
3.48 |
3.77 |
516 |
YAEP |
0.18 |
0.28 |
26 |
- Second file (635K lines). Test machine is 2xE5-2697 (2 x 14 x 2.6GHz)
with 128GB memory under FC21.
|
Parse time only |
Overall |
Memory (parse only) MB |
YACC |
0.25 |
0.55 |
120 |
gcc -fsyntax-only |
- |
1.22 |
194 |
gcc -O0 |
- |
19.37 |
761 |
MARPA |
22.23 |
23.41 |
30310 |
YAEP |
1.43 |
1.68 |
142 |
- Conclusions:
- YAEP without a scanner is up to 20 times faster than Marpa and requires
up to 200 times less memory.
- Still, it is 2.5 - 6 times slower (1.6 - 3 times when
taking the scanner into account) than YACC.
Future directions
- Implement YACC-style description syntax for operator precedence and associativity.
- Implement bindings for popular scripting languages.
- Introduce abstract node codes (instead of string labels) for faster work with abstract trees.
- Permit nested abstract nodes in simple translation.