tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.
MIT License
151 stars 36 forks source link
haskell parser tree-sitter

tree-sitter-haskell

CI discord matrix crates npm pypi

Haskell grammar for tree-sitter.

References

Supported Language Extensions

These extensions are supported ✅, unsupported ❌ or not applicable because they don't involve parsing ➖️:

Bugs

CPP

Preprocessor #elif and #else directives cannot be handled correctly, since the parser state would have to be manually reset to what it was at the #if. As a workaround, the code blocks in the alternative branches are parsed as part of the directives.

Querying

The grammar contains several supertypes, which group multiple other node types under a single name.

Supertype names do not occur as extra nodes in parse trees, but they can be used in queries in special ways:

For example, the query (expression) matches the nodes infix, record, projection, constructor, and the second and third variable in this tree for cats <> Cat {mood = moods.sleepy}:

(infix
  (variable)
  (operator)
  (record
    (constructor)
    (field_update
      (field_name (variable))
      (projection (variable) (field_name (variable)))))))))

The two occurrences of variable in field_name (mood and sleepy) are not expressions, but record field names part of a composite record expression.

Matching variable nodes specifically that are expressions is possible with the second special form. A query for (expression/variable) will match only the other two, cats and moods.

The grammar's supertypes consist of the following sets:

Development

The main driver for generating and testing the parser for this grammar is the tree-sitter CLI. Other components of the project require additional tools, described below.

Some are made available through npm – for example, npx tree-sitter runs the CLI. If you don't have tree-sitter available otherwise, prefix all the commands in the following sections with npx.

Output path

The CLI writes the shared library containing the parser to the directory denoted by $TREE_SITTER_LIBDIR. If that variable is unset, it defaults to $HOME/.cache/tree-sitter/lib.

In order to avoid clobbering this global directory with development versions, you can set the env var to a local path:

export TREE_SITTER_LIBDIR=$PWD/.lib

The grammar

The javascript file grammar.js contains the entry point into the grammar's production rules. Please consult the tree-sitter documentation for a comprehensive introduction to the syntax and semantics.

Parsing starts with the first item in the rules field:

{
  rules: {
    haskell: $ => seq(
      optional($.header),
      optional($._body),
    ),
  }
}

Generating the parser

The first step in the development workflow converts the javascript rule definitions to C code in src/parser.c:

$ tree-sitter generate

Two byproducts of this process are written to src/grammar.json and src/node-types.json.

Compiling the parser

The C code is automatically compiled by most of the test tools mentioned below, but you can instruct tree-sitter to do it in one go:

$ tree-sitter generate --build

If you've set $TREE_SITTER_LIBDIR as mentioned above, the shared object will be written to $PWD/.lib/haskell.so.

Aside from the generated src/parser.c, tree-sitter will also compile and link src/scanner.c into this object. This file contains the external scanner, which is a custom extension of the built-in lexer whose purpose is to handle language constructs that cannot be expressed (efficiently) in the javascript grammar, like Haskell layouts.

WebAssembly

The parser can be compiled to WebAssembly as well, which requires emscripten:

$ tree-sitter build --wasm

The resulting binary is written to $PWD/tree-sitter-haskell.wasm.

Testing the parser

The most fundamental test infrastructure for tree-sitter grammars consists of a set of code snippets with associated reference ASTs stored in ./test/corpus/*.txt.

$ tree-sitter test

Individual tests can be run by specifying (a substring of) their description with -f:

$ tree-sitter test -f 'module: exports empty'

The project contains several other types of tests:

Debugging

The shared library built by tree-sitter test includes debug symbols, so if the scanner segfaults you can just run coredumpctl debug to inspect the backtrace and memory:

newline_lookahead () at src/scanner.c:2583
2583                ((Newline *) 0)->indent = 5;
(gdb) bt
#0  newline_lookahead () at src/scanner.c:2583
#1  0x00007ffff7a0740e in newline_start () at src/scanner.c:2604
#2  scan () at src/scanner.c:2646
#3  eval () at src/scanner.c:2684
#4  tree_sitter_haskell_external_scanner_scan (payload=<optimized out>, lexer=<optimized out>,
    valid_symbols=<optimized out>) at src/scanner.c:2724
#5  0x0000555555772488 in ts_parser.lex ()

For more control, launch gdb tree-sitter and start the process with run test -f 'some test', and set a breakpoint with break tree_sitter_haskell_external_scanner_scan.

To disable optimizations, run tree-sitter test --debug-build.

Tracing

The test and parse commands offer two modes for obtaining detailed information about the parsing process.

With tree-sitter test --debug, every lexer step and shift/reduce action is printed to stderr.

With tree-sitter test --debug-graph, the CLI will generate an HTML file showing a graph representation of every step. This requires graphviz to be installed.