h0tk3y / better-parse

A nice parser combinator library for Kotlin
Apache License 2.0
416 stars 42 forks source link
dsl grammar kotlin language parser parser-combinator syntax-trees

better-parse

Maven Central Gradle build

A nice parser combinator library for Kotlin JVM, JS, and Multiplatform projects

val booleanGrammar = object : Grammar<BooleanExpression>() {
    val id by regexToken("\\w+")
    val not by literalToken("!")
    val and by literalToken("&")
    val or by literalToken("|")
    val ws by regexToken("\\s+", ignore = true)
    val lpar by literalToken("(")
    val rpar by literalToken(")")

    val term by 
        (id use { Variable(text) }) or
        (-not * parser(this::term) map { Not(it) }) or
        (-lpar * parser(this::rootParser) * -rpar)

    val andChain by leftAssociative(term, and) { l, _, r -> And(l, r) }
    override val rootParser by leftAssociative(andChain, or) { l, _, r -> Or(l, r) }
}

val ast = booleanGrammar.parseToEnd("a & !b | b & (!a | c)")

Using with Gradle

dependencies {
   implementation("com.github.h0tk3y.betterParse:better-parse:0.4.4")
}

With multiplatform projects, it's OK to add the dependency just to the commonMain source set, or some other source set if you want it for specific parts of the code.

Tokens

As many other language recognition tools, better-parse abstracts away from raw character input by pre-processing it with a Tokenizer, that can match Tokens (with regular expressions, literal values or arbitrary against an input character sequence.

There are several kinds of supported Tokens:

A Tokenizer tokenizes an input sequence such as InputStream or a String into a Sequence<TokenMatch>, providing each with a position in the input.

One way to create a Tokenizer is to first define the Tokens to be matched:

val id = regexToken("\\w+")
val cm = literalToken(",")
val ws = regexToken("\\s+", ignore = true)

A Token can be ignored by setting its ignore = true. An ignored token can still be matched explicitly, but if another token is expected, the ignored one is just dropped from the sequence.

val tokenizer = DefaultTokenizer(listOf(id, cm, ws))

Note: the tokens order matters in some cases, because the tokenizer tries to match them in exactly this order. For instance, if literalToken("a") is listed before literalToken("aa"), the latter will never be matched. Be careful with keyword tokens! If you match them with regexes, a word boundary \b in the end may help against ambiguity.

val tokenMatches: Sequence<TokenMatch> = tokenizer.tokenize("hello, world")

A more convenient way of defining tokens is described in the Grammar section.

It is possible to provide a custom implementation of a Tokenizer.

Parser

A Parser<T> is an object that accepts an input sequence (TokenMatchesSequence) and tries to convert some (from none to all) of its items into a T. In better-parse, parsers are also the building blocks used to create new parsers by combining them.

When a parser tries to process the input, there are two possible outcomes:

A very basic parser to start with is a Token itself: given an input TokenMatchesSequence and a position in it, it succeeds if the sequence starts with the match of this token itself (possibly, skipping some ignored tokens) and returns that TokenMatch, pointing at the next token with the nextPosition.

val a = regexToken("a+")
val b = regexToken("b+")
val tokenMatches = DefaultTokenizer(listOf(a, b)).tokenize("aabbaaa")
val result = a.tryParse(tokenMatches, 0) // contains the match for "aa" and the next index is 1 for the match of b

Combinators

Simpler parsers can be combined to build a more complex parser, from tokens to terms and to the whole language. There are several kinds of combinators included in better-parse:

Grammar

As a convenient way of defining a grammar of a language, there is an abstract class Grammar, that collects the by-delegated properties into a Tokenizer automatically, and also behaves as a composition of the Tokenizer and the rootParser.

Note: a Grammar also collects by-delegated Parser<T> properties so that they can be accessed as declaredParsers along with the tokens. As a good style, declare the parsers inside a Grammar by delegation as well.

interface Item
class Number(val value: Int) : Item
class Variable(val name: String) : Item

class ItemsParser : Grammar<List<Item>>() {
    val num by regexToken("\\d+")
    val word by regexToken("[A-Za-z]+")
    val comma by regexToken(",\\s+")

    val numParser by num use { Number(text.toInt()) }
    val varParser by word use { Variable(text) }

    override val rootParser by separatedTerms(numParser or varParser, comma)
}

val result: List<Item> = ItemsParser().parseToEnd("one, 2, three, 4, five")

To use a parser that has not been constructed yet, reference it with parser { someParser } or parser(this::someParser):

val term by
    constParser or 
    variableParser or 
    (-lpar and parser(this::term) and -rpar)

A Grammar implementation can override the tokenizer property to provide a custom implementation of Tokenizer.

Syntax trees

A Parser<T> can be converted to another Parser<SyntaxTree<T>>, where a SyntaxTree<T>, along with the parsed T contains the children syntax trees, the reference to the parser and the positions in the input sequence. This can be done with parser.liftToSyntaxTreeParser().

This can be used for syntax highlighting and inspecting the resulting tree in case the parsed result does not contain the full syntactic structure.

For convenience, a Grammar can also be lifted to that parsing a SyntaxTree with grammar.liftToSyntaxTreeGrammar().

val treeGrammar = booleanGrammar.liftToSyntaxTreeGrammar()
val tree = treeGrammar.parseToEnd("a & !b | c -> d")
assertTrue(tree.parser == booleanGrammar.implChain)
val firstChild = tree.children.first()
assertTrue(firstChild.parser == booleanGrammar.orChain)
assertTrue(firstChild.range == 0..9)

There are optional arguments for customizing the transformation:

See SyntaxTreeDemo.kt for an example of working with syntax trees.

Examples

Benchmarks

See the benchmarks repository h0tk3y/better-parse-benchmark and feel free to contribute.