osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License
63 stars 7 forks source link
compilers lexer-generator rust

lexgen: A fully-featured lexer generator, implemented as a proc macro

use lexgen::lexer;
use lexgen_util::Loc;

lexer! {
    // First line specifies name of the lexer and the token type returned by
    // semantic actions
    Lexer -> Token;

    // Regular expressions can be named with `let` syntax
    let init = ['a'-'z'];
    let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

    // Rule sets have names. Each rule set is compiled to a separate DFA.
    // Switching between rule sets is done explicitly in semantic actions.
    rule Init {
        // Rules without a right-hand side for skipping whitespace,
        // comments, etc.
        [' ' '\t' '\n']+,

        // Rule for matching identifiers
        $init $subseq* => |lexer| {
            let token = Token::Id(lexer.match_().to_owned());
            lexer.return_(token)
        },
    }
}

// The token type
#[derive(Debug, PartialEq, Eq)]
enum Token {
    // An identifier
    Id(String),
}

// Generated lexers are initialized with a `&str` for the input
let mut lexer = Lexer::new(" abc123Q-t  z9_9");

// Lexers implement `Iterator<Item=Result<(Loc, T, Loc), LexerError>>`,
// where `T` is the token type specified in the lexer definition (`Token` in
// this case), and `Loc`s indicate line, column, and byte indices of
// beginning and end of the lexemes.
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 1, byte_idx: 1 },
        Token::Id("abc123Q-t".to_owned()),
        Loc { line: 0, col: 10, byte_idx: 10 }
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 12, byte_idx: 12 },
        Token::Id("z9_9".to_owned()),
        Loc { line: 0, col: 16, byte_idx: 16 }
    )))
);
assert_eq!(lexer.next(), None);

See also:

Motivation

Implementing lexing is often (along with parsing) the most tedious part of implementing a language. Lexer generators make this much easier, but in Rust existing lexer generators miss essential features for practical use, and/or require a pre-processing step when building.

My goal with lexgen is to have a feature-complete and easy to use lexer generator.

Usage

lexgen doesn't require a build step. Add same versions of lexgen and lexgen_util as dependencies in your Cargo.toml.

Lexer syntax

lexgen lexers start with the name of the generated lexer struct, optional user state part, and the token type (type of values returned by semantic actions). Example:

lexer! {
    Lexer(LexerState) -> Token;
    ...
}

Here the generated lexer type will be named Lexer. User state type is LexerState (this type should be defined by the user). The token type is Token.

After the lexer name and user state and token types we define the rules:

rule Init {
    ...
}

rule SomeOtherRule {
    ...
}

The first rule set will be defining the initial state of the lexer and needs to be named Init.

In the body of a rule block we define the rules for that lexer state. The syntax for a rule is <regex> => <semantic action>,. Regex syntax is described below. A semantic action is any Rust code with the type fn(LexerHandle) -> LexerAction where LexerHandle and LexerAction are generated names derived from the lexer name (Lexer in our example). More on these types below.

Regular expressions can be named with let <name> = <regex>; syntax. Example:

let init = ['a'-'z'];
let subseq = $init | ['A'-'Z' '0'-'9' '-' '_'];

// Named regexes can be used with the `$` prefix
$init $subseq* => |lexer| { ... }

You can omit the rule Init { ... } part and have all of your rules at the top level if you don't need rule sets.

In summary:

Regex syntax

Regex syntax can be used in right-hand side of let bindings and left-hand side of rules. The syntax is:

Binding powers (precedences), from higher to lower:

You can use parenthesis for grouping, e.g. ('a' | 'b')*.

Example: 'a' 'b' | 'c'+ is the same as (('a' 'b') | ('c'+)).

Right context (lookahead)

A rule in a rule set can be followed by another regex using > <regex> syntax, for right context. Right context is basically a limited form of lookahead: they can only appear after a top-level regex for a rule. They cannot be used nested in a regex.

For example, the rule left-hand side 'a' > (_ # 'b') matches 'a' as long as it's not followed by 'b'.

See also right context tests for more examples.

Built-in regular expressions

lexgen comes with a set of built-in regular expressions. Regular expressions listed below match the same set of characters as their Rust counterparts. For example, $$alphabetic matches the same set of characters as Rust's char::is_alphabetic:

(Note that in the generated code we don't use Rust char methods. For simple cases like $$ascii we generate simple range checks. For more complicated cases like $$lowercase we generate a binary search table and run binary search when checking a character)

In addition, these two built-in regular expressions match Unicode [XID_Start and XID_Continue]:

Rule syntax

End-of-input handling in rule sets

The Init rule set terminates lexing successfully on end-of-input (i.e. lexer.next() returns None). Other rule sets fail on end-of-input (i.e. return Some(Err(...))). This is because generally the states other than the initial one are for complicated tokens (strings, raw strings, multi-line comments) that need to be terminated and handled, and end-of-input in those states usually means the token did not terminate properly.

(To handle end-of-input in a rule set you can use $ as described in section "Regex syntax" above.)

Handle, rule, error, and action types

The lexer macro generates a struct with the name specified by the user in the first line of the lexer definition. In the example at the beginning (Lexer -> Token;), name of the struct is Lexer.

A mut reference to this type is passed to semantic action functions. In the implementation of a semantic action, you should use one of the methods below drive the lexer and return tokens:

Semantic action functions should return a SemanticActionResult value obtained from one of the methods listed above.

Initializing lexers

lexgen generates 4 constructors:

Stateful lexer example

Here's an example lexer that counts number of =s appear between two [s:

lexer! {
    // `usize` in parenthesis is the user state type, `usize` after the arrow
    // is the token type
    Lexer(usize) -> usize;

    rule Init {
        $$ascii_whitespace,                             // line 7

        '[' => |lexer| {
            *lexer.state() = 0;                         // line 10
            lexer.switch(LexerRule::Count)              // line 11
        },
    }

    rule Count {
        '=' => |lexer| {
            *lexer.state() += 1;                        // line 17
            lexer.continue_()                           // line 18
        },

        '[' => |lexer| {
            let n = *lexer.state();
            lexer.switch_and_return(LexerRule::Init, n) // line 23
        },
    }
}

let mut lexer = Lexer::new("[[ [=[ [==[");
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 0, byte_idx: 0 },
        0,
        Loc { line: 0, col: 2, byte_idx: 2 },
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 3, byte_idx: 3 },
        1,
        Loc { line: 0, col: 6, byte_idx: 6 },
    )))
);
assert_eq!(
    lexer.next(),
    Some(Ok((
        Loc { line: 0, col: 7, byte_idx: 7 },
        2,
        Loc { line: 0, col: 11, byte_idx: 11 },
    )))
);
assert_eq!(lexer.next(), None);

Initially (the Init rule set) we skip spaces (line 7). When we see a [ we initialize the user state (line 10) and switch to the Count state (line 11). In Count, each = increments the user state by one (line 17) and skips the match (line 18). A [ in Count state returns the current number and switches to the Init state (line 23).