osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro

Lookahead could be useful #29

Closed osa1 closed 2 years ago

osa1 commented 2 years ago

Suppose I'm writing a Rust lexer. Naively, lexing integer literals could be implemented like this:

$dec_digit ($dec_digit | '_')* $int_suffix? => ...,

"0b" ($bin_digit | '_')* $bin_digit ($bin_digit | '_')* $int_suffix? => ...,

"0o" ($oct_digit | '_')* $oct_digit ($oct_digit | '_')* $int_suffix? => ...,

"0x" ($hex_digit | '_')* $hex_digit ($hex_digit | '_')* $int_suffix? => ...,

The problem with these rules is that they generate multiple tokens for invalid literals like 0invalidSuffix or 123AFB43.
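For example, assuming Init also has the usual identifier rule, an input like 0invalidSuffix would lex as two tokens instead of producing an error (the token names here are made up):

Token::Int("0"), Token::Id("invalidSuffix")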

Currently the way to fix this is to introduce a new rule set for each of these literal kinds:

rule Init {
    ...

    $dec_digit => |lexer| lexer.switch(LexerRule::DecInt),

    "0b" => |lexer| lexer.switch(LexerRule::BinInt),

    "0o" => |lexer| lexer.switch(LexerRule::OctInt),

    "0x" => |lexer| lexer.switch(LexerRule::HexInt),
}

rule DecInt {
    ($dec_digit | '_')* $int_suffix?,

    $ => ... return token ...,

    $whitespace => ... return token ...,
}

rule BinInt { ... }

rule OctInt { ... }

rule HexInt { ... }

With these rules, once we see the start of an integer literal we consume every character until the end of the input or a whitespace character; any character that is invalid for the literal becomes an error.
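For reference, the elided parts of DecInt might be filled in like this (a sketch: Token::Int is a made-up variant, and note that in the $whitespace case the trailing whitespace ends up in the match):

rule DecInt {
    ($dec_digit | '_')* $int_suffix?,

    $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Int(match_))
    },

    $whitespace => |lexer| {
        // The whitespace character is part of the match at this point and
        // would need to be trimmed before constructing the token.
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Int(match_))
    },
}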

If we had a regex for "followed by", we could use the rules in the first version, with a "followed by whitespace or end-of-input" part:

$dec_digit ($dec_digit | '_')* $int_suffix? > ($whitespace | $) => ...,

where > is the "followed by" syntax.

The difference from concatenation is that the "followed by" part is not included in the match. For example, for the input "123 ", the rule above would match and lexer.match_() in the semantic action would return "123", not "123 ".

osa1 commented 2 years ago

Alternatively, if we implement #9 we could simply bind the integer part to a variable and use it in the semantic action. Something like:

int:($dec_digit ($dec_digit | '_')* $int_suffix?) ($whitespace | $) => ... use variable `int` ...,
osa1 commented 2 years ago

I think the "followed by" syntax should actually be called "lookahead", and it shouldn't consume the part of the input it matches (i.e. the lexer should backtrack after matching it).

As an example, rustc's lexer does not include newlines at the end of single-line comments. Doing this is currently a bit verbose in lexgen:

rule SinglelineComment {
    _ # '\n' => |lexer| {
        if lexer.peek() == Some('\n') {
            let comment = lexer.match_();
            lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
        } else {
            lexer.continue_()
        }
    },

    $ => |lexer| {
        let comment = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
    },
}

With the lookahead syntax suggested above, it would look like:

rule SinglelineComment {
    (_ # '\n')* > ('\n' | $) => |lexer| {
        let comment = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Comment(comment))
    }
}

After this simplification a separate rule set is no longer necessary; we can inline the rule at its use site:

rule Init {
    ...

    "//" (_ # '\n')* > ('\n' | $) => |lexer| {
        let comment = lexer.match_();
        lexer.return_(Token::Comment(comment))
    },
}
osa1 commented 2 years ago

Another use of lookahead would be when lexing Rust floats.

In Rust, 1. is a float, but in 1.. the 1 is an integer followed by .. (the range operator: std::ops::Range, std::ops::RangeFrom, etc.). The reference describes this syntax with:

DEC_LITERAL . (not immediately followed by ., _, or an identifier)

"not immediately followed by" part could be implemented using the lookahead operator:

rule Init {
    ...
    $dec_literal '.' > (_ # ('.' | '_' | $$XID_Start) | $) => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::Float(match_)))
    },
}
osa1 commented 2 years ago

This feature doesn't seem to be "zero cost", in the sense that if we implement it everyone will pay a runtime price, even users who don't use the feature.

Here's the problem. Suppose I'm matching these two rules:

("a" > "aa") "aab" => ...,
"aaac" => ...,

The first rule matches "aaab", the second rule matches "aaac".

Now let's imagine how an NFA for these rules would work. After scanning the prefix "aaa", the first rule needs to backtrack (the lookahead "aa" should not be consumed), but the second rule needs to continue. So the state machine needs to maintain separate iterators for these rules: for the first rule the iterator is rewound, while for the second rule we keep incrementing it.

This is already complicated, and I don't think we can avoid "forking" the iterator to implement this in the generated code. But there's more. Because the rule that comes first has priority, in this lexer we want to run the semantic action for the first rule. But if we keep running these "forked" state machines concurrently, one character at a time, the second rule will reach its decision first: it will already be looking at the "b" while the state machine for the first rule is still looking at the second "a" (because we rewound its iterator).

So when the user calls next(), we will have a set of iterators. We need to find the one that is behind all the others and run it until it catches up with them. When all iterators are at the same location in the input, we make progress in all state machines. We also need to assign priorities (just number the rules; a rule that comes later in the code gets a larger number and lower priority) so that if multiple machines accept the input at the same time we can pick the one with the highest priority (i.e. the one that comes first in the lexer definition).
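A very rough sketch of the bookkeeping this would need (not lexgen code, all names made up):

// Each rule runs its own machine with its own input position, because a rule
// with lookahead may have rewound its iterator while the others kept going.
struct RuleSim {
    rule_idx: usize,         // lower index = earlier in the lexer definition = higher priority
    pos: usize,              // this rule's current position in the input
    accepted: Option<usize>, // end of a match accepted by this rule, if any
}

// Advance only the machines that are furthest behind; the others wait until
// every machine is at the same input position again.
fn rules_to_advance(sims: &[RuleSim]) -> Vec<usize> {
    let min_pos = match sims.iter().map(|s| s.pos).min() {
        Some(p) => p,
        None => return Vec::new(),
    };
    sims.iter()
        .enumerate()
        .filter(|(_, s)| s.pos == min_pos)
        .map(|(i, _)| i)
        .collect()
}

// When several machines have accepted at the same point, the rule that comes
// first in the lexer definition wins.
fn pick_match(sims: &[RuleSim]) -> Option<(usize, usize)> {
    sims.iter()
        .filter_map(|s| s.accepted.map(|end| (s.rule_idx, end)))
        .min_by_key(|&(rule_idx, _)| rule_idx)
}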

It's complicated to implement, and it will be costly to everyone, even to users who don't use this feature.

osa1 commented 2 years ago

There could be other implementations that do not fit into the NFA/DFA model.

For example, in the example above, when we see the first "a", we could run a new state machine to match the lookahead part. We would then move to the states for the rest of both rules, or for just the second rule, depending on the result of that state machine.

Another idea could be implementing a VM with a "split" or "fork" instruction.
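For reference, the instruction set of such a VM (in the style of Pike/Thompson regex VMs, not anything that exists in lexgen today) could look roughly like this:

// A fork-style VM: `Split` makes the scheduler continue at both addresses,
// which is how alternation (and a lookahead sub-machine) could be expressed.
enum Inst {
    Char(char),          // consume one input character if it matches
    Split(usize, usize), // fork: continue at both instruction addresses
    Jmp(usize),          // unconditional jump
    Accept(usize),       // report a match, with this semantic action index
}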

In both of these cases the worst-case time will be bad, though: when there are multiple lookaheads, we run each of them separately, visiting the rest of the input again and again for each one.

osa1 commented 2 years ago

alex provides this feature, with the name "right context". See https://www.haskell.org/alex/doc/html/alex-files.html section 3.2.2.1.

osa1 commented 2 years ago

As far as I understand, "right context" is a simpler version of lookahead. Right contexts are not part of the regex syntax, so they cannot be nested within regexes; they're more like guards in pattern matching. Just as each alternative in a match can have one guard, each rule can have one right context. When the rule matches, the right context is matched before the rule is considered successful; if the right context doesn't match, the rule fails.

I think the implementation is straightforward: compile all right contexts to separate NFAs/DFAs, and add an optional right context (a reference to that machine's initial state) to accepting states. When we're in an accepting state, before updating last_match, clone the char iterator and run the machine for the right context. Update last_match only when the right context matches.
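A sketch of that plan (the names are hypothetical, not lexgen's actual internals):

type StateIdx = usize;

// Accepting states optionally carry the initial state of a separately
// compiled machine for the rule's right context.
struct Accepting {
    right_ctx: Option<StateIdx>, // initial state of the right context machine, if any
    semantic_action: usize,
}

// In an accepting state, run the right context (if any) on a *clone* of the
// char iterator so it doesn't consume input, and only update `last_match`
// when it succeeds.
fn maybe_update_last_match<I: Iterator<Item = char> + Clone>(
    acc: &Accepting,
    rest: &I,
    match_end: usize,
    last_match: &mut Option<(usize, usize)>, // (match end, semantic action idx)
    run_right_ctx: impl Fn(StateIdx, I) -> bool,
) {
    let right_ctx_ok = match acc.right_ctx {
        None => true,
        Some(start) => run_right_ctx(start, rest.clone()),
    };
    if right_ctx_ok {
        *last_match = Some((match_end, acc.semantic_action));
    }
}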

osa1 commented 2 years ago

I started implementing right contexts in #37. NFA simulation is straightforward, but DFA simulation requires some changes in DFA accepting states. Suppose I have these rules:

'a' > 'a' = 1,
'a' > 'b' = 2,
'a' > 'c' = 3,
...
'a' = 27,

(> is the right context syntax)

In the NFA we will have 27 epsilon transitions, one to each of these rules. The first 26 rules have a right context; the last rule doesn't.

When the input is "a", we will try each of these 26 right contexts, and finally try the last rule without the right context, which will match.

In the DFA, without right contexts, in an accepting state we simply update last_match and that's it. With right contexts, when multiple right contexts follow the same prefix, we need to try each of them until one matches. This means generalizing DFA accepting states a little to store a list of (right context, semantic action idx) pairs instead of a single semantic action index.
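A sketch of the generalized accepting state (hypothetical names):

// Instead of a single semantic action index, an accepting state holds a
// priority-ordered list of (optional right context, semantic action) pairs.
// The first pair whose right context matches (or that has no right context) wins.
struct DfaAccepting {
    // (initial state of the right context DFA if any, semantic action index)
    actions: Vec<(Option<usize>, usize)>,
}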


One design question here is what to do when we have something like

'a' = 1,
'a' > 'a' = 2,

and the input is "aa". We have two options:

  1. Both rules yield the same match ("a"), and the first rule has priority, so return [1, 1]
  2. The second rule scans a longer part of the input (its right context looks at the second "a"), so per the "longest match" rule return [2, 1]

I feel like (2) will be more intuitive.