osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro

Implement an easy way to get the matched char in a wildcard rule #10

Open osa1 opened 3 years ago

osa1 commented 3 years ago

Currently, getting the matched character in a wildcard rule is quite verbose (and probably also inefficient):

_ => |mut lexer| {
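    // The char matched by the wildcard is the last char of the current match.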
    let char = lexer.match_().chars().next_back().unwrap();
    ...
}

One easy fix would be to add a char method to lexer that returns the last matched character.
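For comparison, a sketch of how the example above might read with such a method (the char() name and behavior are hypothetical, not part of the current lexer API):

_ => |mut lexer| {
    // Hypothetical: char() would return the last character consumed for this match.
    let c = lexer.char();
    ...
}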

Alternatively, with #9 we could allow <char:_> => ... syntax.
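A rough sketch of how that could read, assuming the binding from #9 would be visible inside the semantic action (exactly how bound names are exposed to actions is still open):

<char:_> => |mut lexer| {
    // Hypothetical #9-style binding: char names the character matched by the
    // wildcard, so no string slicing is needed.
    ...
}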

osa1 commented 2 years ago

Another example:

let whitespace =
    ['\t' '\n' '\u{B}' '\u{C}' '\r' ' ' '\u{85}' '\u{200E}' '\u{200F}' '\u{2028}' '\u{2029}'];

rule DecInt {
    ($dec_digit | '_')* $int_suffix?,

    $ => |lexer| {
        let match_ = lexer.match_();
        lexer.return_(Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters can be 1, 2, or 3 bytes long
        lexer.return_(Token::Lit(Lit::Int(
            &match_[..match_.len() - match_.chars().last().unwrap().len_utf8()],
        )))
    },
}

In the last rule we want to exclude the trailing whitespace. We can't just drop the last byte, as the allowed whitespace characters can be 1, 2, or 3 bytes long. If we could bind the whitespace character we could do match_[..match_.len() - whitespace_char.len_utf8()].
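For reference, a small standalone check of the byte lengths involved (plain Rust, independent of lexgen), plus the one-line trim that binding the whitespace character would enable:

fn main() {
    // The whitespace set above mixes 1-, 2-, and 3-byte UTF-8 characters.
    assert_eq!(' '.len_utf8(), 1);        // U+0020
    assert_eq!('\u{85}'.len_utf8(), 2);   // U+0085 (next line)
    assert_eq!('\u{2028}'.len_utf8(), 3); // U+2028 (line separator)

    // With the trailing whitespace char in hand, trimming is a single slice:
    let match_ = "123\u{2028}";
    let ws_char = match_.chars().next_back().unwrap();
    assert_eq!(&match_[..match_.len() - ws_char.len_utf8()], "123");
}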