osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Rule templates? #30

Open osa1 opened 2 years ago

osa1 commented 2 years ago

Here are rules I'm using to lex Rust decimal, binary, octal, and hexadecimal numbers:

rule DecInt {
    $dec_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

rule BinInt {
    $bin_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

rule OctInt {
    $oct_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

rule HexInt {
    $hex_digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        // TODO: Rust whitespace characters 1, 2, or 3 bytes long
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}
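The slicing repeated in every $whitespace action (and its TODO about multi-byte whitespace characters) could be factored into a plain Rust helper; `char::len_utf8` already handles whitespace of any byte length. A sketch — the helper name is an assumption for illustration, not part of lexgen:

```rust
// Hypothetical helper: strip the final character from a matched slice,
// e.g. the trailing whitespace character that terminated the number.
// `len_utf8` gives the character's byte length (1-4 bytes), so slicing
// at a character boundary is safe for any whitespace character.
fn trim_last_char(s: &str) -> &str {
    match s.chars().last() {
        Some(c) => &s[..s.len() - c.len_utf8()],
        None => s,
    }
}
```

Each $whitespace action could then return `Token::Lit(Lit::Int(trim_last_char(match_)))` instead of repeating the slice expression.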

These rules are identical except for the "digit" part: binary numbers use the $bin_digit regex for the digits, hexadecimal numbers use $hex_digit, and similarly for the other rules.

If we could implement "rule templates" that take regexes as arguments, we could have a single template with a "digit" parameter, pass $hex_digit, $oct_digit, etc. to it, and avoid the duplication.
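Such a template might look like this. To be clear, this is entirely hypothetical syntax — the `rule Int<$digit>` parameter form and the `rule X = Int<...>;` instantiation form are not part of lexgen today, just one way the feature could be written down:

// Hypothetical: $digit is a regex parameter of the template.
rule Int<$digit> {
    $digit,
    '_',

    $int_suffix | $ => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::Int(match_)))
    },

    $whitespace => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(
            LexerRule::Init,
            Token::Lit(Lit::Int(&match_[..match_.len() - match_.chars().last().unwrap().len_utf8()]))
        )
    },
}

// Hypothetical instantiations:
rule DecInt = Int<$dec_digit>;
rule BinInt = Int<$bin_digit>;
rule OctInt = Int<$oct_digit>;
rule HexInt = Int<$hex_digit>;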

osa1 commented 2 years ago

Note that the rules above are not correct. For example, `[1]` won't be lexed correctly: the rules only terminate an integer at a suffix, end of input, or whitespace, so the `]` following the `1` is not handled.