osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Provide a way to exclude characters #24

Closed osa1 closed 2 years ago

osa1 commented 2 years ago

I can only find hypothetical cases for this feature, but there may be real-world use cases too, so I'm opening this issue to document them and to implement the feature if needed.

Suppose I have a syntax like this: X can be followed by any character except Y. Currently I need to write something like this: [0..Y-1] | [Y+1..]. This does not scale when I need to exclude multiple characters, because every excluded character splits the accepted set into one more range.
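To illustrate why the manual approach gets tedious, here is a minimal sketch in plain Rust (not lexgen code) that computes the accepted ranges for a set of excluded characters. The function name `complement_ranges` is hypothetical, and the sketch assumes the full range is 0..=0x10FFFF without treating surrogates specially.

```rust
/// Hypothetical helper: returns the inclusive code point ranges left over
/// after removing `excluded` from 0..=0x10FFFF.
/// Assumption: surrogate code points are not handled specially here.
fn complement_ranges(excluded: &[char]) -> Vec<(u32, u32)> {
    const MAX: u32 = 0x10FFFF;

    // Sort and deduplicate the excluded code points.
    let mut points: Vec<u32> = excluded.iter().map(|&c| c as u32).collect();
    points.sort_unstable();
    points.dedup();

    let mut ranges = Vec::new();
    let mut start: u32 = 0;
    for p in points {
        // Everything between the previous excluded point and this one is accepted.
        if p > start {
            ranges.push((start, p - 1));
        }
        start = p + 1;
    }
    // Tail range after the last excluded point, if any code points remain.
    if start <= MAX {
        ranges.push((start, MAX));
    }
    ranges
}
```

For a single excluded character this gives the two ranges from the example above: `complement_ranges(&['Y'])` yields `(0, 0x58)` and `(0x5A, 0x10FFFF)`. Excluding five scattered characters would require writing out up to six ranges by hand.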

Note that a rule like this does not work as expected:

lexer! {
    ...

    rule Init {
        ...
        X => |lexer| lexer.switch(LexerRule::MyRule),
    }

    rule MyRule {
        Y =? |lexer| { ... raise an error ... },
        _,
    }
}

This fails when it sees XY, instead of yielding the lexeme X first and then lexing the rest of the string starting with Y.

Currently the only way to implement such rules is by explicitly providing accepted character ranges.

In practice I haven't encountered this case yet. In cases like Rust string literals, where a standalone \r is not allowed, seeing a disallowed character is an error, so the rule example above is exactly what we need. I don't know of any cases where an unexpected character is not an error but the start of a new lexeme.

osa1 commented 2 years ago

Here's a real-world use case for this feature. Rust character literals can be lexed like this:

"'" _ "'" => |lexer| {
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

// NB: Escaped double quote is valid!
"'\\" ('n' | 'r' | 't' | '\\' | '0' | '\'' | '"') "'" => |lexer| {
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

"'\\x" $oct_digit $hex_digit "'" => |lexer| {
    // TODO: Check that the number is in range
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

"'\\u{" $hex_digit+ "}'" => |lexer| {
    // TODO: Check that there's at most 6 digits
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

This accepts valid character literals, but it also accepts some invalid ones, such as a literal (unescaped) newline between the quotes. Rust character literal syntax disallows these characters in unescaped form: \n, \r, \t, \. They need to be escaped with a \.

(I think the first rule also accepts the invalid ''')

To implement this properly, we need to switch to a new rule set where we explicitly reject \n, \t, etc. and accept the rest.
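As a point of reference, the intended character class can be sketched as a predicate in plain Rust, independent of lexgen. The function names below are hypothetical, and treating `'` as excluded is an assumption based on the note above that `'''` is invalid.

```rust
/// Hypothetical predicate: true if `c` may appear unescaped between the
/// quotes of a character literal.
/// Assumption: `'` is excluded in addition to \n, \r, \t and \.
fn is_unescaped_char_body(c: char) -> bool {
    !matches!(c, '\n' | '\r' | '\t' | '\\' | '\'')
}

/// Hypothetical helper: true if `s` is exactly a quote, one allowed
/// unescaped character, and a closing quote. Escapes are not handled here.
fn is_simple_char_literal(s: &str) -> bool {
    let mut chars = s.chars();
    match (chars.next(), chars.next(), chars.next(), chars.next()) {
        (Some('\''), Some(c), Some('\''), None) => is_unescaped_char_body(c),
        _ => false,
    }
}
```

This only covers the unescaped form; the escape rules shown earlier would still be needed alongside it.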

If we had a way to exclude characters, we could do something like:

"'" (_ # ('\t' | '\n' | '\t' | '\\' | '\'')) "'" => |lexer| {
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},