Provide a way to exclude characters

osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro

MIT License

I can only find hypothetical cases for this feature, but there may be real-world use cases too, so I'm opening this issue to document those cases and implement the feature if needed.

Suppose I have a syntax like this: X can be followed by any character, except Y. Currently I need to do something like this: [0..Y-1] | [Y+1..]. This is not going to scale when I exclude multiple characters.
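To illustrate why the manual approach gets unwieldy, here is a minimal sketch of computing the accepted ranges from a set of excluded characters. `complement_ranges` is a hypothetical helper written for this issue, not part of lexgen, and it works on raw code points (it does not carve out the surrogate range, which Rust's `char` cannot represent anyway).

```rust
/// Inclusive range of code points, as a `(start, end)` pair.
type Range = (u32, u32);

/// Largest Unicode code point.
const MAX_CODE_POINT: u32 = 0x10FFFF;

/// Hypothetical helper (not lexgen API): returns the inclusive ranges that
/// cover every code point in `0..=MAX_CODE_POINT` except those in `excluded`.
fn complement_ranges(excluded: &[char]) -> Vec<Range> {
    // Work on sorted, de-duplicated code points.
    let mut points: Vec<u32> = excluded.iter().map(|&c| c as u32).collect();
    points.sort_unstable();
    points.dedup();

    let mut ranges = Vec::new();
    let mut next_start: u32 = 0;
    for p in points {
        // Emit the gap before this excluded point, if there is one.
        if p > next_start {
            ranges.push((next_start, p - 1));
        }
        next_start = p + 1;
    }
    // Emit the tail after the last excluded point.
    if next_start <= MAX_CODE_POINT {
        ranges.push((next_start, MAX_CODE_POINT));
    }
    ranges
}
```

Excluding a single character gives the two ranges written by hand above. Excluding five characters (tab, newline, carriage return, backslash, single quote) already gives five ranges that would each have to be spelled out in the rule.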

Note that a rule like this does not work as expected:

lexer! {
    ...

    rule Init {
        ...
        X => |lexer| lexer.switch(LexerRule::MyRule),
    }

    rule MyRule {
        Y =? |lexer| { ... raise an error ... },
        _,
    }
}

This fails when it sees XY, instead of first yielding the lexeme X and then lexing the rest of the string, starting with Y.

Currently the only way to implement such rules is by explicitly providing accepted character ranges.

In practice I haven't encountered this case yet. In cases like Rust string literals, where a standalone \r is not allowed, it is an error to see a disallowed character, so the rule example above is exactly what we need. I don't know of any cases where an unexpected character is not an error but the start of a new lexeme.

Here's a real world use case for this feature. Rust character literals can be lexed like this:

"'" _ "'" => |lexer| {
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

// NB: Escaped double quote is valid!
"'\\" ('n' | 'r' | 't' | '\\' | '0' | '\'' | '"') "'" => |lexer| {
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

"'\\x" $oct_digit $hex_digit "'" => |lexer| {
    // TODO: Check that the number is in range
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

"'\\u{" $hex_digit+ "}'" => |lexer| {
    // TODO: Check that there's at most 6 digits
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},

This accepts all valid character literals, but the first rule also accepts some invalid ones, like a literal newline between the quotes. Rust character literal syntax disallows these characters unescaped: \n, \r, \t, and \. They need to be escaped with a \.

(I think the first rule also accepts the invalid ''')

To implement this properly, we need to switch to a new rule set where we explicitly reject \n, \t, etc. and accept the rest.

If we had a way to exclude characters, we could do something like:

"'" (_ # ('\t' | '\n' | '\r' | '\\' | '\'')) "'" => |lexer| {
    let match_ = lexer.match_();
    lexer.return_(Token::Lit(Lit::Char(match_)))
},
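For reference, the set such a pattern is meant to accept can be written as a plain Rust predicate. This is only a sketch of the intended semantics, assuming the excluded characters are tab, newline, carriage return, backslash, and single quote. `is_unescaped_char_body` is a hypothetical name, not something lexgen provides.

```rust
/// Hypothetical predicate (not lexgen API): true if `c` may appear
/// unescaped between the quotes of a Rust character literal.
fn is_unescaped_char_body(c: char) -> bool {
    // "Any character" minus the excluded set.
    !matches!(c, '\t' | '\n' | '\r' | '\\' | '\'')
}
```

An exclusion operator in the lexer generator would let the rule express the same "any character minus this set" idea directly, without listing the accepted ranges.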

Provide a way to exclude characters #24