osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Provide a way to specify error value when failing to match #36

Open osa1 opened 2 years ago

osa1 commented 2 years ago

This is related to #35 and we use the same example.

Suppose in b"\xa" I want to fail with "invalid hex escape".

With a "cut" operator as described in #35, the best we can get concisely is an "invalid token" error.

To raise an "invalid hex escape" error we need to add a new rule. For example:

rule ByteString {
    "\\x" => |lexer| lexer.switch(LexerRule::ByteStringHexEscape),

    ($ascii_for_string | $byte_escape | $string_continue | "\r\n")* '"' => |lexer| {
        let match_ = lexer.match_();
        lexer.switch_and_return(LexerRule::Init, Token::Lit(Lit::ByteString(match_)))
    },
}

rule ByteStringHexEscape {
    $hex_digit $hex_digit => |lexer| lexer.switch(LexerRule::ByteString),
    $ | _ | _ _ =? |lexer| lexer.return_(Err(CustomError::InvalidHexEscape)),
}

The new rule ByteStringHexEscape matches two hex digits, and fails with InvalidHexEscape on everything else. Note that we don't want to match more than two characters here, so we have cases for end-of-stream ($), one character (_) and two characters (_ _). We can't use something like _* because that would also match valid input such as \xaaaa and fail with InvalidHexEscape.
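To make the intended behavior concrete, here is a hand-written model of what the ByteStringHexEscape rule is meant to do. This is plain Rust, not lexgen code: the function name `lex_hex_escape` is hypothetical, the `CustomError` enum is a stand-in for the one in the example, and `$hex_digit` is assumed to mean an ASCII hex digit.

```rust
// Stand-in for the user-defined error type used in the rules above.
#[derive(Debug, PartialEq)]
enum CustomError {
    InvalidHexEscape,
}

/// Hypothetical model of the `ByteStringHexEscape` rule.
///
/// `rest` is the input immediately after a `\x`. On success, returns the
/// input that follows the two hex digits, which is where lexing would resume
/// in `ByteString`. Every other case (end of input, one character, or two
/// characters that are not both hex digits) fails with `InvalidHexEscape`.
fn lex_hex_escape(rest: &str) -> Result<&str, CustomError> {
    let mut chars = rest.chars();
    // Look at no more than two characters, mirroring `$ | _ | _ _`.
    match (chars.next(), chars.next()) {
        (Some(a), Some(b)) if a.is_ascii_hexdigit() && b.is_ascii_hexdigit() => {
            Ok(chars.as_str())
        }
        _ => Err(CustomError::InvalidHexEscape),
    }
}
```

With this model, `lex_hex_escape("aaaa")` consumes only the first two digits and leaves `"aa"` for the rest of the string, while `lex_hex_escape("a\"")` fails, which is the behavior wanted for b"\xa".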

(This is a case where a syntax for matching between given numbers of occurrences would be useful, e.g. _{0,2} would expand to $ | _ | _ _. alex has this feature.)

It would be good to have a more concise way of failing with a given error. For example, in the definition of byte_escape:

let byte_escape = ("\\x" $hex_digit $hex_digit) | "\\n" | "\\r" | "\\t" | "\\\\" | "\\0" | "\\\"" | "\\'";

Maybe we could have something like:

let byte_escape = ("\\x" !CustomError::InvalidHexEscape $hex_digit $hex_digit)
            | "\\n" | "\\r" | "\\t" | "\\\\" | "\\0" | "\\\"" | "\\'";

where ! is the cut operator as described in #35, but when the match fails, instead of InvalidToken we now raise InvalidHexEscape.
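A rough sketch of the semantics I have in mind, written as plain Rust rather than lexgen syntax. Everything here is hypothetical (the `Attempt` type, the function names, and the `LexError` enum standing in for the lexer's error type); the assumption is that a failure before the cut lets other alternatives be tried, while a failure after the cut stops with the error given at the cut.

```rust
// Stand-in for the lexer error type; `InvalidToken` models the default error.
#[derive(Debug, PartialEq)]
enum LexError {
    InvalidToken,
    InvalidHexEscape,
}

// Outcome of trying a single alternative.
enum Attempt<'a> {
    Matched(&'a str),    // success, carries the remaining input
    NoMatch,             // failed before any cut: try the next alternative
    Committed(LexError), // failed after a cut: stop with this error
}

// Models `"\\x" !CustomError::InvalidHexEscape $hex_digit $hex_digit`.
fn hex_escape(input: &str) -> Attempt<'_> {
    let rest = match input.strip_prefix("\\x") {
        Some(rest) => rest,
        None => return Attempt::NoMatch,
    };
    // Cut point: from here on, a failure is reported as InvalidHexEscape.
    let mut chars = rest.chars();
    match (chars.next(), chars.next()) {
        (Some(a), Some(b)) if a.is_ascii_hexdigit() && b.is_ascii_hexdigit() => {
            Attempt::Matched(chars.as_str())
        }
        _ => Attempt::Committed(LexError::InvalidHexEscape),
    }
}

// Models the remaining alternatives of `byte_escape`, which have no cut.
fn simple_escape(input: &str) -> Attempt<'_> {
    for esc in ["\\n", "\\r", "\\t", "\\\\", "\\0", "\\\"", "\\'"] {
        if let Some(rest) = input.strip_prefix(esc) {
            return Attempt::Matched(rest);
        }
    }
    Attempt::NoMatch
}

// Tries the alternatives in order; a committed failure is not retried.
fn byte_escape(input: &str) -> Result<&str, LexError> {
    let alternatives: [fn(&str) -> Attempt<'_>; 2] = [hex_escape, simple_escape];
    for alt in alternatives {
        match alt(input) {
            Attempt::Matched(rest) => return Ok(rest),
            Attempt::Committed(err) => return Err(err),
            Attempt::NoMatch => continue,
        }
    }
    Err(LexError::InvalidToken)
}
```

In this model the input `\xa"` fails with `InvalidHexEscape` because the `\x` prefix has already been matched when the hex digits fail, whereas `\q` never reaches a cut and falls through to the default `InvalidToken`.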

One question is whether we also want a syntax for specifying the error value without also adding a "cut". So far I haven't needed this.