osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License

Provide a way to match end-of-input in non-initial states #13

Closed (osa1 closed this 2 years ago)

osa1 commented 3 years ago

(See also #12 for another behavior that differs between initial and non-initial states.)

Suppose I want to lex C-style single-line comments: // ...

lexer! {
    Lexer -> &'input str;

    rule Init {
        "//" => |lexer| {
            lexer.switch(LexerRule::SingleLineComment)
        },
    }

    rule SingleLineComment {
        '\n' => |lexer| {
            let comment = lexer.match_();
            lexer.switch_and_return(LexerRule::Init, comment)
        },

        _,
    }
}

This won't lex EOF-terminated comments because _ does not match EOF:

#[cfg(test)]
fn ignore_pos<A, E>(ret: Option<Result<(usize, A, usize), E>>) -> Option<Result<A, E>> {
    ret.map(|res| res.map(|(_, a, _)| a))
}

#[test]
fn comment() {
    let input = "// test";
    let mut lexer = Lexer::new(input);
    assert_eq!(ignore_pos(lexer.next()), Some(Ok(input))); // fails
    assert_eq!(ignore_pos(lexer.next()), None);
}

I don't know if we should make _ match EOF, or have another symbol for matching EOF explicitly.

osa1 commented 3 years ago

I think we will need a special symbol, maybe eof, to match end-of-input.

The question is whether to make it a regex or a rule left-hand side (LHS).

If we make it a regex, then we allow nonsensical regexes like eof+ 'a' (match one or more "end of input", then the character 'a'), so I don't like this option much.

If we make it a LHS then it will be similar to _ in how we use it and handle it in the implementation. The example above will look like:

lexer! {
    Lexer -> &'input str;

    rule Init {
        "//" => |lexer| {
            lexer.switch(LexerRule::SingleLineComment)
        },
    }

    rule SingleLineComment {
        '\n' => |lexer| {
            let comment = lexer.match_();
            lexer.switch_and_return(LexerRule::Init, comment)
        },

        eof => |lexer| {
            let comment = lexer.match_();
            lexer.switch_and_return(LexerRule::Init, comment)
        },

        _,
    }
}

Since we cannot write '\n' | eof (because eof is not a regex), this has a little bit of duplication, but I think it's not too bad.
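
To make the regex-vs-LHS distinction concrete, here is a rough sketch of how the rule syntax could be modeled. The types Regex and RuleLhs and the function single_line_comment_lhs are hypothetical names made up for illustration, not lexgen's actual internals. The point is only that when end-of-input lives next to _ at the LHS level, something like eof+ 'a' or '\n' | eof cannot even be constructed.

```rust
/// Hypothetical regex AST. There is deliberately no end-of-input node here,
/// so `eof+ 'a'` and `'\n' | eof` have no representation.
#[derive(Debug, Clone, PartialEq)]
enum Regex {
    Char(char),
    OneOrMore(Box<Regex>),
    Concat(Box<Regex>, Box<Regex>),
    Or(Box<Regex>, Box<Regex>),
}

/// Hypothetical left-hand side of a rule: a regex, the wildcard `_`,
/// or the proposed `eof` symbol.
#[derive(Debug, Clone, PartialEq)]
enum RuleLhs {
    Regex(Regex),
    Wildcard,
    EndOfInput,
}

/// The left-hand sides of the proposed `SingleLineComment` rule set,
/// in the order they appear in the example.
fn single_line_comment_lhs() -> Vec<RuleLhs> {
    vec![
        RuleLhs::Regex(Regex::Char('\n')),
        RuleLhs::EndOfInput,
        RuleLhs::Wildcard,
    ]
}
```

With this shape, Regex::Or only accepts Regex operands, which is exactly why the '\n' and eof arms have to be written as two separate rules.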

Note that we don't need to match eof in the Init rule: Init is special-cased, so when the lexer reaches end-of-input in Init, the next method returns None.
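
For reference, the intended behavior can be sketched as a hand-written lexer with the same two states. This is not what the macro generates; CommentLexer, LexError, and finish_comment are hypothetical names, and returning an error for unmatched input in Init is an assumption of the sketch. It shows the two end-of-input cases side by side: None in Init, and the proposed eof arm in SingleLineComment.

```rust
/// The two lexer states from the example.
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Init,
    SingleLineComment,
}

/// Hypothetical error type: the byte offset where no rule matched.
#[derive(Debug, PartialEq)]
struct LexError {
    position: usize,
}

/// Hand-written stand-in for the generated lexer (illustration only).
struct CommentLexer<'input> {
    input: &'input str,
    state: State,
    match_start: usize,
    pos: usize,
}

impl<'input> CommentLexer<'input> {
    fn new(input: &'input str) -> Self {
        CommentLexer { input, state: State::Init, match_start: 0, pos: 0 }
    }

    /// Return the current match and go back to `Init`, mirroring
    /// `switch_and_return(LexerRule::Init, lexer.match_())`.
    fn finish_comment(&mut self) -> (usize, &'input str, usize) {
        let input = self.input;
        let item = (self.match_start, &input[self.match_start..self.pos], self.pos);
        self.state = State::Init;
        self.match_start = self.pos;
        item
    }
}

impl<'input> Iterator for CommentLexer<'input> {
    type Item = Result<(usize, &'input str, usize), LexError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            let input = self.input;
            let rest = &input[self.pos..];
            match (self.state, rest.chars().next()) {
                // Special case: end-of-input in `Init` ends the token stream.
                (State::Init, None) => return None,
                // "//" switches to the comment state; the match keeps growing.
                (State::Init, Some(_)) if rest.starts_with("//") => {
                    self.pos += 2;
                    self.state = State::SingleLineComment;
                }
                // `Init` has no other rules in the example. Reporting an error
                // and skipping the character is an assumption of this sketch.
                (State::Init, Some(c)) => {
                    let position = self.pos;
                    self.pos += c.len_utf8();
                    self.match_start = self.pos;
                    return Some(Err(LexError { position }));
                }
                // '\n' ends the comment; the newline is part of the match.
                (State::SingleLineComment, Some('\n')) => {
                    self.pos += 1;
                    return Some(Ok(self.finish_comment()));
                }
                // The proposed `eof` arm: end-of-input also ends the comment.
                (State::SingleLineComment, None) => {
                    return Some(Ok(self.finish_comment()));
                }
                // `_` consumes exactly one character and stays in this state.
                (State::SingleLineComment, Some(c)) => {
                    self.pos += c.len_utf8();
                }
            }
        }
    }
}
```

Running this over "// test" yields the whole input as one comment and then None, which is what the failing test above expects. Without the eof arm, the SingleLineComment state would have nothing to do once the input runs out, since _ needs a character to consume.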