maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.81k stars · 112 forks

Workaround for lack of word boundaries #415

Open ceronman opened 3 weeks ago

ceronman commented 3 weeks ago

Hi,

This is just a question, not a bug report. I'm using Logos to write a lexer for the C programming language. I'm following the book "Writing a C Compiler" by Nora Sandler.

Something unusual about the book is that its lexical grammar uses word boundaries. For example, identifiers and constants are defined as:

Identifier: [a-zA-Z_]\w*\b
Constant: [0-9]+\b

The reason for using word boundaries is that an input like 123foo should be a lexical error. I find this a bit unusual, because this kind of error is usually caught at the parser level, but here we're trying to catch it at the lexer level.
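To make the problem concrete, here is a minimal hand-rolled sketch (not Logos itself, just illustrating longest-match lexing of these two patterns): without a word-boundary check, "123foo" lexes cleanly into two tokens instead of failing.

```rust
// Greedy longest-match scan over just the two patterns above.
// Without a `\b` check, "123foo" becomes Constant("123") + Identifier("foo").
fn scan(mut input: &str) -> Vec<(&'static str, &str)> {
    let mut tokens = Vec::new();
    while let Some(first) = input.chars().next() {
        let len = if first.is_ascii_digit() {
            // [0-9]+
            input.chars().take_while(|c| c.is_ascii_digit()).count()
        } else if first.is_ascii_alphabetic() || first == '_' {
            // [a-zA-Z_]\w*
            input
                .chars()
                .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
                .count()
        } else {
            break; // anything else is out of scope for this sketch
        };
        let kind = if first.is_ascii_digit() { "Constant" } else { "Identifier" };
        tokens.push((kind, &input[..len]));
        input = &input[len..];
    }
    tokens
}

fn main() {
    // No lexical error here, which is exactly the problem described above:
    assert_eq!(scan("123foo"), vec![("Constant", "123"), ("Identifier", "foo")]);
    println!("{:?}", scan("123foo"));
}
```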

Using Logos I initially had something like this:

#[derive(Logos)]
enum TokenKind {
    // ...
    #[regex(r"[a-zA-Z_]\w*")]
    Identifier,

    #[regex(r"[0-9]+")]
    Constant,
    //..
}

But this of course has the problem that it would not catch the error mentioned above. I'm wondering what could be the best workaround for this use case. Currently, what I'm doing is the following:

#[derive(Logos)]
enum TokenKind {
    // ...
    #[regex(r"[a-zA-Z_]\w*")]
    Identifier,

    #[regex(r"[0-9]+")]
    Constant,

    #[regex(r"[0-9]+[a-zA-Z_]+", |_| None)]
    BadIdentifier,
    //..
}

This kinda works, but it looks a bit ugly to me. I'm using a callback that returns None to signal the error. I'm wondering if there is a better way to do that. I saw some examples using #[error], but that attribute no longer works in newer versions of Logos.
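For what it's worth, the workaround relies on Logos preferring the longest match: on `123foo`, the `[0-9]+[a-zA-Z_]+` pattern matches more input than `[0-9]+`, so BadIdentifier wins, and its `|_| None` callback turns the match into an error. A simplified classifier over whole whitespace-delimited words, mirroring that behavior without the Logos derive:

```rust
#[derive(Debug, PartialEq)]
enum TokenKind {
    Identifier,
    Constant,
}

// Classify one whole word the way the three patterns above would:
// the longest match wins, and a BadIdentifier match becomes Err.
fn classify(word: &str) -> Result<TokenKind, ()> {
    let leading_digits = word.chars().take_while(|c| c.is_ascii_digit()).count();
    let is_word = |c: char| c.is_ascii_alphanumeric() || c == '_';
    if leading_digits > 0 && leading_digits == word.len() {
        Ok(TokenKind::Constant) // [0-9]+
    } else if leading_digits == 0 && !word.is_empty() && word.chars().all(is_word) {
        Ok(TokenKind::Identifier) // [a-zA-Z_]\w*
    } else {
        Err(()) // [0-9]+[a-zA-Z_]+ with `|_| None` → lexing error
    }
}

fn main() {
    assert_eq!(classify("foo"), Ok(TokenKind::Identifier));
    assert_eq!(classify("x1"), Ok(TokenKind::Identifier));
    assert_eq!(classify("123"), Ok(TokenKind::Constant));
    assert_eq!(classify("123foo"), Err(()));
    println!("ok");
}
```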

Are there any better ideas?

Thanks in advance.

jeertmans commented 3 weeks ago

Hello @ceronman, thank you for this interesting question!

I think your approach is good, but let me suggest another way: define the whole grammar with tokens/regexes, and let the lexer return Err when it encounters something that does not match any valid token. I recommend taking a look at the JSON parser example, which shows how you can implement complex logic when parsing elements. Ultimately, the parser will need some help to return a meaningful error if you want to indicate that you found an invalid identifier, and this usually requires knowing the current context (i.e., that you were expecting an identifier).

martinstuder commented 1 week ago

> Hello @ceronman, thank you for this interesting question!
>
> I think how you did is good, but I just suggest another way: define the whole grammar with tokens/regexes, and let the lexer return Err when it encounters something that does not match any valid token. I recommend you take a look at the JSON parser example that shows how you can create a complex logic when parsing elements. Ultimately, the parser will need some help to return a meaningful error if you want to indicate that you found an invalid identifier, and this usually needs to know the current context (i.e., that you were expecting an identifier).

@jeertmans I think @ceronman's question is how to make sure that an input like 123foo does not yield TokenKind::Constant followed by TokenKind::Identifier during the lexing phase, which is what Logos would do, and is actually a problem I ran into recently myself. What I would like to achieve is for Logos to give a lexing error if a sequence of characters has not been fully consumed up to the next token delimiter (whatever defines a token delimiter, e.g. whitespace). The approach I am using at the moment is a callback that uses Lexer::remainder() to check that there are no remaining non-whitespace characters and raises an error if there are. That brings me to another thing: if Lexer were peekable (unfortunately it isn't), you could peek at the next token in a callback and make sure it is an expected delimiting token.
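The remainder()-based check can be sketched in isolation. The helper below is hypothetical (it is not part of the Logos API; a real callback would pass `lex.remainder()` into it): after matching something like `[0-9]+`, the match violates the intended `\b` boundary exactly when the first unconsumed character is a word character.

```rust
// Hypothetical helper mirroring the Lexer::remainder() check described
// above: reject a match if the next character would extend it past a
// word boundary.
fn violates_word_boundary(remainder: &str) -> bool {
    remainder
        .chars()
        .next()
        .map_or(false, |c| c.is_ascii_alphanumeric() || c == '_')
}

fn main() {
    // For "123foo", after matching "123" the remainder is "foo": invalid.
    assert!(violates_word_boundary("foo"));
    // For "123 foo", the remainder " foo" starts with whitespace: fine.
    assert!(!violates_word_boundary(" foo"));
    // End of input is also a valid boundary.
    assert!(!violates_word_boundary(""));
    println!("ok");
}
```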

jeertmans commented 1 week ago

Well @martinstuder another solution is to have a lexer with tokens for every possible case you want to cover, and you can use callbacks if you want to return errors instead of tokens. Here, in this example, you need to explicitly write a token that matches wrong identifier. This issue is that you then need to handle special cases, like strings or code comments, that usually require very complex regexes if you want to only rely on tokens.