maciejhirsz / logos

Create ridiculously fast Lexers

https://logos.maciej.codes

[Feature request] optional end_of_input token just before Lexer starts returning None #328

Open legeana opened 1 year ago

legeana commented 1 year ago

It would be really convenient to be able to inject a custom EndOfInput token just before the lexer starts returning None.

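  #[derive(Logos)]
  #[logos(error = LexerError)]
  #[logos(extras = LineTracker)]
  #[logos(skip r"#.*")] // comments
  pub enum Token {
      // Proposed attribute (not in Logos today): emitted once, right before the lexer returns None.
      #[end_of_input]
      EndOfInput,
      #[token("\n")]
      Newline,
  }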

For some shell-like grammars where statements are terminated by a newline, having EndOfInput (or even injecting Newline itself at the end) can make parsing unterminated trailing statements much easier, because you can define Statement = Command+ (Newline | EndOfInput).
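To make the grammar point concrete, here is a minimal sketch of a statement parser for Statement = Command+ (Newline | EndOfInput). The Token::Command variant and the parse_statements function are hypothetical names invented for illustration, and skipping blank lines is an assumption that goes slightly beyond the rule as written.

```rust
// Hypothetical token type for illustration; a real lexer would produce these.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Command(String),
    Newline,
    EndOfInput,
}

/// Groups tokens into statements following
/// Statement = Command+ (Newline | EndOfInput).
fn parse_statements(tokens: &[Token]) -> Result<Vec<Vec<String>>, String> {
    let mut statements = Vec::new();
    let mut current = Vec::new();
    for token in tokens {
        match token {
            Token::Command(word) => current.push(word.clone()),
            // Both terminators close a statement in exactly the same way,
            // so a trailing statement without a newline needs no special case.
            Token::Newline | Token::EndOfInput => {
                // Assumption: empty statements (blank lines) are skipped.
                if !current.is_empty() {
                    statements.push(std::mem::take(&mut current));
                }
            }
        }
    }
    if current.is_empty() {
        Ok(statements)
    } else {
        // Reached only when the token stream carries no final terminator.
        Err("unterminated statement".to_string())
    }
}
```

With a trailing EndOfInput, the input "echo hi\nls" parses into two statements; without it, the parser needs a separate end-of-stream branch (the error path here) to deal with the dangling "ls".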

Without this feature I just made a wrapper that returns one additional token after Logos returned None.
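Such a wrapper could look roughly like the sketch below. It is generic over any iterator rather than tied to Logos, and the name WithEndOfInput is made up for illustration; it is not the actual wrapper mentioned above.

```rust
/// Iterator adapter that yields one extra item after the inner iterator
/// is exhausted, then returns None forever.
struct WithEndOfInput<I: Iterator> {
    inner: I,
    // Holds the extra item until it has been handed out.
    end: Option<I::Item>,
}

impl<I: Iterator> WithEndOfInput<I> {
    fn new(inner: I, end: I::Item) -> Self {
        Self { inner, end: Some(end) }
    }
}

impl<I: Iterator> Iterator for WithEndOfInput<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        // Once the extra item is gone, never poll the inner iterator again.
        if self.end.is_none() {
            return None;
        }
        match self.inner.next() {
            Some(item) => Some(item),
            // Inner iterator is done: emit the extra item exactly once.
            None => self.end.take(),
        }
    }
}
```

For example, WithEndOfInput::new(vec![1, 2].into_iter(), 0) yields 1, 2, 0 and then None.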

jeertmans commented 1 year ago

Hello, thanks for your suggestion!

Performance-wise, I don't see any advantage over using the Iterator::chain method:

#[derive(Debug)]
enum Token {
    A,
    B,
    C,
    EOF,
}

fn main() {

    use Token::*;

    let mut lexer = vec![A, B, C, A, B, C]
        .into_iter()
        .chain(Some(EOF));

    while let Some(token) = lexer.next() {
        println!("{:?}", token);
    }
}

I understand that this requires manually adding the last token using chain, but I don't think Logos can actually do anything better than that :-/

maciejhirsz commented 1 year ago

I think we could handle that, I'll keep that in mind when I get to coding!

legeana commented 1 year ago

My feeling is that if you use chain, you lose the logos::Lexer type, so you can't easily access lexer.span(), lexer.slice() and lexer.extras anymore: pub struct Chain<A, B> { /* private fields */ }. Having this functionality as part of Logos makes a difference.
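To illustrate the difference, here is a sketch of a wrapper that appends an end-of-input token while keeping the underlying lexer reachable. WordLexer is a hypothetical, hand-written stand-in for logos::Lexer so that the example compiles without the crate; WithEof and its lexer() accessor are likewise invented names, not Logos API.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum Token {
    Word,
    EndOfInput,
}

/// Hypothetical stand-in for logos::Lexer: splits on whitespace and
/// remembers the span of the most recent token.
struct WordLexer<'s> {
    source: &'s str,
    span: Range<usize>,
}

impl<'s> WordLexer<'s> {
    fn new(source: &'s str) -> Self {
        Self { source, span: 0..0 }
    }
    fn span(&self) -> Range<usize> {
        self.span.clone()
    }
    fn slice(&self) -> &'s str {
        &self.source[self.span.clone()]
    }
}

impl<'s> Iterator for WordLexer<'s> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        let rest = &self.source[self.span.end..];
        // Skip leading whitespace to find where the next word starts.
        let start = self.span.end + (rest.len() - rest.trim_start().len());
        let word = &self.source[start..];
        if word.is_empty() {
            // Leave an empty span at the end of the source.
            self.span = start..start;
            return None;
        }
        let len = word.find(char::is_whitespace).unwrap_or(word.len());
        self.span = start..start + len;
        Some(Token::Word)
    }
}

/// Appends EndOfInput but, unlike Chain, exposes the inner lexer.
struct WithEof<'s> {
    lexer: WordLexer<'s>,
    finished: bool,
}

impl<'s> WithEof<'s> {
    fn new(lexer: WordLexer<'s>) -> Self {
        Self { lexer, finished: false }
    }
    /// Access to span() and slice() survives the wrapping.
    fn lexer(&self) -> &WordLexer<'s> {
        &self.lexer
    }
}

impl<'s> Iterator for WithEof<'s> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        if self.finished {
            return None;
        }
        match self.lexer.next() {
            Some(token) => Some(token),
            None => {
                self.finished = true;
                Some(Token::EndOfInput)
            }
        }
    }
}
```

After each call to next(), tokens.lexer().span() and tokens.lexer().slice() still describe the token just returned, which std::iter::Chain cannot offer because its fields are private. The cost is that every project has to write and maintain this wrapper itself.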