maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.87k stars 116 forks

Allowing multiple token enum types #39

Closed mikolajpp closed 4 years ago

mikolajpp commented 5 years ago

I am looking around the Logos source code and checking how feasible it would be to make this work.

While having a single token type is perfectly fine for small grammars, I am working with a language where the number of tokens goes into the hundreds. I have a custom lexer now, but being able to use the extremely nice API of Logos (seriously, the code is some of the cleanest Rust I have seen!) is too attractive not to try.

From what I understand, derive macros are invoked type by type, once for each derive they appear in. Within this framework it would be quite cumbersome to extend Logos the way we would want: being able to iterate over all enums in a particular module that derive Logos (especially to see all tokens at once). Perhaps there is an easy way to approach this, but if not I will gladly take a bite at it either way, as the verbosity of my current lexer keeps me up at night ;-)

mikolajpp commented 5 years ago

Alternatively, such cases could be handled by having a "low-level" token enum type, e.g. "LogosToken", and then simply transcribing it into a more complex enum. However, if it is at all possible, having Logos handle this itself would be great.

mikolajpp commented 5 years ago

After reading more about derive proc macros, such a change seems fundamentally at odds with how macros operate. How about this instead:

  1. Define one "assembly" token type, LogosToken.
  2. Define proper token types as suitable for a particular language, with the top level called MyToken.
  3. Have the lexer accept a type parameter, lexer<Tok>(...), define conversion traits from LogosToken to MyToken, and store MyToken within the lexer directly, removing the need for an additional post-processing step.

In view of how macros work, this seems like a correct solution.
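A minimal sketch of what that conversion layer might look like; none of these names come from Logos, they are purely illustrative:

#[derive(Debug, PartialEq)]
enum LogosToken {
    Plus,
    Minus,
    Number,
}

#[derive(Debug, PartialEq)]
enum BinaryOp {
    Add,
    Sub,
}

// The "proper" token type the parser would consume.
#[derive(Debug, PartialEq)]
enum MyToken {
    BinaryOp(BinaryOp),
    Number,
}

// The transcription step: a plain conversion from the flat lexer
// tokens into the richer, nested enum.
impl From<LogosToken> for MyToken {
    fn from(tok: LogosToken) -> Self {
        match tok {
            LogosToken::Plus => MyToken::BinaryOp(BinaryOp::Add),
            LogosToken::Minus => MyToken::BinaryOp(BinaryOp::Sub),
            LogosToken::Number => MyToken::Number,
        }
    }
}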

maciejhirsz commented 5 years ago

Thanks for the feedback :).

In general, having hundreds of token variants should be quite fine. I've been running this with 122 token variants for Lunarity (Solidity; a token for each operator, punctuator, keyword, etc.), and plan to convert Ratel (JS) to do the same (it's not using Logos yet, but it has a lexer that works on virtually the same principles, with a token enum of 108 variants).

When it comes to actually consuming the tokens, having conversions in the parser (to, say, extract an operator enum out of the token) is quite okay and can be optimized very well.

For example, handling nested (binary or suffix) expressions in Lunarity is now done using the lookup! macro to create a nice-looking lookup table for tokens. This resolves token branching and figures out operator precedence in constant O(1) time, and the handler function called there already knows which operator variant to inject into the expression.
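The general shape of that idea, sketched without Lunarity's actual lookup! macro (all names here are illustrative; a real table would also carry a handler function pointer per entry):

#[derive(Clone, Copy)]
enum Tok {
    Add = 0,
    Mul = 1,
    ParenOpen = 2,
}

// One entry per Tok variant, indexed by discriminant: the binding power
// of the operator, with 0 marking tokens that are not binary operators.
static PRECEDENCE: [u8; 3] = [
    1, // Add
    2, // Mul
    0, // ParenOpen
];

fn binding_power(tok: Tok) -> u8 {
    PRECEDENCE[tok as usize]
}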

BUT

You can also do something different that will keep your code very similar to what it is, by exploiting a feature I just added: callbacks per token definition. It would look something like this (hopefully without typos):

use logos::{Logos, Lexer};

#[derive(Default)]
struct SubTokens {
    pub binary: BinaryOpToken,
}

// The `Default` impl below is needed so that the derive above compiles;
// `Debug` and `PartialEq` just make the variants easy to assert against.
#[derive(Debug, PartialEq)]
enum BinaryOpToken {
    Add,
    Mul,
    Div,
    Sub,
    Waw,
}

impl Default for BinaryOpToken {
    fn default() -> Self {
        BinaryOpToken::Add
    }
}

#[derive(Logos, Debug, PartialEq)]
#[extras = "SubTokens"] 
enum Token {
    #[end]
    End,

    #[error]
    Error,

    #[token("+", callback = "op_add")]
    #[token("*", callback = "op_mul")]
    #[token("/", callback = "op_div")]
    #[token("-", callback = "op_sub")]
    #[token("&", callback = "op_smok_wawelski")] // ;)
    BinaryOp, 

    #[regex="\\d+"]
    Number,
}

fn op_add<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Add;
}

fn op_mul<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Mul;
}

fn op_div<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Div;
}

fn op_sub<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Sub;
}

fn op_smok_wawelski<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Waw;
}

This puts the SubTokens struct in the extras field of the Lexer and calls a callback for each of the definitions, which then sets the proper BinaryOpToken variant on lex.extras.binary. In your parser you can then check that for "+" the token will be set to Token::BinaryOp and extras.binary will be set to BinaryOpToken::Add. No nesting, and it's a bit verbose, but it will get the job done.
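A hedged sketch of the parser side, using the pre-0.11 API described in this thread (the lexer reads the first token on construction, exposes it as a token field, and advance() moves to the next one; this relies on the Debug/PartialEq derives added to BinaryOpToken above):

#[test]
fn binary_op_via_extras() {
    let mut lex = Token::lexer("1+2");
    assert_eq!(lex.token, Token::Number);

    lex.advance();
    assert_eq!(lex.token, Token::BinaryOp);
    // The callback has already stored the concrete operator in the extras.
    assert_eq!(lex.extras.binary, BinaryOpToken::Add);
}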

mikolajpp commented 5 years ago

Thanks for the fast reply. It seems I could perhaps use one token type as well. I avoided doing so partly out of aesthetic concerns and partly anticipating sharing some types with AST structures, but looking around Lunarity it does look very clean, pretty much the way I would like mine to look, and the lookup macro looks great.

So all in all I would consider this issue resolved; feel free to close it or put it under consideration.

maciejhirsz commented 5 years ago

Cool, glad to help :). I'll keep it flagged until I get time to write proper documentation for the crate.

maciejhirsz commented 4 years ago

I believe this can be closed now that 0.11 is out.

erri120 commented 2 years ago

The above solution works on 0.12.0 with a few changes, but the idea is the same:

use logos::Logos;

// Assumed minimal OperatorKind; the original snippet imports it
// from the crate without showing its definition.
#[derive(Debug, PartialEq)]
pub enum OperatorKind {
    Assignment,
}

#[derive(Default)]
pub struct SubTokens {
    pub operator: Option<OperatorKind>,
}

#[derive(Logos, Debug, PartialEq)]
#[logos(extras = SubTokens)]
pub enum LogosToken {
    #[token("=", callback = operator_assignment)]
    Operator,

    #[error]
    Error,
}

fn operator_assignment(lex: &mut logos::Lexer<LogosToken>) {
    lex.extras.operator = Some(OperatorKind::Assignment);
}

#[cfg(test)]
mod test {
    use crate::{LogosToken, OperatorKind};
    use logos::Logos;

    #[test]
    fn test_lexer() {
        let mut lex = LogosToken::lexer("=");
        assert_eq!(lex.next(), Some(LogosToken::Operator));
        assert_eq!(lex.extras.operator, Some(OperatorKind::Assignment))
    }
}

This compiles and the test passes; however, like in the original issue, I just want to write this:

#[derive(Logos, Debug, PartialEq)]
pub enum LogosToken {
    Operator(OperatorKind),

    #[error]
    Error,
}

I currently have 29 operators and 43 keywords, so I'd have to write 72 of these little callback functions. Another pain point is that I have to deal with the optional value in the extras struct SubTokens whenever I match an operator or keyword.
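A hedged sketch of that Option-juggling on the parser side, building on the 0.12 snippet above (handle_operator is a hypothetical stub, not real parser code):

// Hypothetical stub standing in for real parser logic.
fn handle_operator(_op: OperatorKind) {}

fn parse(source: &str) {
    let mut lex = LogosToken::lexer(source);
    while let Some(token) = lex.next() {
        if token == LogosToken::Operator {
            // The concrete operator lives behind an Option in the extras,
            // so every match site needs a take() plus an unwrap or a branch.
            let op = lex.extras.operator.take().expect("set by the callback");
            handle_operator(op);
        }
    }
}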

If nothing speaks against it, I'd try to add a new attribute, #[nested] or similar, so Logos can implement all of this itself:

#[derive(Logos, Debug, PartialEq)]
pub enum LogosToken {
    #[nested]
    Operator(OperatorKind),

    #[error]
    Error,
}

#[derive(Logos, PartialEq, Debug)]
pub enum OperatorKind {
    #[token("=")]
    Assignment,
    //...
    #[error]
    Error,
}

The issue with this approach is that OperatorKind also has to derive Logos so that it can use the #[token] attribute.

erri120 commented 2 years ago

After experimenting a bit more, I found that this also works:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
pub enum Token {
    #[token("(", callback = |_| OperatorKind::ParenthesisOpen)]
    Operator(OperatorKind),

    // 0.12 still requires an #[error] variant for the derive to compile.
    #[error]
    Error,
}