Closed mikolajpp closed 4 years ago
Alternatively, such cases could be handled by having a "low-level" token enum type, e.g. `LogosToken`, and then simply transcribing this type to a more complex enum. However, if possible at all, having Logos handle this directly would be great.
After reading more about derive proc macros, such a change seems fundamentally at odds with how macros operate. How about this then: keep the low-level `LogosToken`, make `lexer<Tok>(...)` generic over the token type, define conversion traits from `LogosToken` to `MyToken`, and actually store `MyToken` within the lexer already, thus not requiring an additional post-processing step. In view of how macros work, this seems like a correct solution.
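The conversion-trait idea could be sketched roughly like this (a minimal sketch with hypothetical enums; `LogosToken`, `MyToken`, and the variants are illustrative, not actual Logos API):

```rust
// Sketch of the proposed transcription: a low-level token enum
// (hypothetical `LogosToken`) converted into a richer user enum
// (hypothetical `MyToken`) via a `From` impl, so the conversion
// could happen inside the lexer rather than in a separate pass.
#[derive(Debug, PartialEq)]
enum LogosToken {
    Plus,
    Minus,
    Number,
}

#[derive(Debug, PartialEq)]
enum BinaryOp {
    Add,
    Sub,
}

#[derive(Debug, PartialEq)]
enum MyToken {
    Operator(BinaryOp),
    Number,
}

impl From<LogosToken> for MyToken {
    fn from(tok: LogosToken) -> Self {
        match tok {
            LogosToken::Plus => MyToken::Operator(BinaryOp::Add),
            LogosToken::Minus => MyToken::Operator(BinaryOp::Sub),
            LogosToken::Number => MyToken::Number,
        }
    }
}

fn main() {
    // The lexer would perform this conversion before yielding tokens.
    assert_eq!(MyToken::from(LogosToken::Plus), MyToken::Operator(BinaryOp::Add));
}
```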
Thanks for the feedback :).
In general, having hundreds of token variants should be quite fine. I've been running this with 122 token variants for Lunarity (Solidity, a token for each operator, punctuator, keyword, etc.), and plan to convert Ratel (JS) to do the same (it's not using Logos yet, but has a Lexer that works on virtually the same principles, with a token enum having 108 variants).
When it comes to actually consuming the tokens, having conversions in the parser (to, say, extract an operator enum out of the token) is quite okay and can be optimized very well.
For example, handling nested (binary or suffix) expressions in Lunarity is now done using the `lookup!` macro to create a nice-looking lookup table for tokens. This resolves token branching and figures out operator precedence in constant O(1) time, and the handler function that is being called there already knows which operator variant to inject into the expression.
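A hand-rolled approximation of that idea (not Lunarity's actual `lookup!` macro; the token names and binding powers here are illustrative) looks like this:

```rust
// Constant-time precedence lookup: index a static table by the
// token's discriminant instead of branching on each variant.
// Lunarity's `lookup!` macro generates a table of this shape.
#[derive(Clone, Copy)]
enum Token {
    Add = 0,
    Mul = 1,
    Div = 2,
    Sub = 3,
}

// Binding power per token, indexed by discriminant: O(1) per lookup.
static PRECEDENCE: [u8; 4] = [
    1, // Add
    2, // Mul
    2, // Div
    1, // Sub
];

fn precedence(token: Token) -> u8 {
    PRECEDENCE[token as usize]
}

fn main() {
    // Multiplication binds tighter than addition.
    assert!(precedence(Token::Mul) > precedence(Token::Add));
}
```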
You can also do something different that will keep your code very similar to what it is by exploiting a feature I just added - callbacks per token definition. It would look something like this (hopefully without typos):
```rust
#[derive(Default)]
struct SubTokens {
    pub binary: BinaryOpToken,
}

// Will need to give this a `Default` impl so that the above compiles
enum BinaryOpToken {
    Add,
    Mul,
    Div,
    Sub,
    Waw,
}

#[derive(Logos, Debug, PartialEq)]
#[extras = "SubTokens"]
enum Token {
    #[end]
    End,

    #[error]
    Error,

    #[token("+", callback = "op_add")]
    #[token("*", callback = "op_mul")]
    #[token("/", callback = "op_div")]
    #[token("-", callback = "op_sub")]
    #[token("&", callback = "op_smok_wawelski")] // ;)
    BinaryOp,

    #[regex = "\\d+"]
    Number,
}

fn op_add<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Add;
}

fn op_mul<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Mul;
}

fn op_div<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Div;
}

fn op_sub<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Sub;
}

fn op_smok_wawelski<S>(lex: &mut Lexer<Token, S>) {
    lex.extras.binary = BinaryOpToken::Waw;
}
```
This will put the `SubTokens` struct as the `extras` field on the `Lexer`, and call a callback for each of the definitions, which then sets the proper `BinaryOpToken` variant on `lex.extras.binary`. In your parser you can then check that for `"+"` the token will be set to `Token::BinaryOp`, and `extras.binary` will be set to `BinaryOpToken::Add`. No nesting, and it's a bit verbose, but it will get the job done.
Thanks for the fast reply. It seems perhaps I could use one token type as well. I did not do so out of rather aesthetic concerns, and partly anticipating sharing some types with AST structures, but looking around Lunarity it does look very clean and pretty much the same as I would like it to look, plus the `lookup!` macro looks great.
So all in all I would consider this issue resolved; feel free to close it or put it under consideration.
Cool, glad to help :). I'll flag it until I get time to do proper documentation around the crate.
I believe this can be closed now that 0.11 is out.
The above solution works for 0.12.0 with a few changes, but the idea is the same:
```rust
// `OperatorKind` was defined elsewhere in the original; a minimal
// definition is included here for completeness:
#[derive(Debug, PartialEq)]
pub enum OperatorKind {
    Assignment,
}

#[derive(Default)]
pub struct SubTokens {
    pub operator: Option<OperatorKind>,
}

#[derive(Logos, Debug, PartialEq)]
#[logos(extras = SubTokens)]
pub enum LogosToken {
    #[token("=", callback = operator_assignment)]
    Operator,

    #[error]
    Error,
}

fn operator_assignment(lex: &mut logos::Lexer<LogosToken>) {
    lex.extras.operator = Some(OperatorKind::Assignment);
}

#[cfg(test)]
mod test {
    use crate::{LogosToken, OperatorKind};
    use logos::Logos;

    #[test]
    fn test_lexer() {
        let mut lex = LogosToken::lexer("=");

        assert_eq!(lex.next(), Some(LogosToken::Operator));
        assert_eq!(lex.extras.operator, Some(OperatorKind::Assignment));
    }
}
```
This compiles and the test is successful. However, like in the original issue, I just want to write this:
```rust
#[derive(Logos, Debug, PartialEq)]
pub enum LogosToken {
    Operator(OperatorKind),

    #[error]
    Error,
}
```
I currently have 29 operators and 43 keywords, so I'd have to write 72 of these little callback functions. Another pain point is that I have to deal with the optional value in the extras struct `SubTokens` whenever I match an operator or keyword.
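That Option handling might look like the following sketch (names mirror the structs above but are illustrative; `Option::take` is one way to consume the value and reset the slot):

```rust
// Sketch of the Option pain point: after matching an operator token,
// the parser must unwrap `extras.operator` every time. `Option::take`
// moves the value out and leaves `None` behind for the next token.
#[derive(Debug, PartialEq)]
enum OperatorKind {
    Assignment,
}

#[derive(Default)]
struct SubTokens {
    operator: Option<OperatorKind>,
}

fn main() {
    let mut extras = SubTokens::default();
    // The lexer callback would set this as a side effect:
    extras.operator = Some(OperatorKind::Assignment);

    // The parser consumes it, forced to handle the `None` case:
    match extras.operator.take() {
        Some(op) => assert_eq!(op, OperatorKind::Assignment),
        None => panic!("operator token without an operator kind"),
    }
    assert!(extras.operator.is_none());
}
```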
If nothing speaks against it, I'd try to add a new attribute, `#[nested]` or similar, so Logos can implement all of this itself:
```rust
#[derive(Logos, Debug, PartialEq)]
pub enum LogosToken {
    #[nested]
    Operator(OperatorKind),

    #[error]
    Error,
}

#[derive(Logos, PartialEq, Debug)]
pub enum OperatorKind {
    #[token("=")]
    Assignment,
    // ...

    #[error]
    Error,
}
```
The issue with this approach is that `OperatorKind` also derives `Logos` so that it can use the `#[token]` attribute.
After experimenting a bit more I found out that this also works:
```rust
#[derive(Logos, Debug, PartialEq)]
pub enum Token {
    #[token("(", callback = |_| OperatorKind::ParenthesisOpen)]
    Operator(OperatorKind),
}
```
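On the consuming side, carrying the kind in the variant payload keeps the parser match self-contained. A sketch with hand-constructed tokens (the Logos-derived lexer is elided; enums and names are illustrative):

```rust
// How a parser might consume the `Operator(OperatorKind)` shape:
// the kind travels inline with the token instead of living in
// `extras`. Tokens are built by hand here; in real code they
// would come from the Logos-derived lexer.
#[derive(Debug, PartialEq)]
enum OperatorKind {
    ParenthesisOpen,
}

#[derive(Debug, PartialEq)]
enum Token {
    Operator(OperatorKind),
    Error,
}

fn describe(token: &Token) -> &'static str {
    match token {
        Token::Operator(OperatorKind::ParenthesisOpen) => "open paren",
        Token::Error => "error",
    }
}

fn main() {
    let tok = Token::Operator(OperatorKind::ParenthesisOpen);
    assert_eq!(describe(&tok), "open paren");
}
```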
I am looking around Logos source code and checking out how feasible it would be to have this work.
While having a single token type is perfectly fine for small grammars, I am working with a language where the number of tokens goes into the hundreds. I have a custom lexer now, but being able to use the extremely nice API of Logos (seriously, the code is some of the cleanest Rust I have seen!) is too attractive not to try.
From what I understand, derive macros work by being invoked type by type, each time the derive is included. In this framework it would be quite cumbersome to extend Logos the way we would like, since we would need to iterate over all enums that have derived `Logos` in a particular module (especially to see all tokens at once). Perhaps there is an easy way to approach this, but if not I will gladly try to bite into it either way, as the verbosity of my current lexer keeps me up at night ;-)