maciejhirsz opened this issue 4 years ago
Lexer<str> over Lexer<Token>

Personally, just in terms of desirable API, I'd prefer this. However, I don't think logos can use it. Why? Because Token::Extras is stored in the lexer as well, and supports being stateful. And stateful Token::Extras is required for some of the use cases that logos advertises support for.
For example, consider lexing Python code. In order to support whitespace-based nesting, you need to remember what level of whitespace indent you're at.
With the current design of logos, I'd probably have a linebreak token \n\s*, and then lexer.extras would record the current indentation stack. This is inherently stateful, and requires the lexer definition to be parameterized by the token definition so it knows what kind of state to store.
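The indentation stack can be sketched in plain Rust, independent of logos (update_indent is a hypothetical helper a Python-style lexer might call on each \n\s* token; it is not a logos API):

```rust
/// Hypothetical helper, not part of logos: given the stack of currently
/// open indentation widths and the indent width seen after a linebreak,
/// push/pop the stack and report how many INDENT (+1) or DEDENT (negative)
/// tokens a Python-style lexer would need to emit.
fn update_indent(stack: &mut Vec<usize>, new_indent: usize) -> isize {
    let current = *stack.last().unwrap_or(&0);
    if new_indent > current {
        stack.push(new_indent); // one level deeper: a single INDENT
        1
    } else {
        let mut dedents = 0;
        while *stack.last().unwrap_or(&0) > new_indent {
            stack.pop(); // close every level deeper than the new width
            dedents -= 1;
        }
        dedents
    }
}

fn main() {
    let mut stack = Vec::new();
    assert_eq!(update_indent(&mut stack, 4), 1); // enter a block
    assert_eq!(update_indent(&mut stack, 8), 1); // nested block
    assert_eq!(update_indent(&mut stack, 0), -2); // both blocks close
}
```

The stack itself is exactly the kind of state that would live in lexer.extras under the current design.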
Of course, Lexer<str> vs Lexer<Token> can be a different question than -> Token vs -> () for the actual lexing step. But I think the lexer needs to remain parameterized by the token kind to maintain current functionality.
lexer.advance(Mode)
This is pretty much the example of modal lexer API. Whether tokens are all put on one enum or split between multiple enums comes down to the user's preference for surviving large type swamps and desire for type safety of not having impossible results listed as possible.
ANTLR's lexer mode support uses the same enumeration for all token kinds (and all parser node kinds), and thus also requires that if two modes can parse the same token that the token definition is duplicated between the modes. This is not great, and a common stumbling block.
However, I think using lexer modes to make only expected tokens parseable is an abuse of lexer modes. You don't use a lexer because you're aiming for absolute performance; a scannerless parser will typically win out here (because it naturally only checks for what it expects, and all of parsing hits the same code, keeping it more in cache). Using a lexer is more about simplifying the grammar definition (the parser accepts tokens, not arbitrary text) and improving behavior predictability in failure cases.
Stateful lexing
The above said, I still like the idea of a mono-typed Lexer<str> for different Token types. I'm strongly in favor of lexing sticking as close to the ideal stateless fn(&str) -> Option<{ kind: TokenKind, len: usize }> as possible.
So how do we maintain support for stateful lexing?
We wouldn't support Token::Extras on the lexer at all. Instead, encourage use of kind-attached state for transient state, and composition for persistent state. To count the number of tokens of a kind, rather than keeping the count in Token::Extras, wrap the lexer in your own stateful iterator that does the counting.
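A sketch of that composition, using a made-up TokenKind in place of a logos-derived enum:

```rust
/// Illustrative only: a stand-in token kind instead of a logos-derived enum.
#[derive(Clone, Copy, PartialEq, Debug)]
enum TokenKind { Ident, Comma }

/// Wraps any token iterator and counts one kind as tokens stream through,
/// instead of keeping the count in `Token::Extras` inside the lexer.
struct Counting<I> {
    inner: I,
    target: TokenKind,
    count: usize,
}

impl<I: Iterator<Item = TokenKind>> Iterator for Counting<I> {
    type Item = TokenKind;
    fn next(&mut self) -> Option<TokenKind> {
        let tok = self.inner.next()?;
        if tok == self.target {
            self.count += 1;
        }
        Some(tok)
    }
}

fn main() {
    let tokens = [TokenKind::Ident, TokenKind::Comma, TokenKind::Ident];
    let mut counting = Counting { inner: tokens.into_iter(), target: TokenKind::Ident, count: 0 };
    // Drain the stream; the wrapper keeps the count.
    assert_eq!(counting.by_ref().count(), 3);
    assert_eq!(counting.count, 2);
}
```

The lexer stays stateless; all persistent state lives in the wrapper.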
I think the Extras could be solved by having Lexer<'s, Source, Extras = ()>, and enforcing a type match with bounds on advance. The UX of that might be unwieldy, though.
(Side note: given full control, I'd just remove special trivia handling from logos, and tell any user of trivia that they should specify it explicitly and wrap the lexer in an iterator that filters it out.)
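That wrapping is just a filter. A minimal sketch with a made-up token kind (Whitespace standing in for trivia):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tok { Word, Whitespace } // stand-in kinds; Whitespace is the trivia

/// Hypothetical user-side wrapper: the lexer still yields trivia,
/// and the user filters it out before parsing.
fn without_trivia(tokens: impl Iterator<Item = Tok>) -> impl Iterator<Item = Tok> {
    tokens.filter(|t| *t != Tok::Whitespace)
}

fn main() {
    let lexed = [Tok::Word, Tok::Whitespace, Tok::Word];
    let kept: Vec<Tok> = without_trivia(lexed.into_iter()).collect();
    assert_eq!(kept, [Tok::Word, Tok::Word]);
}
```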
Just spitballing:
trait Token: Sized {
    type Source: ?Sized + logos::Source;
    fn lex<'s>(source: &'s Self::Source) -> Option<(Self, usize)>;
}

struct Lexer<'s, Source: ?Sized> {
    source: &'s Source,
    pos: usize,
}
impl<'s, Source: ?Sized + logos::Source> Lexer<'s, Source> {
    fn advance<Token: logos::Token>(&mut self) -> Option<(Token, Span)> {
        loop {
            let pos = self.pos;
            let (token, length) = Token::lex(self.source.slice(pos..))?;
            self.pos += length;
            if Token::is_trivia(&token) { continue; } // tail recursion as a loop
            else { return Some((token, pos..self.pos)); }
        }
    }
}
I quite like this API, though it would need some tweaking:
- Option<(Token, NonZeroUsize)> can fit into registers (for C-style enumerations) rather than return on the stack!
- pos: usize in the lexer, so lexing at the ABI level can be just fn(usize, usize) -> (usize, usize)
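The register claim can be sanity-checked with std::mem::size_of. Only the Option<NonZeroUsize> layout is guaranteed by the language; the tuple case relies on rustc's niche optimization, so treat this as an observation about current compilers rather than a guarantee:

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Kind { A, B, C } // stand-in for a C-style token enum

fn main() {
    // Guaranteed: Option<NonZeroUsize> uses 0 as the None niche.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), size_of::<usize>());
    // Observed with current rustc: the same niche keeps the whole Option
    // no bigger than the tuple, i.e. two words, so it can come back in registers.
    assert_eq!(
        size_of::<Option<(Kind, NonZeroUsize)>>(),
        size_of::<(Kind, NonZeroUsize)>()
    );
}
```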
Useful note here: Token::lex is the fn(usize, usize) -> (usize, usize) in this sketch. (I think any callbacks that Token::lex itself calls (the #103 case) should be taking a Cursor type rather than Lexer directly; though, again, Extras complicates things.) So as a user I could use Lexer for an "iterator-like" API, or Token for the minimal "lex front" API.
I'm already thinking about dropping trivia completely. Since 0.11.0-rc4 you can achieve trivia with logos::skip:
#[derive(Logos)]
enum Token {
    #[regex(r"[ \n\f\t]+", logos::skip)]
    #[error]
    Error,
    // ...
}
I like the idea of doing a cursor-based Lexer, and just providing a wrapper for the iterator. Option<(Token, usize)> is already 2 words wide for a C-style enum Token. It could work if lex is fn<E>(&str, E) -> Option<(Token, usize)> where E is either () or &mut Extras (having extras obviously screws performance in this case, since it adds an extra word, but oh well).
I have a use case where this would be super useful in Ramhorns, and I think I'll go with the following:

let mut lex = Token1::lexer("...");
let tok2 = lex.next_as::<Token2>();
The main problem here is that Lexer currently stores a copy of the token before emitting it, so I would have to try to move it back to a return value on lexing, and figure out the performance degradation there. For Extras I can just enforce that Token1::Extras = Token2::Extras.
The current state:
pub fn new(source: &'source Token::Source) -> Self {
    Lexer {
        source,
        token: ManuallyDrop::new(None),
        extras: Default::default(),
        token_start: 0,
        token_end: 0,
    }
}

fn next(&mut self) -> Option<Token> {
    self.token_start = self.token_end;

    Token::lex(self);

    // This basically treats self.token as a temporary field.
    // Since we always immediately return a newly set token here,
    // we don't have to replace it with `None` or manually drop
    // it later.
    unsafe { ManuallyDrop::take(&mut self.token) }
}
So just use the return spot of Token::lex instead of a temporary spot in Lexer.
(I still think that the raw API provided on the token should probably be roughly isomorphic to (&str) -> (Token, usize). Callbacks then work with Cursor<'_, str> instead of Lexer<Token>. That would look like:
const fn new(src: Self::Source) -> Lexer<Extras = ()> {
    Lexer::new_with(src, ()) // extras are special, serve the normal case
}

const fn new_with(src: Self::Source, extras: Self::Extras) -> Lexer {
    Lexer {
        src,
        extras,
        token_start: 0,
        token_end: 0,
    }
}

fn next<Token>(&mut self) -> Option<Token>
where Token: logos::Token<Source = Self::Source, Extras = Self::Extras>
{
    self.token_start = self.token_end;
    let (token, len) = Token::lex(self.src, /* extras somehow? */)?;
    self.src = self.src.get_unchecked(len..);
    self.token_end = self.token_start + len;
    Some(token)
}

fn iter<Token>(&mut self) -> impl Iterator<Item = Token> + '_
where Token: logos::Token<Source = Self::Source, Extras = Self::Extras>
{
    iter::from_fn(|| self.next())
}

fn iter_spanned<Token>(&mut self) -> impl Iterator<Item = (Token, Span)> + '_
where Token: logos::Token<Source = Self::Source, Extras = Self::Extras>
{
    iter::from_fn(|| self.next().map(|token| (token, self.span())))
}
obviously with performance measurements along the way to make sure the shifting of responsibilities doesn't degrade performance.)
@CAD97 I like that design a lot.
I managed to get my thing to work. The moment my parsing performance in Ramhorns jumped from 500 to 800+ MB/s was totally worth all the pain :).
I had to jump through some hoops to get this to work properly. I had to replace recursion with manual stack tracking so that I could use morph; after that it was smooth sailing, but Logos is definitely in dire need of a more ergonomic context switch mechanism.
@maciejhirsz the switch from recursion to a manual stack probably also helps throughput; I can't say for sure, and intuition about performance is often wrong, but my intuition says that replacing recursion with loops tends to help out performance.
With a cheap-to-Clone lexer, you can also still do recursion with only morph. It would look vaguely like:
fn outer(lex: &mut Lexer<Outer>) -> _ {
    let _ = lex.next();
    let _ = lex.next();

    let mut morphed = lex.clone().morph();
    let _ = inner(&mut morphed);
    *lex = morphed.morph();

    let _ = lex.next();
}

fn inner(lex: &mut Lexer<Inner>) -> _ {
    let _ = lex.next();
    let _ = lex.next();
}
It might not be as convenient as a real polymorphic next, but it works.
I'm looking into using logos to generate a lexer, and as many lexers do, mine requires some state tracking/multiple entry points. There are mainly two ways of keeping track of state: multiple Token enums as the state marker, vs. a single Token enum with some attached state marker.

Using multiple Token enums might not be ideal, since some Token variants might be shared across multiple states, but it gives some safety and clarity about which Tokens can be produced in a given state. Using a single Token enum with attached state has the benefit that Token variants can be shared across states and can be marked for consumption during multiple states, and a given anonymous function can be responsible for the state change after a given Token is produced, so that the state change can depend on the produced token value.
#[derive(Logos)]
enum Token {
    // |token| Statement; changes the internal state
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*", context(Expression, FunctionName), |_token, _state| Statement)]
    Identifier,
    #[token("function", context(Statement))]
    FunctionKeyword,
    #[token("(", context(Expression, FunctionDeclaration))]
    OpenParen,
    // ...
}
Having token production be in charge of state changes would be safe as long as there is no overlap (priority collision) among a given state's tokens, and the generated automaton should be deterministic.
enum State {
    Call,
    Scope(usize),
}

struct Lexer<I, S, T> {
    pos: usize,
    input: I,
    state: S,
    token: PhantomData<T>,
}

impl<I, S, T> Iterator for Lexer<I, S, T> where T: Logos::Token {
    type Item = T;

    fn next(&mut self) -> Option<Self::Item> {
        let (token, end_pos, new_state) = <T as Logos::Token>::advance(&self.input, self.pos, &self.state)?;
        self.pos = end_pos;
        self.state = new_state;
        Some(token)
    }
}
The Logos::Token::advance would have to match on the current state and then proceed with normal lexing. When done, it would have to call the given state-change function to determine which new_state to return. The new state can also depend on the previous state, to allow tracking of scope nesting. It might be a good idea to return the new state as an Option so we have a generic way of not changing state.
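The Option-as-no-change convention is just unwrap_or; a tiny sketch reusing the State enum from the code above (next_state is a hypothetical helper, not logos API):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Call,
    Scope(usize),
}

/// `None` from the state-change callback means "keep the current state".
fn next_state(current: State, change: Option<State>) -> State {
    change.unwrap_or(current)
}

fn main() {
    assert_eq!(next_state(State::Call, None), State::Call);
    assert_eq!(next_state(State::Call, Some(State::Scope(1))), State::Scope(1));
}
```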
I would have liked to track the state by using PhantomData, but that would require ownership since we can't bubble up the type change.
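To see why ownership is forced: with PhantomData the state lives in the type, so a state transition produces a value of a different type and cannot happen behind a &mut reference. An illustrative typestate sketch (hypothetical names, not logos code):

```rust
use std::marker::PhantomData;

struct Statement;  // zero-sized state markers
struct Expression;

struct Lexer<S> {
    pos: usize,
    _state: PhantomData<S>,
}

impl Lexer<Statement> {
    /// The state is part of the type, so switching it yields a *new type*;
    /// that is only possible by value, not through `&mut self`.
    fn into_expression(self) -> Lexer<Expression> {
        Lexer { pos: self.pos, _state: PhantomData }
    }
}

fn main() {
    let lex = Lexer::<Statement> { pos: 7, _state: PhantomData };
    let lex: Lexer<Expression> = lex.into_expression();
    assert_eq!(lex.pos, 7);
}
```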
This is the best solution I came up with using the current version of Logos; it uses morph() and enum variants with std::mem::take:
https://gist.github.com/Defman/aa16dba8cc418ad95ad5b6d806588c5a
@Defman @maciejhirsz hey, I'm trying to write a lexer along similar lines as above, where I need context-dependent lexing of a token. Would this still be the best way to go? Like having an extra enum (Token) that has variants corresponding to all logos token enum (Call and Scope) variants. I couldn't seem to figure out any better way 😅
thanks!
@theNvN without looking at the code you're trying to write I can't tell, but in general using morph to switch back and forth between token types is how this is supposed to be done. Live example in Ramhorns (note how lex is morphed by value, and then reassigned at the end).
Just to add more alternatives, I will probably be using something along the lines of this (based on CAD97's suggestion):
#[extension(pub(crate) trait LexerExt)]
impl<'a, T: Logos<'a>> Lexer<'a, T>
where
    T: Clone,
    T::Extras: Clone,
{
    #[inline]
    fn morphed<T2, R>(&mut self, fun: fn(&mut Lexer<'a, T2>) -> R) -> R
    where
        T2: Logos<'a, Source = T::Source>,
        T2::Extras: From<T::Extras>,
        T2::Extras: Into<T::Extras>,
    {
        let mut tmp = self.clone().morph();
        let res = fun(&mut tmp);
        *self = tmp.morph();
        res
    }
}
fn outer(lex: &mut Lexer<Outer>) -> _ {
    let _ = lex.next();
    let _ = lex.next();

    let _ = lex.morphed(inner);

    let _ = lex.next();
}

fn inner(lex: &mut Lexer<Inner>) -> _ {
    let _ = lex.next();
    let _ = lex.next();
}
Any updates on this?
I would add another use case: parsing OCaml-style comments (i.e. nestable comments) requires this feature.
Logos currently features the morph method, courtesy of @CAD97 (#65), which allows you to switch the Lexer from producing one type of token to another. This is neat, but it requires you to take ownership of the Lexer you are morphing and create a temporary variable to store the new variant, before finally morphing back to the original state. This should be mostly if not entirely optimized away; however, it's still relatively unwieldy, and not very flexible in situations where you might only have a &mut ref to the Lexer (if it's a field on some Parser you are writing, or inside the callbacks).

Proposal 1: fn advance generic over T: Logos
If I can manage to make the derived lexing use the return value instead of storing the token temporarily on the Lexer without losing performance (which isn't obvious to me right now), we could in principle decouple the Lexer from the Logos trait entirely; that is, Lexer<'source, Token: Logos> (Lexer<Token> with lifetime elision) becomes Lexer<'source, Source: Source> (Lexer<str> with lifetime elision).

The new Lexer can then feature a generic advance method. This would mean that in places where you only expect a single token type when parsing, you could supply the lexer with a Token type that only matches that one token type, skipping the jump table on the first byte and potentially improving parsing performance.

The drawbacks of this approach as I see it:
Lexer would no longer be an Iterator, but that's easy to remedy with a method that returns an iterator, not dissimilar to spanned.
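A toy, self-contained model of what that decoupling could look like (every name here is hypothetical; the real Logos and Source traits differ): the lexer is parameterized only by the source, and advance is generic over the token trait.

```rust
/// Hypothetical trait standing in for `Logos`: a token type knows how to
/// match itself at the front of the remaining input.
trait Lex: Sized {
    fn lex(input: &str) -> Option<(Self, usize)>;
}

/// The lexer is parameterized by the source only, not by the token type.
struct Lexer<'s> {
    source: &'s str,
    pos: usize,
}

impl<'s> Lexer<'s> {
    fn advance<T: Lex>(&mut self) -> Option<T> {
        let (token, len) = T::lex(&self.source[self.pos..])?;
        self.pos += len;
        Some(token)
    }
}

// Two unrelated token types lexed from the same Lexer.
#[derive(Debug, PartialEq)]
struct OpenParen;
impl Lex for OpenParen {
    fn lex(input: &str) -> Option<(Self, usize)> {
        input.starts_with('(').then(|| (OpenParen, 1))
    }
}

#[derive(Debug, PartialEq)]
struct Word(String);
impl Lex for Word {
    fn lex(input: &str) -> Option<(Self, usize)> {
        // ASCII-only for simplicity; byte index equals char count here.
        let len = input.chars().take_while(|c| c.is_ascii_alphabetic()).count();
        (len > 0).then(|| (Word(input[..len].to_string()), len))
    }
}

fn main() {
    let mut lex = Lexer { source: "foo(", pos: 0 };
    assert_eq!(lex.advance::<Word>(), Some(Word("foo".into())));
    assert_eq!(lex.advance::<OpenParen>(), Some(OpenParen));
    assert_eq!(lex.pos, 4);
}
```

The narrow token types here play the role of "only matches the one token you expect", which is where the jump-table skip would come from.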
Proposal 2: fn advance generic over T: Context
Alternatively, instead of defining multiple enums, we could still use a single enum (so Lexer can stay bound to T: Logos and function as an Iterator), but introduce a new concept to the Lexer called Context. A Context would be a 0-sized struct used as a label to drive the Lexer from a different root node in the state machine, with the same goal of potentially eliminating expensive jump tables when only a single variant is expected.

There is probably a lot more I'm missing, but the general idea here would be: if you are parsing some JS code which reads function foo(..., you'd do, in order:
- lex.advance(Statement), which would produce Some(Token::FunctionName).
- lex.advance(FunctionName), which would produce Some(Token::Identifier) as a happy path (a miss would still result in falling back to the jump table).
- lex.advance(FunctionDeclaration), which would produce Some(Token::OpenParen); again, the happy path avoids the jump table.

This gets rid of all the cons of using multiple different enums; however, it might be much harder to explain why and how it's useful, and at some level it's a very leaky abstraction that aims to shave a couple of ns off the lexing process at best. The advantage here is that you could start by using .next() everywhere and, incrementally, as it becomes clear that in some places you only ever expect an open parenthesis or a semicolon, add a context for that case and win some performance.
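A rough, self-contained model of that happy-path idea (all names hypothetical; none of this is the logos API): a zero-sized context checks for the one expected token first and only falls back to the full first-byte dispatch on a miss.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Token { OpenParen, Semicolon, Identifier }

/// Full first-byte dispatch, standing in for the generated jump table.
fn lex_any(input: &str) -> Option<(Token, usize)> {
    match input.bytes().next()? {
        b'(' => Some((Token::OpenParen, 1)),
        b';' => Some((Token::Semicolon, 1)),
        b if b.is_ascii_alphabetic() => {
            let len = input.bytes().take_while(|b| b.is_ascii_alphabetic()).count();
            Some((Token::Identifier, len))
        }
        _ => None,
    }
}

/// Hypothetical zero-sized context: "I expect an open paren here".
struct ExpectOpenParen;

impl ExpectOpenParen {
    fn advance(self, input: &str) -> Option<(Token, usize)> {
        if input.bytes().next() == Some(b'(') {
            Some((Token::OpenParen, 1)) // happy path: no jump table
        } else {
            lex_any(input) // miss: fall back to the full dispatch
        }
    }
}

fn main() {
    assert_eq!(ExpectOpenParen.advance("(x"), Some((Token::OpenParen, 1)));
    // A miss still lexes correctly via the fallback:
    assert_eq!(ExpectOpenParen.advance("foo("), Some((Token::Identifier, 3)));
}
```

Because the miss path falls back to the full dispatch, correctness never depends on the context being right; the context only changes which path is fast.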