maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

Context-dependent lexing #116

Open maciejhirsz opened 4 years ago

maciejhirsz commented 4 years ago

Logos currently features the morph method, courtesy of @CAD97 (#65), which allows you to switch the Lexer from producing one type of token to another. This is neat, but it requires you to take ownership of the Lexer you are morphing and create a temporary variable to store the new variant before finally morphing back to the original state. This should be mostly, if not entirely, optimized away; however, it's still relatively unwieldy, and not very flexible in situations where you might only have a &mut ref to the Lexer (if it's a field on some Parser you are writing, or inside a callback).
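
For reference, the current workflow looks roughly like this (a sketch; Outer and Inner are hypothetical token enums sharing the same Source, with compatible Extras):

let lex = Outer::lexer("source text");   // Lexer<Outer>
let mut inner = lex.morph::<Inner>();    // consumes the lexer, needs a new binding
let _token = inner.next();               // lex a token in the Inner "mode"
let mut lex = inner.morph::<Outer>();    // morph back to the original state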

Proposal 1: fn advance generic over T: Logos

If I can manage to make the derived lexing use the return value instead of storing the token temporarily on the Lexer, without losing performance (which isn't obvious to me right now), we could in principle decouple the Lexer from the Logos trait entirely.

The new Lexer could then feature a generic advance method such as:

fn advance<Token: Logos>(&mut self) -> Option<Token>

This would mean that in places where you only expect a single token type when parsing, you could supply the lexer with a Token type that only matches that one token type, skipping a jump table on first byte and potentially improving parsing performance.
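
Hypothetically, usage under this proposal could look like the following (Semicolon is a made-up single-variant enum; the point is that advance only has to check for one pattern):

#[derive(Logos)]
enum Semicolon {
    #[token(";")]
    Semi,
}

// In a spot where the grammar only ever allows a semicolon:
let token: Option<Semicolon> = lexer.advance();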

The drawbacks of this approach, as I see them:

  1. Lexer would no longer be an Iterator, but that's easy to remedy with a method that returns an iterator, not dissimilar to spanned.
  2. I can see that when parsing something like JavaScript, you'd have to duplicate a lot of definitions across many different enums since a lot of tokens can appear in different contexts.
  3. If you do specialized lexing and get an unexpected token, you no longer know what type of unexpected token you got since it would always be an error variant.

Proposal 2: fn advance generic over T: Context

Alternatively, instead of defining multiple enums, we could still use a single enum (so Lexer can stay bound to T: Logos and function as an Iterator), but introduce a new concept to the Lexer called Context. A Context would be a zero-sized struct used as a label to drive the Lexer from a different root node in the state machine, with the same goal of potentially eliminating expensive jump tables when only a single variant is expected. Some pseudocode:

struct Statement;
struct Expression;
struct FunctionName;
struct FunctionDeclaration;

#[derive(Logos)]
enum Token {
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*", context(Expression, FunctionName))]
    Identifier,

    #[token("function", context(Statement))]
    FunctionKeyword,

    #[token("(", context(Expression, FunctionDeclaration))]
    OpenParen,

    // ...
}

There is probably a lot more I'm missing, but the general idea here would be: if you are parsing some JS code which reads function foo(..., you'd do, in order:

  1. lex.advance(Statement), which would produce Some(Token::FunctionKeyword).
  2. lex.advance(FunctionName), which would produce Some(Token::Identifier) on the happy path (a miss would still fall back to the jump table).
  3. lex.advance(FunctionDeclaration), which would produce Some(Token::OpenParen), again, happy path avoids the jump table.

This gets rid of all the cons of using multiple different enums; however, it might be much harder to explain why and how it's useful, and at some level it's a very leaky abstraction that aims to shave a couple of nanoseconds off the lexing process at best. The advantage here is that you could start by using .next() everywhere and then, incrementally, as it becomes clear that in some places you only ever expect an open parenthesis or a semicolon, add a context for that case and win some performance.

CAD97 commented 4 years ago

Lexer<str> over Lexer<Token>

Personally, just in terms of desirable API, I'd prefer this.

However, I don't think logos can use it. Why? Because Token::Extras is stored in the lexer as well, and supports being stateful. And stateful Token::Extras is required for some of the use cases that logos advertises support for.

For example, consider lexing Python code. In order to support whitespace-based nesting, you need to remember what level of whitespace indent you're at.

With the current design of logos, I'd probably have a linebreak token \n\s*, and then lexer.extras would record the current indentation stack. This is inherently stateful, and requires the lexer definition to be parameterized by the token definition so it knows what kind of state to be stored.
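
To make that concrete, a minimal sketch of the idea (assuming the extras attribute and callback API of recent logos versions; the dedent handling is simplified, since real Python dedents may pop several levels at once):

#[derive(Logos)]
#[logos(extras = Vec<usize>)] // the indentation stack lives in extras
enum PyToken {
    // A line break plus its leading spaces; the callback compares the
    // new indent width against the stack in lex.extras.
    #[regex(r"\n *", newline)]
    Newline,

    #[regex("[a-zA-Z_][a-zA-Z0-9_]*")]
    Ident,

    #[error]
    Error,
}

fn newline(lex: &mut logos::Lexer<PyToken>) {
    let indent = lex.slice().len() - 1; // bytes after the '\n'
    match lex.extras.last().copied().unwrap_or(0) {
        prev if indent > prev => lex.extras.push(indent), // deeper nesting
        prev if indent < prev => { lex.extras.pop(); }    // dedent (simplified)
        _ => {}
    }
}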

Of course, Lexer<str> vs Lexer<Token> can be a different question than -> Token vs -> () for the actual lexing step. But I think the lexer needs to remain parameterized by the token kind to maintain current functionality.

CAD97 commented 4 years ago

lexer.advance(Mode)

This is pretty much the example of modal lexer API. Whether tokens are all put on one enum or split between multiple enums comes down to the user's preference for surviving large type swamps and desire for type safety of not having impossible results listed as possible.

ANTLR's lexer mode support uses the same enumeration for all token kinds (and all parser node kinds), and thus also requires that, if two modes can parse the same token, the token definition be duplicated between the modes. This is not great, and a common stumbling block.

However, I think using lexer modes to make only expected tokens parseable is an abuse of lexer modes. You don't use a lexer because you're aiming for absolute performance; a scannerless parser will typically win out here (because it naturally only checks for what it expects, and all of parsing hits the same code, keeping it more in cache). Using a lexer is more about simplifying the grammar definition (the parser accepts tokens, not arbitrary text) and improving behavior predictability in failure cases.

CAD97 commented 4 years ago

Stateful lexing

The above said, I still like the idea of a mono-typed Lexer<str> for different Token types. I'm strongly in favor of lexing sticking as close to the ideal stateless fn(&str) -> Option<{ kind: TokenKind, len: usize }> as possible.

So how do we maintain support for stateful lexing?

We wouldn't support Token::Extras on the lexer at all. Instead, encourage use of kind-attached state for transient state, and composition for persistent state. To count the number of tokens of a kind, rather than keeping count in Token::Extras, wrap the lexer in your own stateful iterator that does the counting.
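
For example, the counting case could look like this (a sketch; Counting is a made-up adapter, not part of logos):

struct Counting<I> {
    inner: I,
    count: usize,
}

impl<I: Iterator> Iterator for Counting<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        let item = self.inner.next()?;
        self.count += 1; // the persistent state lives here, not in the lexer
        Some(item)
    }
}

// Usage: wrap the lexer instead of threading state through Extras.
let mut tokens = Counting { inner: Token::lexer(source), count: 0 };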

maciejhirsz commented 4 years ago

I think the Extras could be solved by having Lexer<'s, Source, Extras = ()> and enforcing a type match with bounds on advance. The UX of that might be unwieldy, though.
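
A hypothetical sketch of that bound (not the current API, just to illustrate the shape):

impl<'s, Source: ?Sized, Extras> Lexer<'s, Source, Extras> {
    fn advance<Token>(&mut self) -> Option<Token>
    where
        Token: Logos<'s, Source = Source, Extras = Extras>, // extras type must match the lexer's
    {
        todo!()
    }
}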

CAD97 commented 4 years ago

(Side note: given full control, I'd just remove special trivia handling from logos, and tell any user of trivia that they should specify it explicitly and wrap the lexer in an iterator that filters it out.)
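
For instance (with Whitespace standing in for whatever explicit trivia token the user defines):

let tokens = Token::lexer(source).filter(|tok| !matches!(tok, Token::Whitespace));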

Just spitballing:

trait Token: Sized {
    type Source: ?Sized + logos::Source;
    fn lex(source: &Self::Source) -> Option<(Self, usize)>;
    fn is_trivia(&self) -> bool;
}

struct Lexer<'s, Source: ?Sized> {
    source: &'s Source,
    pos: usize,
}

impl<'s, Source: ?Sized + logos::Source> Lexer<'s, Source> {
    fn advance<Token: logos::Token<Source = Source>>(&mut self) -> Option<(Token, Span)> {
        loop {
            let pos = self.pos;
            let (token, length) = Token::lex(self.source.slice(pos..))?;
            self.pos += length;
            if token.is_trivia() { continue; } // tail recursion as a loop
            else { return Some((token, pos..self.pos)); }
        }
    }
}

I quite like this API, though it would need some tweaking.

Useful note here: Token::lex is the fn (usize, usize) -> (usize, usize) in this sketch. (I think any callbacks that Token::lex itself calls (the #103 case) should take a Cursor type rather than Lexer directly; though, again, Extras complicates things.) So as a user I could use Lexer for an "iterator-like" API, or Token for the minimal "lex front" API.

maciejhirsz commented 4 years ago

I'm already thinking about dropping trivia completely. Since 0.11.0-rc4 you can achieve trivia skipping with logos::skip:

#[derive(Logos)]
enum Token {
    #[regex(r"[ \n\f\t]+", logos::skip)]
    #[error]
    Error,

    // ...
}

I like the idea of doing a cursor-based Lexer and just providing a wrapper for the iterator. Option<(Token, usize)> is already 2 words wide for a C-style enum Token. It could work if lex is fn<E>(&str, E) -> Option<(Token, usize)> where E is either () or &mut Extras (extras would obviously hurt performance in this case, since they add an extra word, but oh well).
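
As a quick sanity check of the two-words claim, on a typical 64-bit target the niche in a C-style enum lets Option pack its discriminant into the same 16 bytes:

#[derive(Clone, Copy)]
enum Token { A, B, C } // C-style: one byte, with plenty of niche

assert_eq!(std::mem::size_of::<Option<(Token, usize)>>(), 16); // 2 × 8 bytes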

maciejhirsz commented 4 years ago

I've a use case where this would be super useful in Ramhorns, and I think I'll go with the following:

let lex = Token1::lexer("...");

let tok2 = lex.next_as::<Token2>();

The main problem here is that Lexer currently stores a copy of the token before emitting it, so I'd have to try moving the token back to the return value of lexing and figure out the performance degradation there. For Extras I can just enforce that Token1::Extras = Token2::Extras.

CAD97 commented 4 years ago

The current state:

    pub fn new(source: &'source Token::Source) -> Self {
        Lexer {
            source,
            token: ManuallyDrop::new(None),
            extras: Default::default(),
            token_start: 0,
            token_end: 0,
        }
    }

    fn next(&mut self) -> Option<Token> {
        self.token_start = self.token_end;

        Token::lex(self);

        // This basically treats self.token as a temporary field.
        // Since we always immediately return a newly set token here,
        // we don't have to replace it with `None` or manually drop
        // it later.
        unsafe { ManuallyDrop::take(&mut self.token) }
    }

So just use the return spot of Token::lex instead of a temporary spot in Lexer.

(I still think that the raw API provided on token should probably be roughly isomorphic to (&str) -> (Token, usize). Callbacks then work with Cursor<'_, str> instead of Lexer<Token>. That would look like

const fn new(src: Self::Source) -> Lexer<Extras = ()> {
    Lexer::new_with(src, ()) // extras are special; this serves the normal case
}

const fn new_with(src: Self::Source, extras: Self::Extras) -> Lexer {
    Lexer {
        src,
        extras,
        token_start: 0,
        token_end: 0,
    }
}

fn next<Token>(&mut self) -> Option<Token>
where
    Token: logos::Token<Source = Self::Source, Extras = Self::Extras>,
{
    self.token_start = self.token_end;
    let (token, len) = Token::lex(self.src, /* extras somehow? */)?;
    self.src = self.src.get_unchecked(len..);
    self.token_end = self.token_start + len;
    Some(token)
}

fn iter<Token>(&mut self) -> impl Iterator<Item = Token>
where
    Token: logos::Token<Source = Self::Source, Extras = Self::Extras>,
{
    iter::from_fn(|| self.next())
}

fn iter_spanned<Token>(&mut self) -> impl Iterator<Item = (Token, Span)>
where
    Token: logos::Token<Source = Self::Source, Extras = Self::Extras>,
{
    iter::from_fn(|| self.next().map(|token| (token, self.span())))
}

obviously with performance measurements along the way to make sure the shifting of responsibilities doesn't degrade performance.)

maciejhirsz commented 4 years ago

@CAD97 I like that design a lot.

I managed to get my thing to work. The moment my parsing performance in Ramhorns jumped from 500 to 800+ MB/s was totally worth all the pain :).

I had to jump through some hoops to get this to work properly: I had to replace recursion with manual stack tracking so that I could use morph. After that it was smooth sailing, but Logos is definitely in dire need of a more ergonomic context-switch mechanism.

CAD97 commented 4 years ago

@maciejhirsz the switch from recursion to a manual stack probably also helps throughput; I can't say for sure, and intuition about performance is often wrong, but my intuition says that replacing recursion with loops tends to help out performance.

With a cheap-Clone lexer, you can also still do recursion using only morph. It would look vaguely like:

fn outer(lex: &mut Lexer<Outer>) -> _ {
    let _ = lex.next();
    let _ = lex.next();
    let mut morphed = lex.clone().morph();
    let _ = inner(&mut morphed);
    *lex = morphed.morph();
    let _ = lex.next();
}

fn inner(lex: &mut Lexer<Inner>) -> _ {
    let _ = lex.next();
    let _ = lex.next();
}

It might not be as convenient as a real polymorphic next, but it works.

Defman commented 4 years ago

I'm looking into using logos for generating a lexer, and as many lexers do, mine requires some state tracking / multiple entry points.

There are mainly two ways of keeping track of state: multiple Token enums as the state marker, or a single Token enum with some attached state marker.

Multiple tokens as state markers

Using multiple Token enums might not be ideal, since some Token variants might be shared across multiple states, but it gives some safety and clarity about which Tokens can be produced in a given state.

Single token with attached state

Using a single Token enum with attached state has the benefit that Token variants can be shared across states and can be marked for consumption during multiple states. A given anonymous function can then be responsible for the state change after a given Token is produced, such that the state change can depend on the produced token value.

#[derive(Logos)]
enum Token {
    // |token| Statement; changes the internal state
    #[regex("[a-zA-Z_][a-zA-Z0-9_]*", context(Expression, FunctionName), |_token, _state| Statement)]
    Identifier,

    #[token("function", context(Statement))]
    FunctionKeyword,

    #[token("(", context(Expression, FunctionDeclaration))]
    OpenParen,

    // ...
}

Having token production be in charge of state changes would be safe as long as there is no overlap (priority collision) among a given state's tokens, and the generated automaton should remain deterministic.

enum State {
  Call,
  Scope(usize),
}

struct Lexer<I, S, T> {
  pos: usize,
  input: I,
  state: S,
  token: PhantomData<T>,
}

impl<I, S, T> Iterator for Lexer<I, S, T> where T: logos::Token {
  type Item = T;
  fn next(&mut self) -> Option<Self::Item> {
    let (token, end_pos, new_state) = <T as logos::Token>::advance(&self.input, self.pos, &self.state)?;
    self.pos = end_pos;
    self.state = new_state;
    Some(token)
  }
}

The logos::Token::advance would have to match on the current state and then proceed with normal lexing. When done, it would have to call the given state-change function to determine which new_state to return. The new state can also depend on the previous state, so as to allow tracking of scope nesting. It might be a good idea to return the new state as an Option, so we have a generic way of not changing state.

I would have liked to track the state using PhantomData, but that would require ownership since we can't bubble the type change.
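
To illustrate the ownership problem: in a typestate encoding the state is part of the Lexer's type, so a transition produces a value of a different type and has to consume self. A sketch with made-up types:

use std::marker::PhantomData;

struct Call;
struct Scope;

struct Lexer<S> {
    pos: usize,
    state: PhantomData<S>,
}

impl Lexer<Call> {
    // The transition returns a *new* Lexer type; it cannot happen
    // behind a &mut Lexer<Call>, hence the ownership requirement.
    fn enter_scope(self) -> Lexer<Scope> {
        Lexer { pos: self.pos, state: PhantomData }
    }
}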

Defman commented 4 years ago

This is the best solution I came up with using the current version of Logos; it uses morph() and std::mem::take on enum variants: https://gist.github.com/Defman/aa16dba8cc418ad95ad5b6d806588c5a

nvnx7 commented 2 years ago

@Defman @maciejhirsz hey, I'm trying to write a lexer along similar lines as above, where I need context-dependent lexing of a token. Would this still be the best way to go? Like having an extra enum (Token) that has variants corresponding to all the logos token enum (Call and Scope) variants. I couldn't seem to figure out any better way 😅 thanks!

maciejhirsz commented 2 years ago

@theNvN without looking at the code you're trying to write I can't tell, but in general using morph to switch back and forth between token types is how this is supposed to be done. Live example in Ramhorns (note how lex is morphed by value, and then reassigned at the end).

buffet commented 2 years ago

Just to add more alternatives, I will probably be using something along the lines of this (based on CAD97's suggestion):

#[extension(pub(crate) trait LexerExt)]
impl<'a, T: Logos<'a>> Lexer<'a, T>
where
    T: Clone,
    T::Extras: Clone,
{
    #[inline]
    fn morphed<T2, R>(&mut self, fun: fn(&mut Lexer<'a, T2>) -> R) -> R
    where
        T2: Logos<'a, Source = T::Source>,
        T2::Extras: From<T::Extras>,
        T2::Extras: Into<T::Extras>,
    {
        let mut tmp = self.clone().morph();
        let res = fun(&mut tmp);
        *self = tmp.morph();
        res
    }
}

fn outer(lex: &mut Lexer<Outer>) -> _ {
    let _ = lex.next();
    let _ = lex.next();
    let _ = lex.morphed(inner);
    let _ = lex.next();
}

fn inner(lex: &mut Lexer<Inner>) -> _ {
    let _ = lex.next();
    let _ = lex.next();
}

kwyntes commented 9 months ago

Any updates on this?

glyh commented 5 months ago

I would add another use case: for parsing OCaml-style comments (i.e. comments that can nest), this feature is needed.
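
For reference, nested comments can also be approximated today with a callback that scans the remainder and counts depth by hand. A rough sketch (the token name is made up; this assumes logos' remainder() and bump() methods):

#[derive(Logos)]
enum Token {
    // The callback consumes everything up to the matching "*)".
    #[token("(*", nested_comment)]
    Comment,

    // ...
}

fn nested_comment(lex: &mut logos::Lexer<Token>) -> bool {
    let bytes = lex.remainder().as_bytes();
    let mut depth = 1; // counts the "(*" that triggered the callback
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"(*") {
            depth += 1;
            i += 2;
        } else if bytes[i..].starts_with(b"*)") {
            depth -= 1;
            i += 2;
            if depth == 0 {
                lex.bump(i); // extend the token over the whole comment
                return true;
            }
        } else {
            i += 1;
        }
    }
    false // unterminated comment lexes as an error
}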