maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0
2.83k stars 112 forks source link

I'd like to volunteer! #291

Open agluszak opened 1 year ago

agluszak commented 1 year ago

I'd like to help with the rewrite mentioned here: https://github.com/maciejhirsz/logos/releases/tag/v0.13 :) I already have some experience with logos' internals, at least the "frontend" part.

maciejhirsz commented 1 year ago

Thank you!

So I had a bit of a think about this, I think the best way to proceed that will futureproof the rewrite for inlining would be to define states using a macro_rules! macro, something like:

fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
    use logos::internal::{CallbackResult, LexerInternal};

    enum State {
        GoTo1,
        GoTo2,
        GoTo3,
    }

    macro_rules! goto {
        (GoTo1, $lex:ident) => {/* ... */};
        (GoTo2, $lex:ident) => {/* ... */};
        (GoTo3, $lex:ident) => {/* ... */};
    }

    let mut state = State::Goto3;

    // Assuming Goto3 is the initial state
    goto!(GoTo3, lex);

    loop {
        match state {
            State::GoTo1 => goto!(GoTo1, lex),
            State::GoTo2 => goto!(GoTo2, lex),
            State::GoTo3 => goto!(GoTo3, lex),
        }
    }
}

Just passing around the lex ident is likely not enough, I think we'll need at least one, possibly two variables in stack declared at the top of the function body to account for:

  1. A possible match that should be emitted instead of an error if current state doesn't find a match. It should replace this mess.
  2. usize value for lookahead that would replace the need to duplicate states possible combination of at and with that current macro is doing, as in:
    #[inline]
    fn goto3_at1_with3<'s>(lex: &mut Lexer<'s>) {
        match lex.read_at::<&[u8; 2usize]>(1usize) {
            Some(b"-z") => {
                lex.bump_unchecked(3usize);
                goto1_x(lex)
            }
            _ => _error(lex),
        }
    }

This should make the whole codegen much easier to reason about and debug, particularly in issues like #279. The whole passing context around can potentially be scrapped.

For starters I'd say do not care about performance. When I first got the 0.10 rewrite working it was chugging at maybe 20% of what it ended up running at eventually, so feel free to skip any complexity you feel makes things difficult, we can always add optimizations later once tests are green. Even then I think I'd be fine if the final product of the rewrite is tiny bit slower if it means it's actually 100% correct.

Let me know if you need any more pointers, or if you have any questions about anything to get going.

Powodzenia! :1st_place_medal:

Evrey commented 1 year ago

Just a side note:

You mentioned other data structures like ropes that could be easily lexed after a rewrite. I just wanna tune in as someone who also has used Logos a few times to lex streams of binary data. For now this is largely undocumented magic, but i think it’s useful enough to advertise this in the future, and also design the Regex-replacing (yes, thanks!) DSL in a way that is binary-data-friendly.

I’ve used Logos more often to verify arbitrary user inputs than i have to perform actual lexing for some parser.

maciejhirsz commented 1 year ago

DSL in a way that is binary-data-friendly.

Well, the good thing about going that route is is that macros allow for a lot of neat syntax that's not easily done in regex. We could potentially abuse hexadecimal integer literals like 0xDEADBEEF to denote a byte slice &[0xDE, 0xAD, 0xBE, 0xEF] (on top of also using byte string literals like b"...").

Evrey commented 1 year ago

Perfect. =)

That DSL has pretty much all Rust tokens at its disposal. It could look like EBNF, ABNF, Rosie, what have you. It does rise the question, what »subpatterns« would look like in the new DSL, for i use subpatterns heavily. But as you mentioned, these are all questions for after the rewrite.

jeertmans commented 1 year ago

As mentioned in a previous email we exchanged @maciejhirsz, I’d love to give some help too :)

thanks for the detailed message above as well as the one in the release :)

uael commented 1 year ago

@maciejhirsz I have some old WIP in this fork https://github.com/uael/logos. I hacked bases for regex group and other experiments like sublexers or error reporting with expected set of chars. Feel free to do whatever you want with it!

maciejhirsz commented 1 year ago

@uael cheers, will have a look around.

Sublexing will hopefully become much easier once we can remove this cursed field in the rewrite. It should be trivial to do something like:

fn foo(lex: &mut Lexer<OuterToken>) {
    let lex: &mut Lexer<InnerToken> = lex.as();
}

Or even:

fn foo(lex: &mut Lexer<OuterToken>) {
    // Same as .next(), but but with a different token type
    let inner = lex.sublex::<InnerToken>();
}

We could even go as far and remove the Token generic from Lexer altogether, and instead make it turn into an iterator of a concrete token type via the IntoIterator trait.

There is Lexer::morph available currently, but it's a bit painful to use.

maciejhirsz commented 1 year ago

@agluszak have you started any work on this yet? I think I have a big enough time window to dedicate to it so I could at least start it.

jeertmans commented 1 year ago

On a side note @maciejhirsz, but related to this issue and to

If there is a living soul willing to undertake the big rewrite and help push this along faster, do reach out. I can at very least help explain what the hell current codegen is doing. I'd be also happy to give a person or two the ability to review and merge community PRs.

in v0.13, I'd be happy to help you maintain this project :-) (so being able to review issues and PRs)

maciejhirsz commented 1 year ago

@jeertmans thanks, everything you did so far was stellar, I've sent you an invite!

jeertmans commented 1 year ago

Nice, thank you, @maciejhirsz! I think finishing up #301 will improve future PRs, so I'll have a look at Clippy's warnings when I have time to make the CI happy.

Then, I would probably want to work on #311

Once that CI/doc step is done, I would be happy to help on the more technical parts (i.e., the macro), if you ever need help on that :-)

DasLixou commented 11 months ago

🙋 is there any issue / project you guys are working on where I could volunteer?

jeertmans commented 7 months ago

Hey @agluszak and @DasLixou, if you still want to help, please feel free to read through old issue and comment if they should be closed or how you could fix them. Also, v0.13 listed a few important ideas.

Do not forget to checkout latest version, i.e., v0.14!