Open agluszak opened 1 year ago
Thank you!
So I had a bit of a think about this, I think the best way to proceed that will futureproof the rewrite for inlining would be to define states using a macro_rules!
macro, something like:
fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
use logos::internal::{CallbackResult, LexerInternal};
enum State {
GoTo1,
GoTo2,
GoTo3,
}
macro_rules! goto {
(GoTo1, $lex:ident) => {/* ... */};
(GoTo2, $lex:ident) => {/* ... */};
(GoTo3, $lex:ident) => {/* ... */};
}
let mut state = State::Goto3;
// Assuming Goto3 is the initial state
goto!(GoTo3, lex);
loop {
match state {
State::GoTo1 => goto!(GoTo1, lex),
State::GoTo2 => goto!(GoTo2, lex),
State::GoTo3 => goto!(GoTo3, lex),
}
}
}
Just passing around the lex
ident is likely not enough, I think we'll need at least one, possibly two variables in stack declared at the top of the function body to account for:
usize
value for lookahead that would replace the need to duplicate states possible combination of at
and with
that current macro is doing, as in:
#[inline]
fn goto3_at1_with3<'s>(lex: &mut Lexer<'s>) {
match lex.read_at::<&[u8; 2usize]>(1usize) {
Some(b"-z") => {
lex.bump_unchecked(3usize);
goto1_x(lex)
}
_ => _error(lex),
}
}
This should make the whole codegen much easier to reason about and debug, particularly in issues like #279. The whole passing context around can potentially be scrapped.
For starters I'd say do not care about performance. When I first got the 0.10 rewrite working it was chugging at maybe 20% of what it ended up running at eventually, so feel free to skip any complexity you feel makes things difficult, we can always add optimizations later once tests are green. Even then I think I'd be fine if the final product of the rewrite is tiny bit slower if it means it's actually 100% correct.
Let me know if you need any more pointers, or if you have any questions about anything to get going.
Powodzenia! :1st_place_medal:
Just a side note:
You mentioned other data structures like ropes that could be easily lexed after a rewrite. I just wanna tune in as someone who also has used Logos a few times to lex streams of binary data. For now this is largely undocumented magic, but i think it’s useful enough to advertise this in the future, and also design the Regex-replacing (yes, thanks!) DSL in a way that is binary-data-friendly.
I’ve used Logos more often to verify arbitrary user inputs than i have to perform actual lexing for some parser.
DSL in a way that is binary-data-friendly.
Well, the good thing about going that route is is that macros allow for a lot of neat syntax that's not easily done in regex. We could potentially abuse hexadecimal integer literals like 0xDEADBEEF
to denote a byte slice &[0xDE, 0xAD, 0xBE, 0xEF]
(on top of also using byte string literals like b"..."
).
Perfect. =)
That DSL has pretty much all Rust tokens at its disposal. It could look like EBNF, ABNF, Rosie, what have you. It does rise the question, what »subpatterns« would look like in the new DSL, for i use subpatterns heavily. But as you mentioned, these are all questions for after the rewrite.
As mentioned in a previous email we exchanged @maciejhirsz, I’d love to give some help too :)
thanks for the detailed message above as well as the one in the release :)
@maciejhirsz I have some old WIP in this fork https://github.com/uael/logos. I hacked bases for regex group and other experiments like sublexers or error reporting with expected set of chars. Feel free to do whatever you want with it!
@uael cheers, will have a look around.
Sublexing will hopefully become much easier once we can remove this cursed field in the rewrite. It should be trivial to do something like:
fn foo(lex: &mut Lexer<OuterToken>) {
let lex: &mut Lexer<InnerToken> = lex.as();
}
Or even:
fn foo(lex: &mut Lexer<OuterToken>) {
// Same as .next(), but but with a different token type
let inner = lex.sublex::<InnerToken>();
}
We could even go as far and remove the Token
generic from Lexer
altogether, and instead make it turn into an iterator of a concrete token type via the IntoIterator
trait.
There is Lexer::morph
available currently, but it's a bit painful to use.
@agluszak have you started any work on this yet? I think I have a big enough time window to dedicate to it so I could at least start it.
On a side note @maciejhirsz, but related to this issue and to
If there is a living soul willing to undertake the big rewrite and help push this along faster, do reach out. I can at very least help explain what the hell current codegen is doing. I'd be also happy to give a person or two the ability to review and merge community PRs.
in v0.13, I'd be happy to help you maintain this project :-) (so being able to review issues and PRs)
@jeertmans thanks, everything you did so far was stellar, I've sent you an invite!
Nice, thank you, @maciejhirsz! I think finishing up #301 will improve future PRs, so I'll have a look at Clippy's warnings when I have time to make the CI happy.
Then, I would probably want to work on #311
Once that CI/doc step is done, I would be happy to help on the more technical parts (i.e., the macro), if you ever need help on that :-)
🙋 is there any issue / project you guys are working on where I could volunteer?
Hey @agluszak and @DasLixou, if you still want to help, please feel free to read through old issue and comment if they should be closed or how you could fix them. Also, v0.13 listed a few important ideas.
Do not forget to checkout latest version, i.e., v0.14!
I'd like to help with the rewrite mentioned here: https://github.com/maciejhirsz/logos/releases/tag/v0.13 :) I already have some experience with logos' internals, at least the "frontend" part.