osa1 / lexgen

A fully-featured lexer generator, implemented as a proc macro
MIT License
63 stars 7 forks source link

Implement reset_match method #18

Closed sahandevs closed 2 years ago

sahandevs commented 2 years ago

now I think about it, maybe switch_and_reset is a better name for it(?)

osa1 commented 2 years ago

Common up ... So that they will call the same code generator for "continue" semantic action.

I've done this as a part of another PR, which is merged to main.

sahandevs commented 2 years ago

@osa1 Thank you for reviewing my PR. sorry I'm answering this late, I was trying out your idea by implementing the ModSecurity's lexer (here)

I'll change my lexer based on your changes in main branch later (to test them) and close this PR if it's ok with you.

osa1 commented 2 years ago

No worries @sahandevs. Feel free to ping if you work on or use lexgen and need anything.

sahandevs commented 2 years ago

after re-reading your comment and #12, you are right this PR doesn't fix that issue, but this PR fixes my own issue. I updated the PR and added a test for it.

This is my specific problem if you're curious. it doesn't add anything new:

In my lexer (here I need to reset the match in rules that aren't Init. I took an approach to write my lexer by creating many mini-lexers (rules) to handle special cases (modsec is very weird! whitespace, escaping and even name of a variable affects the lexing (XML:/<xpath>, OTHERS:/<regex pattern>/. for XML we need one forward slash but for others we need two.)

one of the reasons that I need this feature is for string formats in modsec. something like this syntax in js:

const a = `test ${1 +1}`

in modsec is like this:

# var form:
VAR:/test %{a} test/ # escapes with `\/`
# operator form:
"@eq<SINGLE_SPACE> test %{a}" # escapes with `\"`
# action form with `'`
"setvar:'a=test %{a}' # escapes with `\"` and `\'`
# action form without `'`
"setvar:'a=test %{a}' # escapes with `\"` and `\,`

as you can see escaping rule differs in every context so before entering the rule FormatString we need to say what is escaping chars and which rule we should follow after these chars appear and the user didn't escape them (here). for example in setvar:'a=test %{a}' after we see a , we should follow rule ActionsInner and if we see a " we should follow rule Init

I implemented this escaping logic here by waiting for a \\ and peek the next char to see if it's in the escape list (if not we just throw an error) then we enable a flag and forget about the \\ match (by calling the switch_and_reset_match). after that, I have a _ pattern here to decide whether to exit the rule or just return a token based on the flag.

albeit this is not the only use case and as you can see in the full code I used it almost everywhere!

osa1 commented 2 years ago

OK, I think providing a way to reset the match makes sense. Instead of having two switch methods, I think I would find it more intuitive to have a reset_match method that I can call before switch to clear the current match. I could also use it to reset the match without switching. I could use switch_and_reset_match with the current rule set to reset the match and continue, but it seems less intuitive to me.

Any opinions on this?

sahandevs commented 2 years ago

I also think something like reset_match is better, but I don't think we have access to the current_match_start in impl<'lexer, 'input> #handle_type_name<'lexer, 'input> {....

osa1 commented 2 years ago

Good point. I was thinking about eliminating the handle structs completely and passing a reference to the lexer struct to semantic actions to simplify things. This seems like another motivation to do that refactoring. If we do that then all of the lexer state will be available in methods like switch or reset_match.

Let me try to do that and I'll ping you.

osa1 commented 2 years ago

@sahandevs if you merge main (or rebase) you should be able to implement reset_match method now.

Btw, feel free to merge main instead of rebasing and squashing your branch for every update. I'm using GitHub's "squash and merge" feature so it will be merged as one commit.

sahandevs commented 2 years ago

@osa1 Done! It's way better now!

sahandevs commented 2 years ago

We need a second rule because the Init rule resets match on every switch or continue_ call. If I do this:

rule Init {
            'c' => |lexer| {
                if lexer.state().enable_reset_match {
                    lexer.reset_match();
                }
                lexer.continue_()
            },

            "!" => |lexer| {
                lexer.reset_match();
                let enable_reset_match = &mut lexer.state().enable_reset_match;
                *enable_reset_match = !*enable_reset_match;
                lexer.continue_()
            },

            ['d' 'e']+ => |lexer| {
                let s = lexer.match_();
                lexer.return_(s)
            },

            $ = "<>",
}

this fails:

let mut lexer = Lexer::new("ccdeed!ccdeed");
assert_eq!(next(&mut lexer), Some(Ok("ccdeed"))); // ❌ left == "deed"
assert_eq!(next(&mut lexer), Some(Ok("deed"))); // ✔️
assert_eq!(next(&mut lexer), Some(Ok("<>"))); // ✔️
osa1 commented 2 years ago

Makes sense, thanks for the explanation and the PR!