skvadrik / re2c

Lexer generator for C, C++, Go and Rust.
https://re2c.org

First char always be skipped. #480

Closed cathaysia closed 2 months ago

cathaysia commented 2 months ago

I am using re2rust to parse a string. However, the first char is always skipped. Does anyone know what is happening here?

Here is my code:

pub fn match_str(pat: &[u8]) -> Result<bool, ()> {
    let mut cursor = 0usize;

    loop {
        /*!re2c
            re2c:define:YYCTYPE     = u8;
            re2c:define:YYPEEK      = "pat[cursor]";
            re2c:define:YYSKIP      = "cursor += 1;";
            re2c:yyfill:enable      = 0;
            * {
                println!("{}", pat[cursor] as char);
            }
        */
    }
    Ok(false)
}
skvadrik commented 2 months ago

This is not surprising, as the only rule you have is the default rule *, which always consumes a single code unit, as documented (e.g. here in the "regular expressions" section).

If you add other rules, they will be matched with higher priority than *, and the first code unit won't be skipped.

If you need a rule that matches empty input, use "" (but then you may have an eternal loop if no other rules match, as the lexer will not make any progress on each step).
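The difference between the two rules can be illustrated without re2c at all. The following is a minimal hand-written sketch, not generated code: the function `run_lexer_loop` and its parameters `consume` and `max_steps` are hypothetical names introduced here. It models a lexer whose only rule consumes either one code unit per match (like `*`) or none (like `""`), with a step limit so the non-progressing case terminates.

```rust
/// Models a lexer loop with a single rule.
/// `consume` is the number of code units the rule consumes per match:
/// 1 stands in for the default rule `*`, 0 stands in for the empty rule `""`.
/// Returns Some(steps) if the end of input was reached, or None if the
/// loop was cut off after `max_steps` matches without reaching the end.
fn run_lexer_loop(pat: &[u8], consume: usize, max_steps: usize) -> Option<usize> {
    let mut cursor = 0usize;
    let mut steps = 0usize;
    while cursor < pat.len() {
        if steps == max_steps {
            return None; // no progress: a real lexer would loop forever here
        }
        cursor += consume; // the YYSKIP calls performed while matching the rule
        steps += 1;
    }
    Some(steps)
}

fn main() {
    // `*` consumes one code unit per match, so four matches reach the end.
    println!("{:?}", run_lexer_loop(b"abcd", 1, 100));
    // `""` consumes nothing, so the cursor never moves.
    println!("{:?}", run_lexer_loop(b"abcd", 0, 100));
}
```

In a real lexer the step limit does not exist, which is why an empty rule needs some other rule, or an explicit exit in its action, to guarantee progress.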

Have you looked at the examples?

cathaysia commented 2 months ago

I tried adding a new rule:

        /*!re2c
            re2c:define:YYCTYPE     = u8;
            re2c:define:YYPEEK      = "pat[cursor]";
            re2c:define:YYSKIP      = "cursor += 1;";
            re2c:yyfill:enable      = 0;

            [a-zA-Z]{
                state = WildChar::Char(pat[cursor]);
                println!("{}", pat[cursor] as char);
                break;
            }
            * {
                state = WildChar::Char(pat[cursor]);
                println!("any: {}", pat[cursor] as char);
                break;
            }
        */

But the result is the same.

Have you looked at the examples?

Yes, I need these examples :). I could not find them on the official website. Thanks.

cathaysia commented 2 months ago

Have you looked at the examples?

Sorry, I misunderstood what you meant. I have read most of http://re2c.org/manual/manual_rust.html and the http://re2c.org/examples/rust/real_world/example_c.html example.

Most of my lexer is done. The only remaining problem is that the first char is skipped. I'm not too familiar with how re2c works, and I tried using re2c instead of re2rust. But the results are similar. Therefore I suspect that some part I skipped while reading the documentation is causing this problem.

skvadrik commented 2 months ago

Most of my lexer is done. The only remaining problem is that the first char is skipped.

Can you post a minimal complete example (one that I can compile and run), together with the output you expect and the output you actually get? Then I may be able to help. Otherwise I don't understand why this is a problem for you.

I'm not too familiar with how re2c works, and I tried using re2c instead of re2rust. But the results are similar.

Right, re2c and re2rust should behave identically in this regard.

Therefore I suspect that some part I skipped while reading the documentation is causing this problem.

This behaviour is certainly expected and documented, and it should not be a problem. However, if you need a default rule that does not consume any input, use empty string "" (but beware that this may introduce eternal loops into your program).

cathaysia commented 2 months ago

Here is my code, a sqlmatch function:

#[derive(Debug, Clone)]
enum WildChar {
    RepeatAny, // %
    Any, // _
    Choose(bool, usize, usize), // [a-zA-Z] or [^a-z] or [!a-z]
    Range(u8, u8), // a-z or A-Z
    Char(u8), // any char
    End, // EOF
}

/// # Errors
/// return error if pat is invalid.
pub fn sqlmatch(pat: &[u8], text: &[u8]) -> Result<bool, ()> {
    let mut cursor = 0usize;
    let mut text_cursor = 0usize;

    loop {
        let mut state = WildChar::End;
        let mut marker = cursor;
        let (mut t1, t2);
        /*!stags:re2c format = 'let mut @@{tag} = 0;'; */
        /*!re2c
            re2c:define:YYCTYPE     = u8;
            re2c:define:YYPEEK      = "match cursor < pat.len() { true => pat[cursor], false => break}";
            re2c:define:YYSKIP      = "cursor += 1;";
            re2c:define:YYBACKUP    = "marker = cursor;";
            re2c:define:YYRESTORE   = "cursor = marker;";
            re2c:define:YYSTAGP     = "@@{tag} = cursor;";
            re2c:define:YYSHIFTSTAG = "@@{tag} -= -@@{shift}isize as usize;";
            re2c:tags               = 1;
            re2c:yyfill:enable      = 0;
            re2c:define:YYLESSTHAN  = "cursor >= pat.len()";
            re2c:eof = 0;

            neg_choose = "["[\\^!][^\]]*"]";
            choose = "["[^\]]*"]";
            alpha = [a-zA-Z];

            $ {
                state = WildChar::End;
                break;
            }
            "%" {
                state = WildChar::RepeatAny;
                break;
            }
            "_" {
                state = WildChar::Any;
                break;
            }
            @t1 neg_choose @t2 {
                state = WildChar::Choose(true, t1 + 2, t2 - 1);
                break;
            }
            @t1 choose @t2 {
                state = WildChar::Choose(false, t1 + 1, t2 - 1);
                break;
            }
            @t1 alpha "-" alpha @t2 {
                state = WildChar::Range(pat[t1], pat[t2 - 1]);
                break;
            }
            alpha {
                println!("{}", pat[cursor] as char);
                state = WildChar::Char(pat[cursor]);
                break;
            }
            * {
                unreachable!();
            }
        */

        match state {
            WildChar::End => {
                break;
            }
            _=> {}
        }
    }

    Ok(false)
}

fn main() {
    sqlmatch(b"abcd", b"abccc").unwrap();
}

I expect sqlmatch to print:

a
b
c
d

Currently it prints:

b
c
d

I deleted some logic that is not related to re2c.

Specifically, my purpose is as follows:

sqlmatch accepts two parameters: pattern and text. I use re2c to identify tokens in pattern in a loop, and then match text according to tokens. If text_cursor reaches the end when matching WildChar::End, the match is considered successful.
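The matching logic itself was removed from the posted code, so the following is only a sketch of how the described approach could look, under stated assumptions. The names `Tok`, `tokenize`, `matches` and `sqlmatch_sketch` are hypothetical, only the `%`, `_` and plain-character tokens are handled, and a trivial hand-written tokenizer stands in for the re2c lexer. The end-of-pattern case plays the role of WildChar::End: the match succeeds only if the text is exhausted at that point.

```rust
// Hypothetical token type: a subset of the WildChar enum from the thread.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    RepeatAny, // %
    Any,       // _
    Char(u8),  // any other byte
}

// Hand-written tokenizer for this subset (stands in for the re2c lexer).
fn tokenize(pat: &[u8]) -> Vec<Tok> {
    pat.iter()
        .map(|&b| match b {
            b'%' => Tok::RepeatAny,
            b'_' => Tok::Any,
            other => Tok::Char(other),
        })
        .collect()
}

// Recursive matcher: succeeds only if tokens and text run out together.
fn matches(tokens: &[Tok], text: &[u8]) -> bool {
    match tokens.split_first() {
        // End of pattern: the text must be fully consumed as well.
        None => text.is_empty(),
        // `%` matches any number of bytes, so try every possible split.
        Some((Tok::RepeatAny, rest)) => {
            (0..=text.len()).any(|skip| matches(rest, &text[skip..]))
        }
        // `_` matches exactly one byte.
        Some((Tok::Any, rest)) => !text.is_empty() && matches(rest, &text[1..]),
        // A plain character must equal the next byte of the text.
        Some((Tok::Char(c), rest)) => text.first() == Some(c) && matches(rest, &text[1..]),
    }
}

fn sqlmatch_sketch(pat: &[u8], text: &[u8]) -> bool {
    matches(&tokenize(pat), text)
}

fn main() {
    println!("{}", sqlmatch_sketch(b"abcd", b"abccc")); // pattern and text differ
    println!("{}", sqlmatch_sketch(b"ab%", b"abccc")); // `%` absorbs the tail
}
```

The recursive handling of `%` is the simplest correct choice for a sketch; it can be slow on patterns with many `%` tokens, and a production matcher would likely use a different strategy.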

skvadrik commented 2 months ago

Ah, I see it now. Cursor points at the next character past the last matched one, so you should change your code like this:

            alpha {
                println!("{}", pat[cursor - 1] as char);
                state = WildChar::Char(pat[cursor - 1]);
                break;
            }
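The same cursor convention generalizes to tokens longer than one character. The sketch below is hand-written and does not use re2c; `lexemes` and `start` are hypothetical names. It records the cursor position before each match, so that once the match is done the token text is `pat[start..cursor]`, and for a one-byte token that slice is exactly `pat[cursor - 1]`.

```rust
/// Splits `pat` into runs of ASCII letters and single non-letter bytes.
/// After each match the cursor points one past the last matched byte,
/// so the token text is the half-open range `start..cursor`.
fn lexemes(pat: &[u8]) -> Vec<&[u8]> {
    let mut cursor = 0usize;
    let mut out = Vec::new();
    while cursor < pat.len() {
        let start = cursor; // remember where this token begins
        if pat[cursor].is_ascii_alphabetic() {
            // Consume the whole run of letters.
            while cursor < pat.len() && pat[cursor].is_ascii_alphabetic() {
                cursor += 1;
            }
        } else {
            cursor += 1; // any other byte is a one-byte token
        }
        // The "action" runs here: cursor is already past the match.
        out.push(&pat[start..cursor]);
    }
    out
}

fn main() {
    for lexeme in lexemes(b"ab%cd") {
        println!("{}", String::from_utf8_lossy(lexeme));
    }
}
```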

Also, change YYPEEK definition to return zero instead of breaking out of the loop, so that the lexer has a chance to correctly match the end of input:

            re2c:define:YYPEEK = "match cursor < pat.len() { true => pat[cursor], false => 0}";

With these modifications I see your expected output.
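To see why returning a sentinel helps, here is a hand-written sketch of the idea, not the code re2c generates. The names `peek`, `tokens` and `Tok` are hypothetical. The peek function returns 0 past the end of input, and the loop treats a 0 as end of input only when the cursor really is at the end, which is roughly the role that `re2c:eof = 0` and YYLESSTHAN play in the configuration above. A likely reason the original `break` lost the last character is that a lexer often has to peek one byte ahead before it can finish a match (for example, to check whether "-" follows a letter), so a `break` inside the peek can abandon a token whose action has not run yet.

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Char(u8),
    End,
}

/// Returns the byte at `cursor`, or the sentinel 0 past the end of input.
fn peek(pat: &[u8], cursor: usize) -> u8 {
    if cursor < pat.len() { pat[cursor] } else { 0 }
}

/// Emits one Char token per byte, then End once the input is exhausted.
fn tokens(pat: &[u8]) -> Vec<Tok> {
    let mut cursor = 0usize;
    let mut out = Vec::new();
    loop {
        let yych = peek(pat, cursor);
        // A 0 byte means end of input only if the bounds check agrees;
        // a genuine 0 inside the input is treated as an ordinary byte.
        if yych == 0 && cursor >= pat.len() {
            out.push(Tok::End);
            break;
        }
        cursor += 1; // consume the byte
        out.push(Tok::Char(pat[cursor - 1])); // cursor is past the match
    }
    out
}

fn main() {
    println!("{:?}", tokens(b"abcd"));
}
```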

cathaysia commented 2 months ago

yes. it works!!!

Thank you very much for your help. :)