skvadrik / re2c

Lexer generator for C, C++, Go and Rust.
https://re2c.org

What about define standard name for some states? #389

Closed krishna116 closed 2 years ago

krishna116 commented 2 years ago

This is code I sketched to get a list of tokens from a string using re-entrant mode. I am using it as an example.

//compile-command: re2c test4.lex --storable-state -o test4.c
#include <stdio.h>
#include <string.h>

void print(const char* str, int size);

typedef enum
{
    LexerError          = -2,
    LexerNeedMoreInput  = -1,
    LexerOk             = 0,
    LexerTokenNumber    = 1,
    LexerTokenChars     = 2,
    LexerTokenSpaces    = 3
}LexerReturnCode;

typedef struct
{
    const char* str;
    const char* limit;
    const char* maker;
    int state;
}BufferInfo;

int lex(BufferInfo* bufInfo, int* yyleng, const char** yytext) 
{
    char yych;

    for(;;)
    {
        *yytext = bufInfo->str;
    /*!getstate:re2c*/
    /*!re2c
        re2c:api:style          = free-form;
        re2c:define:YYCTYPE     = "char";
        re2c:define:YYCURSOR    = "bufInfo->str";
        re2c:define:YYMARKER    = "bufInfo->maker";
        re2c:define:YYLIMIT     = "bufInfo->limit";
        re2c:define:YYGETSTATE  = "bufInfo->state";
        re2c:define:YYSETSTATE  = "bufInfo->state = 0;";
        re2c:define:YYFILL      = "return LexerNeedMoreInput;";

        number = [0-9]+;
        chars  = [a-zA-Z]+;
        spaces = [ \t]+;

        number { 
                    *yyleng = bufInfo->str - *yytext;
                    return LexerTokenNumber;
               }
        chars  { 
                    *yyleng = bufInfo->str - *yytext;
                    return LexerTokenChars;
               }
        spaces {
                    *yyleng = bufInfo->str - *yytext;
                    return LexerTokenSpaces;
               }
        [\x00] {
                    return LexerOk;
               }
        *      { 
                    return LexerError; 
               }
    */
    }

    return LexerOk;
}

int main()
{
    const char buffer[] = "1234 567 abc 89 def";
    BufferInfo bufInfo;
    bufInfo.str = buffer;
    //bufInfo.limit = buffer + strlen(buffer);
    bufInfo.limit = buffer + strlen(buffer) + 1;
    bufInfo.state = -1;

    int ret = 0;
    const char* yytext = NULL;
    int yyleng = 0;
    while((ret = lex(&bufInfo, &yyleng, &yytext)) > 0)
    {
        printf("token-id = %d, ", ret);  // print token-id;
        print(yytext, yyleng);          // print token-str;
    }

    printf("final return %d", ret);

    return 0;
}

void print(const char* str, int size)
{
    if(str == NULL) return;

    printf("token = [");
    for(int i = 0; i < size; i++) printf("%c", *(str+i));
    printf("]\n");
}

The output of the above code is shown in the attached image (code-sample).

So my suggestions are:

1. In the main function, the initial value is bufInfo.state = -1. It could have a standard name: #define RE2C_STATE_INIT -1

2. In the lex function, I set "bufInfo->state = 0" to restore the state, because I guess 0 is always the beginning/start/nothing-matched state. If so, it could be: #define RE2C_STATE_BEGIN 0

3. The code may not be good; if so, I am always happy to hear advice.

Thank you.

skvadrik commented 2 years ago

You should use @@ instead of 0 in YYSETSTATE:

    re2c:define:YYSETSTATE  = "bufInfo->state = @@;";

The point here is that only re2c knows what the correct state is, but not the user. Each time re2c generates YYSETSTATE, it substitutes @@ with the correct state. The actual state is different for different YYSETSTATE invocations in the lexer.

As for -1, it is always used as the default state in re2c. Maybe we should say this more explicitly in the docs.

See this example if you haven't already, and the description of the re2c:api:sigil configuration.
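The substitution described above can be pictured with a hand-written sketch. This is not actual re2c output: the `Lexer` struct, the `match_ab` function, and the state numbers 0 and 1 are hypothetical stand-ins. In real generated code, re2c chooses the numbers and puts them in place of @@. The sketch matches the input "ab" delivered in pieces and has two suspension points, each of which must store a different state so that the dispatch at the top can resume in the right place.

```c
enum { NEED_MORE = -1, NO_MATCH = 0, MATCH = 1 };

typedef struct {
    const char *cur;   /* plays the role of YYCURSOR */
    const char *lim;   /* plays the role of YYLIMIT */
    int state;         /* stored state; -1 means "not interrupted" */
} Lexer;

/* Matches the two-character input "ab", possibly delivered in pieces. */
int match_ab(Lexer *lx)
{
    /* Dispatch on the stored state, in the spirit of getstate:re2c. */
    switch (lx->state) {
    case 0: goto resume0;
    case 1: goto resume1;
    default: break;              /* -1: start from the beginning */
    }

    lx->state = 0;               /* like YYSETSTATE with @@ replaced by 0 */
resume0:
    if (lx->cur >= lx->lim) return NEED_MORE;
    if (*lx->cur++ != 'a') { lx->state = -1; return NO_MATCH; }

    lx->state = 1;               /* like YYSETSTATE with @@ replaced by 1 */
resume1:
    if (lx->cur >= lx->lim) return NEED_MORE;
    if (*lx->cur++ != 'b') { lx->state = -1; return NO_MATCH; }

    lx->state = -1;              /* final state: ready for the next lexeme */
    return MATCH;
}
```

If both suspension points stored the same constant (for example 0), a lexer interrupted after consuming 'a' would resume at the first point and expect 'a' again, which is why a fixed value cannot be correct in general.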

krishna116 commented 2 years ago

But if I use this: re2c:define:YYSETSTATE = "bufInfo->state = @@;"; the result is an infinite loop.

skvadrik commented 2 years ago

Can you elaborate on where you have an infinite loop? Ideally provide an example that shows a hanging program.

I did test your example with @@ and it finished normally. In fact, there was no difference in the output because in your original example the whole input fits into the buffer, so there is no need for refilling it (so YYSETSTATE didn't matter).

krishna116 commented 2 years ago

I added a break in the while loop; this is the output: test4-loop

My original code did indeed use: re2c:define:YYSETSTATE = "bufInfo->state = @@;"; so I'm confused when you say there is no problem.

krishna116 commented 2 years ago

I have attached the re2c-generated code here; maybe you can diff the two versions. src.zip

skvadrik commented 2 years ago

Oh, I know what the problem is: I was testing with the most recent re2c from the git master branch, which has this commit: https://github.com/skvadrik/re2c/commit/2c0dd72332c2d23270179d8c75a7ce7f5ae02240. If you read the commit description, it explains why it is necessary to generate YYSETSTATE(-1) in final states. Previous re2c versions relied on the user to do this.

If you cannot update re2c, then you can manually add bufInfo->state = -1; in final states before return.

Note that the way you organize the lexer loop is a bit unusual: you return from the lex function to main from every final state, only to reiterate and call the lex function again. It is more convenient to put the lexer loop in the lex function (make it bypass the getstate:re2c block as shown in the example), and let the outer loop in main handle the exceptional situations when the lexer needs more input, or when it encounters an error, or when it terminates successfully.

If you reorganize the lexer loop in lex to bypass getstate:re2c, you won't need YYSETSTATE(-1) in final states and the lexer will be faster, as it will bypass the initial state switch in the main loop. This is precisely the reason to have a separate getstate:re2c block.
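The loop organization suggested here can be sketched by hand as follows. This is a plain C stand-in, not re2c output and not the attached example: the `Lexer` fields, the `TokenFn` callback, and the `collect` helper are all hypothetical names, and the sketch assumes the input lives in one contiguous buffer whose visible end (`lim`) grows as more data arrives. The point is only the shape: `lex` keeps looping over lexemes and returns to the caller solely for the three exceptional situations.

```c
#include <stddef.h>

enum { LEX_END = 0, LEX_NEED_MORE = -1, LEX_ERROR = -2 };
enum { TOK_NUMBER = 1, TOK_CHARS = 2, TOK_SPACES = 3 };

typedef struct {
    const char *cur;   /* next unread byte */
    const char *lim;   /* end of the bytes currently available */
    const char *tok;   /* start of the lexeme in progress */
    int kind;          /* 0 = between lexemes, else the class being scanned */
    int eof;           /* set by the caller once no more input will arrive */
} Lexer;

typedef void (*TokenFn)(int kind, const char *text, size_t len, void *ctx);

static int classify(char c)
{
    if (c >= '0' && c <= '9') return TOK_NUMBER;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return TOK_CHARS;
    if (c == ' ' || c == '\t') return TOK_SPACES;
    return 0;
}

/* Inner loop: emits every complete lexeme, returns only in exceptional cases. */
int lex(Lexer *lx, TokenFn emit, void *ctx)
{
    for (;;) {
        if (lx->kind == 0) {                      /* start a new lexeme */
            if (lx->cur == lx->lim)
                return lx->eof ? LEX_END : LEX_NEED_MORE;
            lx->tok = lx->cur;
            lx->kind = classify(*lx->cur);
            if (lx->kind == 0) return LEX_ERROR;
            lx->cur++;
        }
        while (lx->cur < lx->lim && classify(*lx->cur) == lx->kind)
            lx->cur++;
        if (lx->cur == lx->lim && !lx->eof)
            return LEX_NEED_MORE;                 /* lexeme may continue later */
        emit(lx->kind, lx->tok, (size_t)(lx->cur - lx->tok), ctx);
        lx->kind = 0;                             /* back to the initial state */
    }
}

/* Example consumer: records the kind and length of each token. */
typedef struct { int kind[8]; size_t len[8]; int n; } Collected;

void collect(int kind, const char *text, size_t len, void *ctx)
{
    Collected *out = ctx;
    (void)text;
    if (out->n < 8) {
        out->kind[out->n] = kind;
        out->len[out->n] = len;
        out->n++;
    }
}
```

The outer loop in main would then call `lex` repeatedly, extending `lim` (and eventually setting `eof`) whenever it sees `LEX_NEED_MORE`, and stopping on `LEX_END` or `LEX_ERROR`.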

skvadrik commented 2 years ago

I attach your example reworked as I suggested: test4.lex.txt. The changes are:

This example works with older re2c versions as well.

krishna116 commented 2 years ago

I have read the commit 2c0dd72. As I understand it, the behavior is the same as before, except that -1 is now used not only as the initial state but also as the beginning state. So does "bufInfo->state = @@;" set either the last interrupt state or the initial state? Maybe it would be better to let the user state exactly which one they want, since the "@@" symbol/placeholder carries no semantic meaning. It seems better to introduce standard internal state names such as @{init/begin} and @{interrupt}.

Thank you for the advice. If I need to parse data packets, the code you provided is good. If I just need to split the input into tokens and send them to a grammar parser, the code becomes more complicated than with flex. I always wonder what the better design, or the most understandable and concise way to do it, would be.

I have downloaded and compiled the latest re2c source code, and with the latest re2c.exe everything is OK: no infinite loop occurs with my code. In fact, I'm learning re2c and writing notes/code/tutorials for other people/beginners.

Thank you.

krishna116 commented 2 years ago

After some testing, I found that the "--storable-state" option, the interrupt state "@@", and YYFILL should be used together with all of these:

re2c:define:YYMARKER    = "bufInfo->maker";
re2c:define:YYLIMIT     = "bufInfo->limit";
re2c:define:YYGETSTATE  = "bufInfo->state";
re2c:define:YYSETSTATE  = "bufInfo->state = @@;";
re2c:define:YYFILL      = "return LexerNeedMoreInput;";

Now I understand: this setup is for lexing a stream that arrives as non-contiguous buffer blocks, where the lexer is re-entered and fed a new block many times. If the input is a single contiguous buffer processed in one shot, I don't need any of that; I only need to take care of YYCURSOR. So I attached a new, more concise example: test4-1.zip.
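For the single-buffer case, the idea can be sketched by hand as follows. This is neither the attached test4-1 example nor re2c output; `next_token` and the `TOK_*` names are hypothetical. It assumes the whole input is one NUL-terminated string, so the terminating NUL acts as a sentinel and the only thing carried between calls is the cursor.

```c
#include <stddef.h>

enum { TOK_ERROR = -2, TOK_END = 0, TOK_NUMBER = 1, TOK_CHARS = 2, TOK_SPACES = 3 };

static int is_digit(char c) { return c >= '0' && c <= '9'; }
static int is_alpha(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }
static int is_blank(char c) { return c == ' ' || c == '\t'; }

/*
 * Scans one token starting at *cursor and advances *cursor past it.
 * No limit check is needed: every scanning loop stops at the NUL.
 */
int next_token(const char **cursor, const char **text, size_t *len)
{
    const char *p = *cursor;
    int kind;

    *text = p;
    *len = 0;
    if (*p == '\0') return TOK_END;

    if (is_digit(*p)) {
        kind = TOK_NUMBER;
        while (is_digit(*p)) p++;
    } else if (is_alpha(*p)) {
        kind = TOK_CHARS;
        while (is_alpha(*p)) p++;
    } else if (is_blank(*p)) {
        kind = TOK_SPACES;
        while (is_blank(*p)) p++;
    } else {
        return TOK_ERROR;        /* cursor is left on the offending byte */
    }

    *len = (size_t)(p - *text);
    *cursor = p;                 /* the cursor is the only state kept */
    return kind;
}
```

Returning one token per call, as here, fits a parser that pulls tokens on demand; no stored state, marker, or limit has to survive between calls.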

skvadrik commented 2 years ago

Right, you don't need YYFILL unless your input is too large and you need buffering. You can read more about how YYFILL works and when you need it in the manual: https://re2c.org/manual/manual_c.html#buffer-refilling. I also recommend reading the section about end-of-input handling: https://re2c.org/manual/manual_c.html#handling-the-end-of-input.

The --storable-state option (described here: https://re2c.org/manual/manual_c.html#storable-state) is more advanced than YYFILL. It is only needed in cases when the lexer may be interrupted in the middle and resumed later (for example because it has to wait on a socket for more input to appear). The -1 state means that the lexer was not interrupted. That's why state -1 is used both as initial and in the final states, when the lexer has successfully processed a full lexeme and is ready to start processing the next one from the initial state. Other possible values of @@ are various interrupt states.

The simplified lexer test4-1.zip looks good.

> in fact I'm learning re2c and writing notes/code/tutorials for other people/beginners.

That's awesome, thank you for the effort!

krishna116 commented 2 years ago

OK, I will read the other parts of the manual when I have time. Maybe it would be better to ask questions on the IRC channel. Thank you.

skvadrik commented 2 years ago

> maybe it would be better to ask questions on the IRC channel

You are welcome on IRC (it seems that there is a significant timezone difference, I am in BST timezone).