skvadrik / re2c

Lexer generator for C, C++, D, Go, Haskell, Java, JS, OCaml, Python, Rust, V and Zig.
https://re2c.org
Other
1.11k stars 173 forks source link

Matching keywords with EOF in sentinel mode: seeking efficient solution #491

Closed hannesd closed 3 days ago

hannesd commented 2 months ago

Hello,

I've been using re2c and have encountered a challenge when trying to match keywords followed by either a separator (whitespace or special symbols) or EOF.

I'm using re2c in sentinel with bounds check mode, which means I can't simply match the sentinel character as EOF, since in this mode, matching the sentinel indicates an actual character in the input rather than EOF.

I've tried a few workarounds:

  1. Using a separate condition to enter after matching a keyword, where I can use the $ rule to check for EOF. However, this approach feels cumbersome and may impact performance.

  2. Using multiple rules for each keyword:

    "true" / SEP_AFTER { return Bool(); }
    "true" / ([^] \ SEP_AFTER) { return Error(); }
    "true" { return Bool(); }

    This leads to numerous rules for all keywords and feels inefficient.

  3. Matching any sequence of keyword characters and then checking which keyword it is. This essentially duplicates the lexer's work.

I'm wondering if there's a more elegant solution I'm overlooking, or if there's currently no way to express something like:

"true" / (SEP_AFTER | $) { return Bool(); }

Any insights or suggestions would be appreciated. Thank you for your time and for developing this tool.

skvadrik commented 2 months ago

Can you try something like "true" / SEP_AFTER? { return Bool(); }?

If it doesn't help, can you post a complete example, so I can have a closer look?

hannesd commented 2 months ago

Thanks for the quick reply! Unfortunately, your suggestion does not yield the intended result. Here's an example:

// re2c example.re2c.cpp -o example.cpp && g++ example.cpp -o example && ./example
#include <cassert>
#include <cstdio>
#include <cstring>
#include <iostream>

#define BUFSIZE 4095

struct Input {
    FILE *file;
    char buf[BUFSIZE + 1], *lim, *cur, *mar, *tok, *ctxmar; // +1 for sentinel
    bool eof;
};

static int fill(Input &in) {
    if (in.eof) return 1;

    const size_t shift = in.tok - in.buf;
    const size_t used = in.lim - in.tok;

    // Error: lexeme too long. In real life could reallocate a larger buffer.
    if (shift < 1) return 2;

    // Shift buffer contents (discard everything up to the current token).
    memmove(in.buf, in.tok, used);
    in.lim -= shift;
    in.cur -= shift;
    in.mar -= shift;
    in.ctxmar -= shift;
    in.tok -= shift;

    // Fill free space at the end of buffer with new data from file.
    in.lim += fread(in.lim, 1, BUFSIZE - used, in.file);
    in.lim[0] = 255;
    in.eof = in.lim < in.buf + BUFSIZE;
    return 0;
}

static void lex(Input &in) {
    for (;;) {
        in.tok = in.cur;
    /*!re2c
        re2c:api:style = free-form;
        re2c:define:YYCTYPE     = char;
        re2c:define:YYCURSOR    = in.cur;
        re2c:define:YYMARKER    = in.mar;
        re2c:define:YYCTXMARKER = in.ctxmar;
        re2c:define:YYLIMIT     = in.lim;
        re2c:define:YYFILL      = "fill(in) == 0";
        re2c:eof = 255;

        SEP = [ \t\n\r] ;

        //"keyword" / SEP? { std::cout << "  keyword(SEP)" << std::endl; continue; }

        "keyword" / SEP { std::cout << "  keyword(SEP)" << std::endl; continue; }
        "keyword" ([^] \ SEP) { std::cout << "  ERROR" << std::endl; continue; }
        "keyword" { std::cout << "  keyword(EOF)" << std::endl; continue; }

        SEP { std::cout << "  SEP" << std::endl; continue; }
        * { std::cout << "  discard: '" << in.cur[-1] << "'" << std::endl; continue; }
        $ { std::cout << "  EOF" << std::endl; return; }
    */
    }
}

static void demo(const char *content) {
    std::cout << "Input: \"" << content << "\":" << std::endl;
    const char *fname = "input";

    // Prepare input file: a few times the size of the buffer, containing
    // strings with zeroes and escaped quotes.
    FILE *f = fopen(fname, "w");
    fwrite(content, strlen(content), 1, f);
    fclose(f);

    // Initialize lexer state: all pointers are at the end of buffer.
    Input in;
    in.file = fopen(fname, "r");
    in.cur = in.mar = in.tok = in.lim = in.buf + BUFSIZE;
    in.eof = 0;
    // Sentinel (at YYLIMIT pointer) is set to zero, which triggers YYFILL.
    in.lim[0] = 0;

    // Run the lexer.
    lex(in);

    // Cleanup: remove input file.
    fclose(in.file);
    remove(fname);
}

int main() {
    /*
        Expected output:
            Input: "keyword keyword":
                keyword(SEP)
                SEP
                keyword(EOF)
                EOF
            Input: "keyword":
                keyword(EOF)
                EOF
            Input: "keywordkeyword keyword":
                ERROR
                discard: 'e'
                discard: 'y'
                discard: 'w'
                discard: 'o'
                discard: 'r'
                discard: 'd'
                SEP
                keyword(EOF)
                EOF

        When replacing the three rules with this one rule:
            "keyword" / SEP? { std::cout << "  keyword(SEP)" << std::endl; continue; }

            (as a stand-in for this concept: `"keyword" / (SEP | $)`)

        The last input is parsed as follows (which is not the intended result):
            Input: "keywordkeyword keyword":
                keyword(SEP)
                keyword(SEP)
                SEP
                keyword(SEP)
                EOF

    */

    demo("keyword keyword");
    demo("keyword");
    demo("keywordkeyword keyword");

    return 0;
}
skvadrik commented 2 months ago

Thanks for the complete example. Unfortunately at the moment the end of input marker $ cannot be used as part of a regexp, but here's a few mitigations:

  1. You can get away with two rules instead of three (the only overhead is in readability):

    static void lex(Input &in) {
    for (;;) {
        in.tok = in.cur;
    /*!re2c
        re2c:api:style = free-form;
        re2c:define:YYCTYPE     = char;
        re2c:define:YYCURSOR    = in.cur;
        re2c:define:YYMARKER    = in.mar;
        re2c:define:YYCTXMARKER = in.ctxmar;
        re2c:define:YYLIMIT     = in.lim;
        re2c:define:YYFILL      = "fill(in) == 0";
        re2c:eof = 255;
    
        SEP = [ \t\n\r] ;
    
        "keyword" / SEP? { std::cout << "  keyword" << std::endl; continue; }
        "keyword" [^]    { std::cout << "  ERROR" << std::endl; continue; }
    
        SEP { std::cout << "  SEP" << std::endl; continue; }
        * { std::cout << "  discard: '" << in.cur[-1] << "'" << std::endl; continue; }
        $ { std::cout << "  EOF" << std::endl; return; }
    */
    }
    }
  2. Probably something you've tried before, but you can have a separate block for separator:

    static void lex(Input &in) {
    loop:
    in.tok = in.cur;
    /*!re2c
        re2c:api:style = free-form;
        re2c:define:YYCTYPE     = char;
        re2c:define:YYCURSOR    = in.cur;
        re2c:define:YYMARKER    = in.mar;
        re2c:define:YYCTXMARKER = in.ctxmar;
        re2c:define:YYLIMIT     = in.lim;
        re2c:define:YYFILL      = "fill(in) == 0";
        re2c:eof = 255;
    
        SEP = [ \t\n\r] ;
    
        "keyword" { std::cout << "  keyword" << std::endl; goto sep_or_eof; }
        * { std::cout << "  discard: '" << in.cur[-1] << "'" << std::endl; goto loop; }
        $ { std::cout << "  EOF" << std::endl; return; }
    */
    sep_or_eof:
    /*!re2c
        SEP { std::cout << "  SEP" << std::endl; goto loop; }
        * { std::cout << "  ERROR" << std::endl; goto loop; }
        $ { std::cout << "  EOF" << std::endl; return; }
    */
    }

    You can also make a small function that returns bool instead of the sep_or_eof block.

hannesd commented 1 month ago

Thanks for your insights! It didn't occur to me that the two rules are enough. That's good enough of a simplification for me to be happy. Three rules felt like too much ;)

I previously did something similar to your second suggestion, only I used a condition, not a goto.

If I'm not mistaken your second suggestion and the function idea only work for pull parsers. If the fill event returns from the lex function, you loose the state if you don't use a condition. The downside with the condition is also that you have to take additional steps to temporarily store the token so that the sep_or_eof condition can return it when it sees the correct separator. All doable, but adds a lot of overhead for just handling keyword followed by EOF correctly.

Anyway, happy with the two-rule solution! Thanks a lot!

skvadrik commented 1 month ago

You are welcome!

If I'm not mistaken your second suggestion and the function idea only work for pull parsers. If the fill event returns from the lex function, you loose the state if you don't use a condition. The downside with the condition is also that you have to take additional steps to temporarily store the token so that the sep_or_eof condition can return it when it sees the correct separator. All doable, but adds a lot of overhead for just handling keyword followed by EOF correctly.

You can have a push parser that works with YYFILL and has multiple blocks. You have probably seen the example of a push parser: when the parser is interrupted by YYFILL, it stores its state into a variable (usually a field in a state struct) and returns. Next time the parser is resumed, it looks at the state variable and jumps directly to the label where it left off. If you look at the generated code, you will see these YYFILL labels - they can be in different parser blocks, as long as they are in the same function. I think it's the same idea as your condition, only automated.