Closed scossu closed 2 years ago
Hi! You are right, it happens because of greediness. You can restructure the lexer to consume one LCHAR at a time in a loop; that way the terminating three quotes take precedence, because re2c prefers the longest match and three quotes are longer than a single LCHAR. See the example below (I assumed null-terminated strings, so I also added a rule that rejects a null in the middle of a string, but this is unrelated).
#include <assert.h>
#include <stdio.h>

int lex(const char *s) {
    const char *YYCURSOR = s, *YYMARKER;
    /*!re2c
        re2c:yyfill:enable = 0;
        re2c:define:YYCTYPE = "unsigned char";
        re2c:encoding:utf8 = 1;

        HEX = [\x30-\x39\x41-\x46];
        CHAR_BASE = "\\u" HEX{4} | "\\U" HEX{8} | '\\' | [\U0000005D-\U0010FFFF];
        CHARACTER = CHAR_BASE | [\x20-\x5B];
        ECHAR = CHARACTER | ([\\] [tnr]) | [\x00];
        LCHAR = ECHAR | ([\\] ["]) | [\t\n\r];
    */
    int count = 0;

space:
    /*!re2c
        *         { return -1; }
        [\x22]{3} { goto lchar; }
        [\n]      { goto space; }
        [\x00]    { return count; }
    */

lchar:
    /*!re2c
        *         { return -1; }
        [\x00]    { return -2; }
        [\x22]{3} { ++count; goto space; }
        LCHAR     { goto lchar; }
    */
}

int main() {
    assert(lex("\"\"\"one\"\"\"") == 1);
    assert(lex("\"\"\"one\"\"\"\n\"\"\"two\"\"\"") == 2);
    assert(lex("\"\"\"one\"\"\"\n\"\"\"two\"\"\"\n\"\"\"th\\\"ree\"\"\"") == 3);
    assert(lex("\"\"\"unterminated\"\"") == -2);
    return 0;
}
I don't think it's possible to do what you want in one lexeme. I can imagine it if re2c supported a negation operator (which it doesn't), but even then the resulting automaton would be unnecessarily large, because the counted repetition of quotes would have to be unfolded.
By the way, you can also use start conditions to write multiple lexer blocks as one.
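Start conditions let a single re2c block hold the rules of several blocks, with a condition variable selecting which rules are active. As a rough plain-C analogue (hand-written, neither re2c syntax nor re2c output), the two blocks in the example behave like one loop that dispatches on a state. The names `ST_SPACE`, `ST_LCHAR`, `is_triple_quote` and `count_long_strings` are made up for illustration, and the sketch deliberately skips the character-class validation that LCHAR performs: it only handles the closing delimiter and the `\"` escape.

```c
/* Hypothetical states; they play the role of the "space" and "lchar"
 * blocks (or conditions) in the re2c example. */
enum state { ST_SPACE, ST_LCHAR };

/* True if p points at three consecutive double quotes.
 * Short-circuit evaluation keeps this from reading past the terminating null. */
static int is_triple_quote(const char *p) {
    return p[0] == '"' && p[1] == '"' && p[2] == '"';
}

/* Returns the number of long strings in s,
 * -1 on unexpected input between strings,
 * -2 if a string is not terminated before the end of input. */
int count_long_strings(const char *s) {
    enum state st = ST_SPACE;
    const char *p = s;
    int count = 0;

    for (;;) {
        switch (st) {
        case ST_SPACE:
            if (is_triple_quote(p))  { p += 3; st = ST_LCHAR; }
            else if (*p == '\n')     { ++p; }
            else if (*p == '\0')     { return count; }
            else                     { return -1; }
            break;
        case ST_LCHAR:
            /* The closing delimiter is tested before the single-character
             * case, so it wins over consuming one quote at a time. */
            if (*p == '\0')                          { return -2; }
            else if (is_triple_quote(p))             { p += 3; ++count; st = ST_SPACE; }
            else if (p[0] == '\\' && p[1] == '"')    { p += 2; }  /* escaped quote */
            else                                     { ++p; }     /* any other character */
            break;
        }
    }
}
```

For the four inputs used in the asserts of `main`, `count_long_strings` gives the same results as `lex` (1, 2, 3 and -2).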
Thanks! I will try to implement your solution and look into start conditions, which I haven't fully grasped yet.
I have the following regular expression chain:
This is meant to match a Turtle long string which is enclosed in triple double quotes and may contain individual double quotes.
I have tried to keep the syntax as close to the spec as possible, but it's not working as expected. E.g. it matches
as one token.
This is probably because
[\x22]{3} LCHAR* [\x22]{3}
will keep eating up triple quotes as individual ones, until it finds the final triple quote from the second string. Is it possible to specify a non-greedy operator, or work around that in some way?
Thanks.
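The greedy behaviour described in the question can be reproduced outside re2c with a small C sketch. It is only a simplified model: it assumes every character between the delimiters is a valid LCHAR and ignores escapes, and the function names `greedy_match_len` and `first_close_match_len` are hypothetical. The first function mimics a longest match that runs to the last `"""` in the input; the second stops at the first closing `"""`.

```c
#include <string.h>

/* Length of the match when the body is matched greedily:
 * the match extends to the LAST triple quote in the input.
 * Returns 0 if s does not start with """ or has no closing """. */
size_t greedy_match_len(const char *s) {
    const char *last = NULL;
    const char *p;

    if (strncmp(s, "\"\"\"", 3) != 0)
        return 0;
    for (p = strstr(s + 3, "\"\"\""); p != NULL; p = strstr(p + 1, "\"\"\""))
        last = p;
    return last ? (size_t)(last - s) + 3 : 0;
}

/* Length of the match when it stops at the FIRST closing triple quote.
 * Returns 0 if s does not start with """ or has no closing """. */
size_t first_close_match_len(const char *s) {
    const char *close;

    if (strncmp(s, "\"\"\"", 3) != 0)
        return 0;
    close = strstr(s + 3, "\"\"\"");
    return close ? (size_t)(close - s) + 3 : 0;
}
```

On a 19-character input consisting of the long string `"""one"""`, a newline, and the long string `"""two"""`, the greedy variant covers all 19 characters as one token, while the first-close variant stops after the 9 characters of the first string.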