softdevteam / grmtools

Rust grammar tool libraries and binaries

Prefixes in LEX files don't seem to work in my case #467

Closed Lurgrid closed 1 month ago

Lurgrid commented 1 month ago

Hello,

I'm writing a parser for a simplified version of the ICAL file format, and I need to use prefixes to detect strings in ICAL files. It doesn't seem to work in my case, and I don't know whether I've misunderstood how to use them.

Here's my LEX
```
%x SEP
%%
\: "SEP"
BEGIN "BEGIN"
END "END"
VCALENDAR "VCALENDAR"
METHOD "METHOD"
REQUEST "REQUEST"
PRODID "PRODID"
VERSION "VERSION"
CALSCALE "CALSCALE"
GREGORIAN "GREGORIAN"
VEVENT "VEVENT"
DTSTAMP "DTSTAMP"
DTSTART "DTSTART"
DTEND "DTEND"
SUMMARY "SUMMARY"
LOCATION "LOCATION"
DESCRIPTION "DESCRIPTION"
UID "UID"
CREATED "CREATED"
LAST-MODIFIED "LAST-MODIFIED"
SEQUENCE "SEQUENCE"
[0-9]{8}T[0-9]{6}Z "DATE"
[1-9][0-9]* "NUM"
[1-9][0-9]*\.[0-9]* "FLOAT"
[^\r\n]+ "STRING"
[\r\n]+ ;
```
Here's my Yacc
```
%start Cal
%avoid_insert "BEGIN"
%avoid_insert "END"
%avoid_insert "VCALENDAR"
%avoid_insert "METHOD"
%avoid_insert "REQUEST"
%avoid_insert "PRODID"
%avoid_insert "VERSION"
%avoid_insert "CALSCALE"
%avoid_insert "GREGORIAN"
%avoid_insert "VEVENT"
%avoid_insert "DTSTAMP"
%avoid_insert "DTSTART"
%avoid_insert "DTEND"
%avoid_insert "SUMMARY"
%avoid_insert "LOCATION"
%avoid_insert "DESCRIPTION"
%avoid_insert "UID"
%avoid_insert "CREATED"
%avoid_insert "LAST-MODIFIED"
%avoid_insert "SEQUENCE"
%avoid_insert "NUM"
%avoid_insert "FLOAT"
%avoid_insert "DATE"
%avoid_insert "STRING"
%avoid_insert "SEP"
%%
Cal -> ():
    'BEGIN' 'SEP' 'VCALENDAR'
    'METHOD' 'SEP' 'REQUEST'
    'PRODID' 'SEP' 'STRING'
    'VERSION' 'SEP' 'FLOAT'
    'CALSCALE' 'SEP' 'GREGORIAN'
    LEvent
    'END' 'SEP' 'VCALENDAR' {}
    ;

LEvent -> ():
    %empty {}
    | Event LEvent {}
    ;

Event -> ():
    'BEGIN' 'SEP' 'VEVENT'
    'DTSTAMP' 'SEP' 'DATE'
    'DTSTART' 'SEP' 'DATE'
    'DTEND' 'SEP' 'DATE'
    'SUMMARY' 'SEP' 'STRING'
    'LOCATION' 'SEP' 'STRING'
    'DESCRIPTION' 'SEP' 'STRING'
    'UID' 'SEP' 'STRING'
    'CREATED' 'SEP' 'DATE'
    'LAST-MODIFIED' 'SEP' 'DATE'
    'SEQUENCE' 'SEP' 'NUM'
    'END' 'SEP' 'VEVENT' {}
    ;
%%
```
Here's my test file
```
BEGIN:VCALENDAR
METHOD:REQUEST
PRODID:toto
VERSION:2.0
CALSCALE:GREGORIAN
END:VCALENDAR
```

And here's the error I get when I test

Lexing error at line 3 column 8.

ltratt commented 1 month ago

You can use lrlex as a binary to test the lex file alone. In this case, line 3 column 8 is because (AFAICS) you don't have a lexing rule that matches toto.

Lurgrid commented 1 month ago

I don't understand why the rule <SEP>[^\r\n]+ "STRING" doesn't match the string toto: by that point the lexer has read SEP, and toto is a non-empty string of characters that contains no carriage return or line feed.

ltratt commented 1 month ago

I don't know what <SEP> is supposed to be. It may be lex syntax we don't support. I think lrlex will treat it as a literal <SEP> but I might be wrong.

Lurgrid commented 1 month ago

As far as I know, to use a state in LEX, you have to add %x STATE, where STATE is your state, at the beginning of the file. Then use BEGIN(STATE) in one of the rules. Here's an example:

%x SEP
%%
: { BEGIN(SEP); return SEP; }
PRODID { return PRODID; }
<SEP>[^\r\n]+ { BEGIN(INITIAL); return STRING; }
%%

Your book says “Lex uses a special action expression BEGIN(state) to switch to the named state. grmtools lex files use a token name prefix.”

Given that, here's how I would have rewritten the previous LEX code:

%x SEP
%%
: "SEP"
PRODID "PRODID"
<SEP>[^\r\n]+ "STRING"

In my code SEP just stands for “:”

I think I must have misunderstood something

ltratt commented 1 month ago

Ah, I'm not familiar with the state stuff. @ratmice might understand that part better than I do.

ratmice commented 1 month ago

Blurry eyed and half asleep, but looking at it I don't see where you are entering the SEP state: the rule \: "SEP" doesn't do so, it only returns a "SEP" token.

Because lrlex doesn't run code actions, it uses a notion of a state operator to begin and end states. To actually enter the SEP state you'll want a rule like : <+SEP>;

There is an example in lrpar/examples/start_states.

Edit: I really need to work on adding a section to the book for this.
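
To make the difference concrete, here is a toy Rust model of an exclusive start state. It is only a sketch of the idea and not how lrlex is implemented: the `State` enum, the `lex_line` function and the `colon_enters_sep` flag are hypothetical names made up for this illustration, and only the `:`, PRODID and STRING rules are modelled.

```rust
// Toy model of an exclusive start state. This is NOT lrlex's implementation;
// every name here is hypothetical and exists only for illustration.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Initial,
    Sep,
}

/// Lex a single line such as `PRODID:toto`.
/// Returns (token name, lexeme) pairs, or the 1-based column of the first
/// character that no rule active in the current state can match.
fn lex_line(line: &str, colon_enters_sep: bool) -> Result<Vec<(&'static str, String)>, usize> {
    let mut state = State::Initial;
    let mut toks = Vec::new();
    let mut pos = 0;
    while pos < line.len() {
        let rest = &line[pos..];
        match state {
            State::Initial => {
                if rest.starts_with(':') {
                    // Models `: <+SEP>"SEP"` when the flag is set,
                    // and the original `\: "SEP"` when it is not.
                    toks.push(("SEP", ":".to_string()));
                    pos += 1;
                    if colon_enters_sep {
                        state = State::Sep;
                    }
                } else if rest.starts_with("PRODID") {
                    toks.push(("PRODID", "PRODID".to_string()));
                    pos += "PRODID".len();
                } else {
                    // No INITIAL rule matches here: lexing error.
                    return Err(pos + 1);
                }
            }
            State::Sep => {
                // Models `<SEP>[^\r\n]+ <-SEP>"STRING"`: consume the rest of
                // the line, then leave the SEP state.
                toks.push(("STRING", rest.to_string()));
                pos = line.len();
                state = State::Initial;
            }
        }
    }
    Ok(toks)
}
```

In this model, `lex_line("PRODID:toto", false)` returns `Err(8)`, the same column as the reported lexing error, because the SEP-prefixed rule is never active. With `colon_enters_sep` set to `true` it returns the PRODID, SEP and STRING tokens.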

Lurgrid commented 1 month ago

Ohh okay, thanks a lot! So my LEX should be like this

%x SEP

%%
: <+SEP>"SEP"

BEGIN "BEGIN"
END "END"

<SEP>VCALENDAR <-SEP>"VCALENDAR"

METHOD "METHOD"
PRODID "PRODID"
VERSION "VERSION"
CALSCALE "CALSCALE"

<SEP>VEVENT <-SEP>"VEVENT"

DTSTAMP "DTSTAMP"
DTSTART "DTSTART"
DTEND "DTEND"
SUMMARY "SUMMARY"
LOCATION "LOCATION"
DESCRIPTION "DESCRIPTION"
UID "UID"
CREATED "CREATED"
LAST-MODIFIED "LAST-MODIFIED"
SEQUENCE "SEQUENCE"

<SEP>REQUEST <-SEP>"REQUEST"
<SEP>GREGORIAN <-SEP>"GREGORIAN"

<SEP>[0-9]{8}T[0-9]{6}Z <-SEP>"DATE"
<SEP>[1-9][0-9]* <-SEP>"NUM"
<SEP>[1-9][0-9]*\.[0-9]* <-SEP>"FLOAT"

<SEP>[^\r\n]+ <-SEP>"STRING"

[\r\n]+ ;

ratmice commented 1 month ago

Indeed, that looks more like what I would expect. I've used the ... <-SEP>"DATE" syntax before, so I would expect <+SEP>"SEP" to work. But looking through the parsers I've written using start states, I don't see any examples where I've entered a state and returned a token in the same rule, so definitely let us know if you run into anything unexpected.

Lurgrid commented 1 month ago

<+SEP>"SEP" works very well, but it's true that I don't need to send the SEP token to my parser.

ratmice commented 1 month ago

Cool, well, it sounds to me like this is fixed. Feel free to reopen if I'm mistaken, or open another issue if you run into anything else.